[00:13:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:31:10] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974645 (owner: TrainBranchBot)
[00:38:59] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975926
[00:39:01] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975926 (owner: TrainBranchBot)
[01:00:36] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975926 (owner: TrainBranchBot)
[01:03:49] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351683 (phaultfinder)
[01:12:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:34:44] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[01:35:08] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[01:35:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53646 and previous config saved to /var/cache/conftool/dbconfig/20231121-013514-arnaudb.json
[01:35:20] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[02:03:38] <wikibugs>	 (PS2) MPGuy2824: Disable PageTriage's extended features on beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635)
[02:04:10] <wikibugs>	 (CR) Ejegg: "Hi folks, would releng be able to deploy this some time this week? The new entry is only needed for donatewiki, but it seems there is only" [mediawiki-config] - https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: Ejegg)
[02:05:22] <wikibugs>	 (CR) MPGuy2824: Disable PageTriage's extended features on beta testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: MPGuy2824)
[02:32:46] <wikibugs>	 (CR) Jsn.sherman: [C: +1] "looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: MPGuy2824)
[02:38:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0300)
[03:08:23] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:19:43] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:20:13] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:33] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:24:17] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:25:07] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:33:22] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0400)
[04:13:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:35:54] <wikibugs>	 (PS1) KartikMistry: Enable Content/Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267)
[05:12:25] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:12:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:15:53] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:28:12] <wikibugs>	 (PS1) Marostegui: pc1014: Move to pc2 [puppet] - https://gerrit.wikimedia.org/r/975946 (https://phabricator.wikimedia.org/T351285)
[06:29:21] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Move to pc2 [puppet] - https://gerrit.wikimedia.org/r/975946 (https://phabricator.wikimedia.org/T351285) (owner: Marostegui)
[06:31:33] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:33:03] <marostegui>	 jouncebot: next
[06:33:03] <jouncebot>	 In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700)
[06:33:03] <jouncebot>	 In 0 hour(s) and 26 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700)
[06:37:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:38:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53647 and previous config saved to /var/cache/conftool/dbconfig/20231121-063827-arnaudb.json
[06:38:32] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[06:48:22] <wikibugs>	 (PS2) KartikMistry: cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118)
[06:52:24] <wikibugs>	 (CR) Stevemunene: [C: +1] Remove the oozie integration from hue [puppet] - https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) (owner: Btullis)
[06:53:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P53648 and previous config saved to /var/cache/conftool/dbconfig/20231121-065333-arnaudb.json
[06:58:52] <wikibugs>	 (CR) Santhosh: [C: +1] cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: KartikMistry)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700).
[07:05:17] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:05:43] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 33452
[07:06:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 33452
[07:08:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P53649 and previous config saved to /var/cache/conftool/dbconfig/20231121-070840-arnaudb.json
[07:09:50] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:10:17] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:12:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS bookworm
[07:22:17] <wikibugs>	 (PS1) Ayounsi: Don't alert for v6 AAAA for logstash and kafla-logging [software/netbox-extras] - https://gerrit.wikimedia.org/r/976110
[07:23:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53650 and previous config saved to /var/cache/conftool/dbconfig/20231121-072346-arnaudb.json
[07:23:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[07:23:56] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[07:24:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[07:24:04] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:24:18] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:24:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T348183)', diff saved to https://phabricator.wikimedia.org/P53651 and previous config saved to /var/cache/conftool/dbconfig/20231121-072424-arnaudb.json
[07:25:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage
[07:25:39] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1011.eqiad.wmnet with OS bullseye
[07:27:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage
[07:28:40] <wikibugs>	 (CR) Stevemunene: [C: +2] switch druid host to run data_purge job [puppet] - https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: Stevemunene)
[07:29:54] <wikibugs>	 (PS1) Marostegui: db2132: Remove 10.6 declaration [puppet] - https://gerrit.wikimedia.org/r/976135
[07:30:35] <wikibugs>	 (CR) Marostegui: [C: +2] db2132: Remove 10.6 declaration [puppet] - https://gerrit.wikimedia.org/r/976135 (owner: Marostegui)
[07:33:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:46:13] <wikibugs>	 (CR) Elukey: [C: +1] changeprop - fixes for beta values [deployment-charts] - https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) (owner: Ottomata)
[07:46:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS bookworm
[07:47:31] <wikibugs>	 (CR) Vgutierrez: [C: +2] Release 1.15.14 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: Vgutierrez)
[07:47:41] <wikibugs>	 (CR) CI reject: [V: -1] Release 1.15.14 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: Vgutierrez)
[07:47:55] <vgutierrez>	 :?
[07:48:18] <vgutierrez>	 https://integration.wikimedia.org/ci/job/fail-archived-repositories/291/console : This repository has been archived and new patches are not being accepted.
[07:48:24] * vgutierrez feeling old this morning lol
[07:50:16] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage
[07:52:53] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage
[07:54:26] <wikibugs>	 (Abandoned) Vgutierrez: Release 1.15.14 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: Vgutierrez)
[07:59:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/1  Release 1.15.14
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0800).
[08:00:05] <jouncebot>	 awight, kart_, and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:40] <kostajh>	 hi
[08:00:43] * kart_ is here
[08:02:29] <awight>	 :wave: I'm happy to deploy unless Amir1 or urbanecm are already buckling in?
[08:03:12] <kostajh>	 thanks awight 
[08:03:38] <kostajh>	 mine is beta only, and no need for verification so you can just `scap backport` it and move on from that one :)
[08:04:02] <awight>	 kk let's have some fun!
[08:04:28] <kart_>	 awight: let me know when you're done.
[08:04:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53652 and previous config saved to /var/cache/conftool/dbconfig/20231121-080455-arnaudb.json
[08:05:12] <wikibugs>	 SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/1  Release 1.15.14
[08:05:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53653 and previous config saved to /var/cache/conftool/dbconfig/20231121-080527-arnaudb.json
[08:06:50] <wikibugs>	 (PS1) Arnaudb: mariadb: repool db2178 [puppet] - https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674)
[08:07:09] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: WMDE-Fisch)
[08:07:46] <wikibugs>	 (PS2) Awight: Enable Reference Previews on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: WMDE-Fisch)
[08:07:52] <wikibugs>	 (CR) TrainBranchBot: "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: WMDE-Fisch)
[08:08:43] <wikibugs>	 (Merged) jenkins-bot: Enable Reference Previews on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: WMDE-Fisch)
[08:09:37] <logmsgbot>	 !log awight@deploy2002 Started scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]]
[08:09:42] <stashbot>	 T282999: Enable Reference Previews on all wikis using the Popups extension, on Nov 21 - https://phabricator.wikimedia.org/T282999
[08:10:27] <wikibugs>	 SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/2  Release 1.15.14
[08:11:02] <logmsgbot>	 !log awight@deploy2002 awight and wmde-fisch: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:11:29] <wikibugs>	 SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/3  Release 1.15.14
[08:13:48] <wikibugs>	 SRE, Patch-Needs-Improvement: Install private instance of gnomon for greater SRE team - https://phabricator.wikimedia.org/T246062 (Aklapper) a:CDanis→None @CDanis: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of t...
[08:14:50] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:15:42] <wikibugs>	 (CR) Marostegui: mariadb: repool db2178 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:16:07] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1011.eqiad.wmnet with OS bullseye
[08:16:57] <wikibugs>	 (CR) Arnaudb: mariadb: repool db2178 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:18:15] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: repool db2178 [puppet] - https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:18:43] <logmsgbot>	 !log awight@deploy2002 awight and wmde-fisch: Continuing with sync
[08:18:47] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: repool db2178 [puppet] - https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:18:49] <wikibugs>	 SRE, serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261 (Aklapper) a:jijiki→None @jijiki: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of...
[08:19:19] <wikibugs>	 SRE, serviceops, User-jbond, User-jijiki: Refactor memcached modules - https://phabricator.wikimedia.org/T284454 (Aklapper) a:jijiki→None @jijiki: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task...
[08:20:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53654 and previous config saved to /var/cache/conftool/dbconfig/20231121-082000-arnaudb.json
[08:20:23] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Aklapper) a:Legoktm→None @Legoktm: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am...
[08:20:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53655 and previous config saved to /var/cache/conftool/dbconfig/20231121-082032-arnaudb.json
[08:21:25] <wikibugs>	 SRE, Infrastructure Security, Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (Aklapper) a:MoritzMuehlenhoff→None @MoritzMuehlenhoff: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/...
[08:23:15] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (Aklapper) a:Paladox→None @Paladox: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cl...
[08:24:39] <logmsgbot>	 !log awight@deploy2002 Finished scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] (duration: 15m 02s)
[08:24:44] <stashbot>	 T282999: Enable Reference Previews on all wikis using the Popups extension, on Nov 21 - https://phabricator.wikimedia.org/T282999
[08:26:05] <awight>	 Just had an interesting deployment failure:
[08:26:06] <awight>	 08:10:18 1 K8s nodes failed to pull the multiversion image
[08:26:18] <awight>	 connect to host kubernetes2041.codfw.wmnet port 22: No route to host
[08:26:24] <wikibugs>	 SRE, Infrastructure Security, Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (MoritzMuehlenhoff) a:MoritzMuehlenhoff There is progress, the last change only happened on October 26. This is a long standing task with low...
[08:26:49] <awight>	 I need to try again, I guess?
[08:27:08] <logmsgbot>	 !log awight@deploy2002 Started scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]]
[08:27:20] <wikibugs>	 (PS1) Muehlenhoff: Switch ncredir to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976149 (https://phabricator.wikimedia.org/T349619)
[08:27:50] <wikibugs>	 (PS4) Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:27:58] <awight>	 Same error.
[08:28:14] <kart_>	 :/
[08:28:17] <awight>	 It's not clear to me whether a rollback will even help?
[08:28:27] <logmsgbot>	 !log awight@deploy2002 wmde-fisch and awight: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:28:33] <logmsgbot>	 !log awight@deploy2002 Sync cancelled.
[08:28:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch ncredir to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976149 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:28:59] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Joe) Open→Resolved a:Joe Given we now have switchovers at regular intervals, we can resolve this task. There is no need to do a lot of c...
[08:29:07] <wikibugs>	 (PS1) TrainBranchBot: Revert "Enable Reference Previews on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/976151
[08:29:09] <wikibugs>	 (CR) TrainBranchBot: "awight@deploy2002 created a revert of this change as Ib1779fea4fb782eb61b6c84e1b01b6d6c9a3b166" [mediawiki-config] - https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: WMDE-Fisch)
[08:29:17] <wikibugs>	 (PS5) Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:29:31] <awight>	 arnaudb: FYI, please see the k8s deployment failure ^
[08:29:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/976151 (owner: TrainBranchBot)
[08:29:57] <arnaudb>	 seen
[08:30:23] <wikibugs>	 (Merged) jenkins-bot: Revert "Enable Reference Previews on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/976151 (owner: TrainBranchBot)
[08:30:35] <logmsgbot>	 !log awight@deploy2002 Started scap: Backport for [[gerrit:976151|Revert "Enable Reference Previews on all wikis"]]
[08:31:37] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: ncredir
[08:31:51] <awight>	 kart_: kostajh: sorry, I think scap is broken at the moment.  I'm finishing rollback and then you're free to do whatever is right for your deployments.
[08:31:58] <logmsgbot>	 !log awight@deploy2002 awight and trainbranchbot: Backport for [[gerrit:976151|Revert "Enable Reference Previews on all wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:32:04] <logmsgbot>	 !log awight@deploy2002 awight and trainbranchbot: Continuing with sync
[08:32:28] <kart_>	 awight: no problem. I'll reschedule my deployment.
[08:33:19] <awight>	 kostajh: Would you like the beta-only patch to go out still?  I'm not sure whether scap is smart enough to skip k8s for that?
[08:34:10] <kostajh>	 awight: I can manage it later, thanks
[08:35:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53656 and previous config saved to /var/cache/conftool/dbconfig/20231121-083504-arnaudb.json
[08:35:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53657 and previous config saved to /var/cache/conftool/dbconfig/20231121-083537-arnaudb.json
[08:37:10] <vgutierrez>	 !log upload pybal 1.15.14 to apt.wm.o (bullseye-wikimedia) -  T348837
[08:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:15] <stashbot>	 T348837: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837
[08:37:43] <logmsgbot>	 !log awight@deploy2002 Finished scap: Backport for [[gerrit:976151|Revert "Enable Reference Previews on all wikis"]] (duration: 07m 08s)
[08:38:26] <awight>	 !log scap window cancelled due to k8s error
[08:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:09] <vgutierrez>	 !log updating pybal to 1.15.14 on lvs4010 - T351069
[08:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:14] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[08:49:26] <wikibugs>	 (CR) Ayounsi: [C: +1] Reset spine switch BGP to CR if max prefix tripped after 30 mins [homer/public] - https://gerrit.wikimedia.org/r/975799 (https://phabricator.wikimedia.org/T349116) (owner: Cathal Mooney)
[08:50:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53658 and previous config saved to /var/cache/conftool/dbconfig/20231121-085011-arnaudb.json
[08:50:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53659 and previous config saved to /var/cache/conftool/dbconfig/20231121-085042-arnaudb.json
[09:05:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53660 and previous config saved to /var/cache/conftool/dbconfig/20231121-090516-arnaudb.json
[09:05:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53661 and previous config saved to /var/cache/conftool/dbconfig/20231121-090547-arnaudb.json
[09:07:28] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/976110 (owner: Ayounsi)
[09:09:39] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[4008-4009].ulsfo.wmnet} and A:lvs (T351069)
[09:09:44] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[09:10:20] <vgutierrez>	 !log updating pybal to 1.5.14 on ulsfo - T351069
[09:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:35] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[4008-4009].ulsfo.wmnet} and A:lvs (T351069)
[09:12:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:14:23] <vgutierrez>	 !log updating pybal to 1.5.14 on eqsin - T351069
[09:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:13] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs5006.eqsin.wmnet} and A:lvs (T351069)
[09:15:18] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[09:15:44] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs5006.eqsin.wmnet} and A:lvs (T351069)
[09:15:45] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:16:12] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[5004-5005].eqsin.wmnet} and A:lvs (T351069)
[09:17:13] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[5004-5005].eqsin.wmnet} and A:lvs (T351069)
[09:17:53] <vgutierrez>	 !log updating pybal to 1.5.14 on codfw - T351069
[09:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:03] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[09:18:08] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2014.codfw.wmnet} and A:lvs (T351069)
[09:18:37] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2014.codfw.wmnet} and A:lvs (T351069)
[09:19:13] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[2011-2013].codfw.wmnet} and A:lvs (T351069)
[09:20:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53662 and previous config saved to /var/cache/conftool/dbconfig/20231121-092021-arnaudb.json
[09:20:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53663 and previous config saved to /var/cache/conftool/dbconfig/20231121-092052-arnaudb.json
[09:22:02] <_joe_>	 jouncebot: nowandnext
[09:22:02] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 37 minute(s)
[09:22:02] <jouncebot>	 In 1 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1100)
[09:24:46] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[2011-2013].codfw.wmnet} and A:lvs (T351069)
[09:24:53] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[09:26:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog: update tls_netstream_driver to use ossl [puppet] - 10https://gerrit.wikimedia.org/r/975861 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond)
[09:27:13] <vgutierrez>	 I'll continue with the lvs updates later today: codfw/ulsfo/eqsin are done, eqiad/esams/drmrs still to go
[09:27:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[09:29:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::pyrra::filesystem: new Lift Wing pilot candidate [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[09:29:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: improve istio sli recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974486 (owner: 10Elukey)
[09:35:05] <wikibugs>	 (03PS5) 10Tim Starling: Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989)
[09:35:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53664 and previous config saved to /var/cache/conftool/dbconfig/20231121-093526-arnaudb.json
[09:35:36] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling)
[09:35:42] <wikibugs>	 (03PS6) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[09:35:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53665 and previous config saved to /var/cache/conftool/dbconfig/20231121-093557-arnaudb.json
[09:41:05] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "Still not ok, but I'll keep working on it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[09:41:10] <kostajh>	 awight: just catching up now, did you report the error somewhere?
[09:47:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: switch to use the traffic percentage split endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/975816 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[09:47:49] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: switch to use the traffic percentage split endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/975816 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[09:49:35] <wikibugs>	 (03PS7) 10Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989)
[09:50:00] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "this host is currently the live host for all systems running puppet7.  Furthermore it is already running bookworm" [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[09:50:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53666 and previous config saved to /var/cache/conftool/dbconfig/20231121-095031-arnaudb.json
[09:51:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53667 and previous config saved to /var/cache/conftool/dbconfig/20231121-095102-arnaudb.json
[09:51:58] <logmsgbot>	 !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[09:52:59] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove the oozie integration from hue [puppet] - 10https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis)
[09:53:12] <logmsgbot>	 !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:00:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:00:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host gitlab-runner1002.eqiad.wmnet
[10:01:36] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch gitlab-runner1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976155 (https://phabricator.wikimedia.org/T349619)
[10:01:40] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:02:20] <wikibugs>	 (03PS1) 10Majavah: prometheus: node_puppet_agent: improve debugging abilities [puppet] - 10https://gerrit.wikimedia.org/r/976156
[10:02:32] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] acme-chief: Remove acmechief2002 passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[10:02:45] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[10:02:47] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:02:49] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] prometheus: node_puppet_agent: improve debugging abilities [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah)
[10:03:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] base: switch rsyslog tls_netstream_driver to ossl [puppet] - 10https://gerrit.wikimedia.org/r/975791 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond)
[10:03:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] centrallog: update tls_netstream_driver to use ossl [puppet] - 10https://gerrit.wikimedia.org/r/975861 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond)
[10:03:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab-runner1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976155 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:03:30] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[10:03:33] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] Switch gitlab-runner1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976155 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:03:36] <moritzm>	 jbond: I'll puppet-merge your rsyslog patches along, ok?
[10:05:01] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/601/console" [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah)
[10:05:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53669 and previous config saved to /var/cache/conftool/dbconfig/20231121-100536-arnaudb.json
[10:06:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53670 and previous config saved to /var/cache/conftool/dbconfig/20231121-100607-arnaudb.json
[10:06:23] <jbond>	 moritzm: yes please
[10:07:33] <moritzm>	 ack, done
[10:08:36] <jbond>	 cheers
[10:10:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host gitlab-runner1002.eqiad.wmnet
[10:10:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:11:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:11:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Puppet_Internal_CA.pem: rename to Puppet5_Internal_CA.pem [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/975869 (https://phabricator.wikimedia.org/T351653) (owner: 10Jbond)
[10:15:13] <wikibugs>	 (03PS3) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033)
[10:16:03] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry)
[10:17:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah)
[10:18:15] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] prometheus: node_puppet_agent: improve debugging abilities (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah)
[10:18:29] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet
[10:18:51] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10ops-monitoring-bot) Host rebooted by jelto@cumin1001 with reason: None
[10:19:24] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan)
[10:21:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:22:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host gerrit2002.wikimedia.org
[10:23:28] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617)
[10:25:02] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet
[10:26:27] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo)
[10:28:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Use a native package [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/976159
[10:28:37] <awight>	 kostajh: No, I haven't done anything persistent--there's a "!-log" message and I pinged arnaudb, who is on clinic duty
[10:29:27] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Use a native package [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/976159 (owner: 10Muehlenhoff)
[10:29:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/976159 (owner: 10Muehlenhoff)
[10:29:40] <arnaudb>	 let's maybe summon _joe_, awight; I think we need somebody with a k8s hat on :)
[10:29:58] <_joe_>	 arnaudb: what's going on?
[10:30:13] <arnaudb>	 it seems that there are some issues with a deployment issued by awight
[10:30:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch gerrit2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976160 (https://phabricator.wikimedia.org/T349619)
[10:30:53] <awight>	 _joe_: Hi!  yes the error is in the IRC history, "scap backport" was unable to connect to a k8s host and so I had to roll back the deployment to avoid being in an inconsistent state.
[10:31:20] <_joe_>	 awight: at what time, sorry?
[10:31:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch gerrit2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976160 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:31:28] <arnaudb>	 maybe we should create a phab task to handle this properly?
[10:31:46] <awight>	 _joe_: root cause seems to be "kubernetes2041.codfw.wmnet port 22: No route to host"
[10:32:16] <awight>	 (behind which I'm sure there's another deeper cause :-) )
[10:32:42] <_joe_>	 awight: ok, fwiw you shouldn't need to rollback if that happens
[10:33:11] <_joe_>	 that k8s node is down apparently
[10:33:16] <_joe_>	 but that's a non-fatal issue
[10:33:43] <awight>	 _joe_: But if k8s hosts and legacy servers are inconsistent... how can I tell that the cluster is consistent, for example if that host suddenly starts up again or if it was just a network glitch...
[10:33:55] <_joe_>	 awight: that's not an issue on k8s
[10:33:57] <awight>	 Perhaps dead servers can be depooled
[10:34:08] <_joe_>	 the host is down for k8s too, so no jobs will be scheduled on it
[10:34:20] <_joe_>	 when it comes up again, if a pod is scheduled, it will pull the correct image
[10:34:32] <_joe_>	 and yes ofc, we just didn't notice
[10:34:56] <awight>	 So should the scap logic ignore such an error?
[10:35:11] <jbond>	 !log upload new wmf-certificates packages
[10:35:11] <_joe_>	 it should report it but not suggest to rollback, yes
[10:35:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host gerrit2002.wikimedia.org
[10:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:31] <_joe_>	 awight: so if it automatically rolled back, please open a task about that
[10:36:33] <awight>	 _joe_: interesting!  So maybe at "warn" or "info" level.  But this is very helpful information, thanks.  We'll try the deployment again today.
[10:36:49] <_joe_>	 awight: we're also pulling the node back up now ofc
[10:36:51] <awight>	 _joe_: no, this was my noob reaction to seeing an error though
[10:37:08] <_joe_>	 but please open a task, it should be clear the error is not fatal
[10:38:10] <awight>	 Want to bless us sneaking a small deployment window now?  Or is it better to wait for the official window...
[10:40:11] <_joe_>	 jayme: when you've dealt with the dead node, can you give the go-ahead to awight ?
[10:40:33] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling)
[10:40:42] <jayme>	 I've cordoned it for now, so it should be out of the way
[10:41:18] <wikibugs>	 (03Merged) 10jenkins-bot: Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling)
[10:42:01] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah)
[10:42:18] <taavi>	 i think the list scap uses is pulled from puppetdb directly, so you will still see the same warning
[10:43:27] <jayme>	 ah, I probably did not have full context. So scap is failing because it can't connect to that node...hmpf
[10:43:41] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah)
[10:43:47] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617)
[10:44:41] <awight>	 _joe_: Here's a skeletal bug report--it's missing the final text, I'm digging around to see where it landed.  https://phabricator.wikimedia.org/T351701
[10:44:54] <wikibugs>	 (03CR) 10Jcrespo: "This is a bit cleaner, stop hardcoding the statistics (mysql) file, which was what caused the last issue to start with." [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo)
[10:46:06] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161
[10:48:02] <wikibugs>	 (03Merged) 10jenkins-bot: wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah)
[10:49:19] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "This doesn't work: https://puppet-compiler.wmflabs.org/output/976158/602/" [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo)
[10:50:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: gitlab_runner
[10:51:37] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703)
[10:51:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch gitlab_runner to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976162 (https://phabricator.wikimedia.org/T349619)
[10:55:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab_runner to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976162 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:56:30] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617)
[10:56:46] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, dedicated hiera entry for gitlab-runner1002 in hieradata/hosts/gitlab-runner1002.yaml can be removed." [puppet] - 10https://gerrit.wikimedia.org/r/976162 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:57:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo)
[10:57:56] <wikibugs>	 (03PS4) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1100)
[11:00:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: gitlab_runner
[11:01:35] <wikibugs>	 (03CR) 10Jcrespo: "Looking good: https://puppet-compiler.wmflabs.org/output/976158/603/" [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo)
[11:01:56] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo)
[11:02:10] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:02:16] <wikibugs>	 (03PS1) 10Tim Starling: Revert "Enable LoginNotify seen subnets table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976176
[11:02:32] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Revert "Enable LoginNotify seen subnets table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976176 (owner: 10Tim Starling)
[11:02:43] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:03:18] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable LoginNotify seen subnets table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976176 (owner: 10Tim Starling)
[11:04:56] <wikibugs>	 (03PS1) 10Tim Starling: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989)
[11:05:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mwlog2002.codfw.wmnet
[11:06:08] <wikibugs>	 (03PS1) 10Volans: sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163
[11:06:10] <wikibugs>	 (03PS1) 10Volans: sre.I/F cookbooks: use get_subset() [cookbooks] - 10https://gerrit.wikimedia.org/r/976164
[11:06:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mwlog2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976165 (https://phabricator.wikimedia.org/T349619)
[11:06:31] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:07:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mwlog2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976165 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:09:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) @jbond thanks, that CR has fixed the sad services (and the openssl runes now work too).
[11:09:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10jbond) 05Open→03Resolved I have rolled out a new wmf-certificates package which i believe has fixed this error.  all swift se...
[11:10:50] <jayme>	 awight: taavi: _joe_: network link is still down on 2041 - this probably needs dcops interaction
[11:11:02] <awight>	 jayme: No rush, thanks for the update!
[11:13:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:13:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mwlog2002.codfw.wmnet
[11:18:20] <wikibugs>	 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10JMeybohm) Hey DCOps, this looks suspiciously like a cable might have been pulled. Could you please take a look?
[11:18:42] <wikibugs>	 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10JMeybohm) a:05JMeybohm→03Papaul
[11:19:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Cleanup now obsolete Hiera entry, applied per role [puppet] - 10https://gerrit.wikimedia.org/r/976188 (https://phabricator.wikimedia.org/T349619)
[11:20:10] <wikibugs>	 (03CR) 10Volans: "For info about get_subset() see https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteHosts.get_subs" [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans)
[11:20:34] <Emperor>	 !log depool ms-fe2014 to reimage with new envoy TLS setup T317616
[11:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:39] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[11:20:49] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: migrate one node to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:21:11] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:21:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Cleanup now obsolete Hiera entry, applied per role [puppet] - 10https://gerrit.wikimedia.org/r/976188 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:21:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:21:42] <wikibugs>	 (03CR) 10Volans: "For info about get_subset() see https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteHosts.get_subs" [cookbooks] - 10https://gerrit.wikimedia.org/r/976164 (owner: 10Volans)
[11:22:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2014.codfw.wmnet with OS bullseye
[11:22:34] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye
[11:23:21] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:24:10] <_joe_>	 jayme: we can just depool it and set to status=inactive for now
[11:24:16] <_joe_>	 sorry I was in a call
[11:28:17] <jayme>	 _joe_: set/pooled=inactive you mean?
[11:29:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: centralserver: remove tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623)
[11:30:57] <wikibugs>	 (03PS10) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690)
[11:31:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/604/con" [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi)
[11:31:10] <wikibugs>	 (03PS1) 10Hnowlan: service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796)
[11:32:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "Will manually reset-failed the timer post-merge" [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi)
[11:32:37] <wikibugs>	 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10JMeybohm)
[11:34:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/976164 (owner: 10Volans)
[11:34:49] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:35:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host titan2002.codfw.wmnet
[11:35:28] <wikibugs>	 (03PS2) 10Btullis: spark: add support for spark-history on the spark image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896363 (https://phabricator.wikimedia.org/T330176) (owner: 10Nicolas Fraison)
[11:35:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi)
[11:36:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] sanitarium_multiinstance: over private_wiki and private_tables vars to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[11:37:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch titan2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976193 (https://phabricator.wikimedia.org/T349619)
[11:37:24] <jbond>	 moritzm: i merged your cleanup change
[11:37:33] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes2041.codfw.wmnet with reason: NIC 1 Port 1 network link is down
[11:37:58] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes2041.codfw.wmnet with reason: NIC 1 Port 1 network link is down
[11:38:05] <wikibugs>	 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=771e4f70-9348-49e4-9f8a-1228c0c3d3dc) set by jayme@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their ser...
[11:39:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch titan2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976193 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:39:37] <moritzm>	 jbond: ack, thx
[11:42:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host titan2002.codfw.wmnet
[11:44:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The probe is correct, the problem is we don't have the realserver IP on the backends :)" [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:47:03] <_joe_>	 jayme: sorry I missed your message, yes
[11:47:32] <jayme>	 _joe_: np. was more meant as a clarification :)
[11:48:49] <wikibugs>	 (03PS13) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008)
[11:48:51] <wikibugs>	 (03PS8) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[11:48:53] <wikibugs>	 (03PS13) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[11:48:55] <wikibugs>	 (03PS8) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[11:48:57] <wikibugs>	 (03PS8) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[11:48:59] <wikibugs>	 (03PS13) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[11:49:01] <wikibugs>	 (03PS13) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[11:49:02] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:51:25] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::cinder_backups
[11:52:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[11:52:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[11:52:52] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::cinder_backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976197 (https://phabricator.wikimedia.org/T349619)
[11:53:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2014.codfw.wmnet with reason: host reimage
[11:53:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[11:53:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:54:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[11:54:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[11:55:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::cinder_backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976197 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[11:56:05] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2014.codfw.wmnet with reason: host reimage
[11:56:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[11:59:14] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::cinder_backups
[12:00:15] <wikibugs>	 (03PS1) 10Jbond: docker::reports: change ownership of base rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/976198
[12:00:21] <wikibugs>	 (03PS2) 10Hnowlan: service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796)
[12:01:40] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::control
[12:03:11] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[12:03:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[12:03:28] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[12:03:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:06:08] <wikibugs>	 (03CR) 10Hnowlan: service, kubernetes: mw-jobrunner fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[12:09:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2014.codfw.wmnet with OS bullseye
[12:09:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye completed: - ms-fe2014 (**PASS**)   - Downtimed on Ici...
[12:10:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] mediawiki::packages: Clean up absented packages [puppet] - 10https://gerrit.wikimedia.org/r/975451 (owner: 10Muehlenhoff)
[12:11:55] <wikibugs>	 (03PS9) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[12:11:57] <wikibugs>	 (03PS14) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[12:11:59] <wikibugs>	 (03PS9) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[12:12:01] <wikibugs>	 (03PS9) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[12:12:03] <wikibugs>	 (03PS14) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[12:12:05] <wikibugs>	 (03PS14) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[12:12:53] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976201 (https://phabricator.wikimedia.org/T349619)
[12:13:27] <wikibugs>	 (03PS2) 10Jbond: wmcs::openstack::eqiad1::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976201 (https://phabricator.wikimedia.org/T349619)
[12:14:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[12:14:39] <awight>	 jayme: _joe_: Shall I go ahead and deploy nonetheless, or still better to wait for the official depooling?
[12:14:55] <wikibugs>	 (03PS1) 10Muehlenhoff: mediawiki::packages: Drop python-pil [puppet] - 10https://gerrit.wikimedia.org/r/976202 (https://phabricator.wikimedia.org/T268468)
[12:14:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[12:14:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[12:15:06] <jayme>	 awight: host is depooled and should no longer be used by scap, you may go ahead
[12:15:18] <wikibugs>	 (03PS10) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[12:15:20] <wikibugs>	 (03PS15) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[12:15:22] <wikibugs>	 (03PS10) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[12:15:24] <wikibugs>	 (03PS10) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[12:15:24] <_joe_>	 even if it still is in the distribution list, just ignore the error
[12:15:26] <wikibugs>	 (03PS15) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[12:15:28] <wikibugs>	 (03PS15) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[12:17:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976201 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:18:32] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage
[12:21:29] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage
[12:22:38] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::control
[12:24:52] <wikibugs>	 (03PS11) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[12:24:54] <wikibugs>	 (03PS16) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[12:24:56] <wikibugs>	 (03PS11) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[12:24:57] <awight>	 jayme: ty!
[12:24:58] <wikibugs>	 (03PS11) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[12:25:00] <wikibugs>	 (03PS16) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[12:25:02] <wikibugs>	 (03PS16) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[12:25:41] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::net
[12:26:07] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: add article-descriptions isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123)
[12:26:58] <wikibugs>	 (03PS12) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[12:27:00] <wikibugs>	 (03PS17) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[12:27:02] <wikibugs>	 (03PS12) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[12:27:04] <wikibugs>	 (03PS12) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[12:27:06] <wikibugs>	 (03PS17) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[12:27:06] <logmsgbot>	 !log awight@deploy2002 Started scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]]
[12:27:08] <wikibugs>	 (03PS17) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[12:27:11] <stashbot>	 T282999: Enable Reference Previews on all wikis using the Popups extension, on Nov 21 - https://phabricator.wikimedia.org/T282999
[12:28:30] <logmsgbot>	 !log awight@deploy2002 wmde-fisch and awight: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:29:09] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::net: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976203 (https://phabricator.wikimedia.org/T349619)
[12:29:19] <awight>	 WMDE-Fisch: our Reference Previews config is on the test servers
[12:29:35] <WMDE-Fisch>	 awight: looking at it
[12:31:05] <logmsgbot>	 !log awight@deploy2002 Sync cancelled.
[12:31:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::net: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976203 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:32:43] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[12:32:58] <wikibugs>	 (03PS1) 10Awight: Revert "Revert "Enable Reference Previews on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182
[12:33:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 (owner: 10Awight)
[12:33:19] <wikibugs>	 (03CR) 10WMDE-Fisch: [C: 03+1] Revert "Revert "Enable Reference Previews on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 (owner: 10Awight)
[12:34:01] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Enable Reference Previews on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 (owner: 10Awight)
[12:34:15] <logmsgbot>	 !log awight@deploy2002 Started scap: Backport for [[gerrit:976182|Revert "Revert "Enable Reference Previews on all wikis""]]
[12:35:35] <logmsgbot>	 !log awight@deploy2002 awight: Backport for [[gerrit:976182|Revert "Revert "Enable Reference Previews on all wikis""]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:35:40] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::net
[12:36:45] <logmsgbot>	 !log awight@deploy2002 awight: Continuing with sync
[12:38:31] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye
[12:38:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed: - cp1115 (**PASS**)   - Remo...
[12:39:24] <wikibugs>	 (03PS13) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[12:39:26] <wikibugs>	 (03PS18) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[12:39:28] <wikibugs>	 (03PS13) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[12:39:30] <wikibugs>	 (03PS13) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[12:39:32] <wikibugs>	 (03PS18) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[12:39:34] <wikibugs>	 (03PS18) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[12:40:34] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::rabbitmq
[12:41:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/608/console" [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[12:42:18] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::rabbitmq: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976205 (https://phabricator.wikimedia.org/T349619)
[12:42:28] <logmsgbot>	 !log awight@deploy2002 Finished scap: Backport for [[gerrit:976182|Revert "Revert "Enable Reference Previews on all wikis""]] (duration: 08m 12s)
[12:42:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::rabbitmq: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976205 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:43:26] <awight>	 jayme: Deployment looks successful, but FWIW the same errors appeared.
[12:44:50] <jayme>	 ack, thanks
[12:45:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/609/con" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[12:49:41] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::rabbitmq
[12:52:20] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::services
[12:52:35] <wikibugs>	 (03PS14) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[12:52:37] <wikibugs>	 (03PS19) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[12:52:39] <wikibugs>	 (03PS14) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[12:52:41] <wikibugs>	 (03PS14) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[12:52:43] <wikibugs>	 (03PS19) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[12:52:45] <wikibugs>	 (03PS19) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[12:52:47] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130)
[12:54:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[12:54:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/610/console" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[12:54:34] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::services: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976207 (https://phabricator.wikimedia.org/T349619)
[12:56:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker11 - jclark@cumin1001"
[12:57:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker11 - jclark@cumin1001"
[12:57:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:59:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::services: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976207 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:59:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] centralserver: remove tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1300)
[13:00:32] <godog>	 jbond: feel free to merge my patch too if it came up
[13:01:37] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add article-descriptions isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira)
[13:05:58] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::services
[13:06:30] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet
[13:13:21] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet
[13:14:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::virt
[13:14:28] <wikibugs>	 (03PS1) 10Ssingh: constants: update ns2 IP address [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219)
[13:14:50] <jinxer-wm>	 (ProbeDown) firing: (6) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:15:17] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: bump to v20231121104610-0b4bfdd [deployment-charts] - 10https://gerrit.wikimedia.org/r/976211
[13:15:57] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for spotting it" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:17:08] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::virt: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976214 (https://phabricator.wikimedia.org/T349619)
[13:17:13] <icinga-wm>	 RECOVERY - snapshot of s6 in eqiad on backupmon1001 is OK: Last snapshot for s6 at eqiad (db1225) taken on 2023-11-21 12:20:50 (541 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[13:17:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::virt: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976214 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[13:18:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] constants: update ns2 IP address [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:20:10] <wikibugs>	 (03CR) 10Ssingh: "CI error is: `wmflib/requests.py:6: error: Library stubs not installed for "requests.packages.urllib3.util.retry"  [import-untyped]`" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:20:41] <icinga-wm>	 RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2097) taken on 2023-11-21 12:27:37 (578 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[13:20:41] <wikibugs>	 (03PS1) 10Hnowlan: mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796)
[13:21:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "don't worry about CI failing, I'll look into it, types-requests is already part of the dependencies and is failing only on py38/39" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:22:07] <wikibugs>	 (03CR) 10Ssingh: constants: update ns2 IP address (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:22:13] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::virt
[13:22:24] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review Tobias :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira)
[13:22:51] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::virt_ceph
[13:23:14] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: add article-descriptions isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira)
[13:23:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond)
[13:23:46] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] "merging to bypass ci failure, I'll look into it later" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:23:48] <wikibugs>	 10SRE, 10Observability-Logging: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) @fgiunchedi Everything is using openssl now, do you still see the errors?
[13:23:58] <wikibugs>	 10SRE, 10Observability-Logging: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond)
[13:24:04] <wikibugs>	 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10jbond) 05Open→03Resolved a:03jbond All systems have now been migrated to ossl
[13:24:12] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[13:24:43] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::eqiad1::virt_ceph: migrat to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976217 (https://phabricator.wikimedia.org/T349619)
[13:25:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::virt_ceph: migrat to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976217 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[13:27:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling)
[13:28:34] <wikibugs>	 (03PS15) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008)
[13:28:36] <wikibugs>	 (03PS20) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[13:28:38] <wikibugs>	 (03PS15) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008)
[13:28:40] <wikibugs>	 (03PS15) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008)
[13:28:42] <wikibugs>	 (03PS20) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[13:28:44] <wikibugs>	 (03PS20) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[13:30:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED
[13:30:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED
[13:30:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED
[13:30:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED
[13:30:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[13:30:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED
[13:30:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED
[13:31:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[13:32:22] <Emperor>	 !log repool ms-fe2014 with new envoy TLS setup T317616
[13:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:27] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[13:32:43] <wikibugs>	 (03CR) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[13:32:52] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability: ossl rsyslog post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi)
[13:33:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED
[13:34:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/613/con" [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[13:35:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[13:35:40] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi)
[13:36:02] <wikibugs>	 (03CR) 10Elukey: [C: 04-2] "Going to wait a little on this one, the current version seems to work fine, I need to understand why :D" [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[13:38:46] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::virt_ceph
[13:38:47] <wikibugs>	 (03PS3) 10Elukey: profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390)
[13:38:57] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] cirrus-streaming-updater: bump to v20231121104610-0b4bfdd [deployment-charts] - 10https://gerrit.wikimedia.org/r/976211 (owner: 10DCausse)
[13:39:00] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Jelto) @hashar `gerrit2002` was migrated to puppet7. I restarted gerrit and apache processes and the instance looks fine so far. Could you double check `ger...
[13:39:16] <wikibugs>	 (03CR) 10Elukey: "The change is the same, I just re-added the Pyrra recording rule since it seems to work." [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[13:39:59] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump to v20231121104610-0b4bfdd [deployment-charts] - 10https://gerrit.wikimedia.org/r/976211 (owner: 10DCausse)
[13:41:38] <wikibugs>	 (03CR) 10Elukey: ml-services: add article-descriptions isvc to experimental namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira)
[13:43:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: switch 15% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976218 (https://phabricator.wikimedia.org/T350846)
[13:43:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: 20% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976219 (https://phabricator.wikimedia.org/T350846)
[13:43:15] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: 30% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976220 (https://phabricator.wikimedia.org/T350846)
[13:43:17] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: 45% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976221 (https://phabricator.wikimedia.org/T350846)
[13:43:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: 60% to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/976222 (https://phabricator.wikimedia.org/T350846)
[13:43:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: 75% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976223 (https://phabricator.wikimedia.org/T350846)
[13:43:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mobileapps: 90% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976224 (https://phabricator.wikimedia.org/T350846)
[13:43:56] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) On the rsyslog side these are the errors:  ` Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0] Nov 21 13:42...
[13:45:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm)
[13:46:07] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I didn't check the IPs tbh 😊" [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm)
[13:49:49] <godog>	 !log test upgrade rsyslog on centrallog2002 - T351710
[13:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:56] <stashbot>	 T351710: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710
[13:51:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T348183)', diff saved to https://phabricator.wikimedia.org/P53672 and previous config saved to /var/cache/conftool/dbconfig/20231121-135059-arnaudb.json
[13:51:28] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:51:38] <vgutierrez>	 jouncebot: nowandnext
[13:51:38] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1300)
[13:51:38] <jouncebot>	 In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1400)
[13:52:27] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://wikitech.wikimedia.org/wiki/Logs
[13:52:37] <godog>	 that's me ^
[13:56:59] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:57:44] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:58:08] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1400).
[14:00:04] <jouncebot>	 xSavitar and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:06] <Lucas_WMDE>	 o/
[14:00:15] <Lucas_WMDE>	 xSavitar: are you going to self-service?
[14:00:34] <xSavitar>	 o/
[14:00:44] <Lucas_WMDE>	 otherwise I can deploy as well
[14:01:00] <xSavitar>	 Lucas_WMDE, you're here already :)
[14:01:10] <xSavitar>	 You can go ahead
[14:01:41] <Lucas_WMDE>	 I’m not sure what you mean ^^ go ahead with your change or with my maintenance script?
[14:01:57] <Lucas_WMDE>	 I would wait with my script until yours is done, I have no idea how long it’ll take
[14:02:08] <xSavitar>	 I mean my config patch
[14:02:18] <xSavitar>	 Mine doesn't need testing
[14:02:21] <Lucas_WMDE>	 ok
[14:02:28] <wikibugs>	 (03PS11) 10Lucas Werkmeister (WMDE): mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[14:02:34] <xSavitar>	 But once it's live, I'll signal ServiceOps to make use of it, it's their thing.
[14:02:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[14:02:49] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://wikitech.wikimedia.org/wiki/Logs
[14:03:01] <Lucas_WMDE>	 regarding the change in PS9..10 – this means the default server won’t be used if the env var is set to empty
[14:03:08] <xSavitar>	 Yes
[14:03:25] <Lucas_WMDE>	 (I had first looked at PS9 where it was different, which got me thinking a bit about whether eliminating that temporary variable was worth it)
[14:03:26] <Lucas_WMDE>	 ok ^^
[14:03:33] <wikibugs>	 (03CR) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[14:04:11] <wikibugs>	 (03Merged) 10jenkins-bot: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[14:04:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]]
[14:04:37] <stashbot>	 T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690
[14:05:07] <xSavitar>	 Lucas_WMDE, yeah the one-liner seems nicer and straight to the point. So right now, we'll keep using the default until the custom env variable is set.
[14:05:24] <xSavitar>	 and thank you very much for deploying
[14:05:26] <Lucas_WMDE>	 kubernetes2041 is still down it seems
[14:05:30] <xSavitar>	 effie ^^
[14:05:35] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] changeprop - fixes for beta values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata)
[14:05:37] <Lucas_WMDE>	 (known, T351704)
[14:05:38] <stashbot>	 T351704: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704
[14:05:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and d3r1ck01: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:05:46] <xSavitar>	 Roger that!
[14:05:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and d3r1ck01: Continuing with sync
[14:05:56] <wikibugs>	 10SRE, 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T351663 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cable on for psu1 on both ends. alert cleared.
[14:06:01] <Lucas_WMDE>	 just mentioning it since it showed up in the scap output
[14:06:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED
[14:06:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P53673 and previous config saved to /var/cache/conftool/dbconfig/20231121-140606-arnaudb.json
[14:06:08] <xSavitar>	 Okay!
[14:06:11] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2028-11-11 12:37:08 +0000 (expires in 1816 days) https://wikitech.wikimedia.org/wiki/Logs
[14:06:30] <xSavitar>	 Lucas_WMDE read a bit about the script you want to run and all the magic happening there. Too big for my tiny brain :D
[14:06:38] <vgutierrez>	 !log updating pybal to 1.5.14 on drmrs - T351069
[14:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:44] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[14:07:17] <godog>	 !log revert rsyslog upgrade on centrallog2002 - T351710
[14:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:30] <stashbot>	 T351710: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710
[14:07:40] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs6003.drmrs.wmnet} and A:lvs (T351069)
[14:07:42] <Lucas_WMDE>	 :D
[14:07:58] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs6003.drmrs.wmnet} and A:lvs (T351069)
[14:08:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED
[14:08:19] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[6001-6002].drmrs.wmnet} and A:lvs (T351069)
[14:08:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED
[14:09:41] <icinga-wm>	 RECOVERY - Host kubernetes2041 is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms
[14:10:51] <wikibugs>	 (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616)
[14:10:54] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[6001-6002].drmrs.wmnet} and A:lvs (T351069)
[14:11:04] <xSavitar>	 Lucas_WMDE, thanks for deploying
[14:11:12] <Lucas_WMDE>	 np :)
[14:11:15] <Lucas_WMDE>	 it’s almost done
[14:11:23] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 177, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:11:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] (duration: 07m 09s)
[14:11:38] <stashbot>	 T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690
[14:11:58] <Lucas_WMDE>	 alright, I’ll do the maintenance script then
[14:12:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans)
[14:12:09] <Lucas_WMDE>	 uhm
[14:12:17] <Lucas_WMDE>	 although https://orchestrator.wikimedia.org/web/cluster/alias/s8 says two servers aren’t replicating
[14:12:47] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[14:13:41] <wikibugs>	 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10Jhancock.wm) a:05Papaul→03Jhancock.wm the network cable was still attached but loose. I reseated it and pulled on it to make sure it wouldn't come loose again. it did...
[14:13:55] <Lucas_WMDE>	 any DBAs around? (maybe Amir1 or jynus?) Orchestrator says two servers in s8 aren’t replicating (ca. 3h lag); known issue?
[14:14:07] <Lucas_WMDE>	 (I’m guessing I shouldn’t run my maintenance script on s8 while that’s unclear)
[14:14:34] <wikibugs>	 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351683 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. should go away when the row is converted to spine/leaf
[14:15:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Don't alert for v6 AAAA for logstash and kafla-logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi)
[14:18:03] <Lucas_WMDE>	 the two servers in question aren’t shown on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-site=eqiad&var-section=s8&from=now-3h&to=now&refresh=1m at all, no idea what that means…
[14:18:04] <wikibugs>	 (03CR) 10Elukey: Don't alert for v6 AAAA for logstash and kafla-logging (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi)
[14:18:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED
[14:18:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED
[14:18:25] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED
[14:18:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED
[14:18:38] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED
[14:19:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED
[14:19:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED
[14:20:05] <wikibugs>	 (03PS2) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616)
[14:20:56] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[14:21:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P53674 and previous config saved to /var/cache/conftool/dbconfig/20231121-142112-arnaudb.json
[14:22:54] <xSavitar>	 Lucas_WMDE, not sure but could it be that the servers in question are down? I don't see any task on phab related to no replication happening there.
[14:23:23] <xSavitar>	 A DBA would have the answer
[14:23:50] <Lucas_WMDE>	 if I read Orchestrator correctly, they’re up (“last seen 3s ago”), but not replicating for whatever reason
[14:23:55] <Lucas_WMDE>	 I don’t see any recent SAL entries for them
[14:24:20] <Lucas_WMDE>	 aha, one is showing up on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-site=eqiad&var-section=s8&from=now-3h&to=now&refresh=1m now at least
[14:24:28] <Lucas_WMDE>	 (db1171)
[14:24:51] <Lucas_WMDE>	 ok, orchestrator now says about db1171 that it’s replicating but has lag
[14:24:54] <Lucas_WMDE>	 so I guess it’s catching up?
[14:25:12] <xSavitar>	 Probably catching up. 
[14:25:14] <Lucas_WMDE>	 that still leaves one other server not replicating for unknown-to-me reasons
[14:25:22] <xSavitar>	 I see db1171 on grafana too
[14:25:39] <Lucas_WMDE>	 (db2098 is the other server I’m looking at; I guess posting the number here can’t hurt)
[14:25:54] <xSavitar>	 :)
[14:25:58] <Lucas_WMDE>	 yeah db1171 is going down on grafana pretty quickly now
[14:26:01] <xSavitar>	 That's what I was looking at too
[14:26:42] <xSavitar>	 Maybe once that catches up, then the other will pick up from there :)
[14:27:35] <xSavitar>	 orchestrator says 20mins 7s (as at now)
[14:27:39] <xSavitar>	 That was quick
[14:27:47] <Lucas_WMDE>	 1m48s for me
[14:27:54] <xSavitar>	 done
[14:27:56] <Lucas_WMDE>	 that’s remarkably quick indeed
[14:28:50] <xSavitar>	 The other is catching up to
[14:28:52] <xSavitar>	 *too
[14:29:15] <Lucas_WMDE>	 oh nice
[14:29:38] <Lucas_WMDE>	 though I still don’t see it in grafana
[14:29:41] <xSavitar>	 2hrs 39mins from my view
[14:29:42] <Lucas_WMDE>	 maybe that’s just a bit behind
[14:29:55] <xSavitar>	 Yeah, maybe grafana inherited the lag :D
[14:29:57] <Lucas_WMDE>	 the replication lag in orchestrator also seemed to be ahead of grafana by a minute or so
[14:30:12] <xSavitar>	 Makes sense
[14:30:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED
[14:30:39] <taavi>	 Lucas_WMDE: is there a specific reason why you're looking at the dashboard?
[14:31:41] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1015.eqiad.wmnet
[14:31:41] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1015.eqiad.wmnet
[14:31:51] <Lucas_WMDE>	 taavi: I want to run a maintenance script on s8 that’ll do a bunch of database writes
[14:31:59] <Lucas_WMDE>	 so I want s8 to be healthy first
[14:32:00] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1015.eqiad.wmnet with OS bullseye
[14:32:34] <Lucas_WMDE>	 and Grafana initially seemed like the more useful resource, although at the moment it seems like Orchestrator might be better
[14:32:51] <vgutierrez>	 !log updating pybal to 1.5.14 on esams - T351069
[14:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:57] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[14:33:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: centralserver: remove icinga tls listener check [puppet] - 10https://gerrit.wikimedia.org/r/976234 (https://phabricator.wikimedia.org/T351710)
[14:33:22] <Lucas_WMDE>	 oh 🤦
[14:33:28] <Lucas_WMDE>	 it would of course help if I selected codfw in grafana
[14:33:37] <Lucas_WMDE>	 given that db2098 is a codfw server
[14:33:37] <xSavitar>	 Lucas_WMDE, I think the other is done
[14:33:45] <Lucas_WMDE>	 it shows up at https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-site=codfw&var-section=s8&from=now-3h&to=now&refresh=1m just fine
[14:33:50] <xSavitar>	 Everything looks healthy from orchestrator's pov
[14:33:58] <taavi>	 and db2098 is a backup source so it occasionally being replagged is normal
[14:34:22] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs3010.esams.wmnet} and A:lvs (T351069)
[14:34:32] <Lucas_WMDE>	 is there any way I could’ve known that?
[14:34:32] <xSavitar>	 So it seems first server is in eqiad and the second is in codfw? wow :D
[14:34:38] <taavi>	 plus dbs can get schema changes done which causes lag, etc, I wouldn't worry about a couple of hosts on an unfamiliar dashboard
[14:34:39] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs3010.esams.wmnet} and A:lvs (T351069)
[14:35:10] <xSavitar>	 taavi :D
[14:35:15] <Lucas_WMDE>	 taavi: I’d expect to see schema changes in the SAL though
[14:35:29] <Lucas_WMDE>	 like https://sal.toolforge.org/log/DbiVzosBGiVuUzOdZubS last week
[14:35:43] <Lucas_WMDE>	 anyway… it sounds like I can run my maintenance script now
[14:35:50] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[3008-3009].esams.wmnet} and A:lvs (T351069)
[14:35:56] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[14:36:19] <Lucas_WMDE>	 !log START [in tmux] lucaswerkmeister-wmde@mwmaint2002:~$ mwscript Wikibase.Lexeme.Maintenance.FixPagePropsSortkey wikidatawiki --batch-size=1000 # T350224
[14:36:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T348183)', diff saved to https://phabricator.wikimedia.org/P53675 and previous config saved to /var/cache/conftool/dbconfig/20231121-143619-arnaudb.json
[14:36:21] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:23] <stashbot>	 T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224
[14:36:28] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:36:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:36:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T348183)', diff saved to https://phabricator.wikimedia.org/P53676 and previous config saved to /var/cache/conftool/dbconfig/20231121-143640-arnaudb.json
[14:36:43] <wikibugs>	 (03Merged) 10jenkins-bot: mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[14:36:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED
[14:37:00] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[14:37:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: centralserver: probe syslog receiver with client auth [puppet] - 10https://gerrit.wikimedia.org/r/976236 (https://phabricator.wikimedia.org/T351710)
[14:37:37] <Lucas_WMDE>	 ok lag is going up a bit (1s, 3s for some clouddb), nothing bad yet I’d say
[14:37:56] <Lucas_WMDE>	 (also the script waits for replication of course. I’m just double-checking to be extra safe ^^)
[14:38:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:38:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:26] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[3008-3009].esams.wmnet} and A:lvs (T351069)
[14:38:26] <Lucas_WMDE>	 yeah I think this is looking healthy so far… might even finish before the window is over
[14:38:32] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[14:39:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] centralserver: remove icinga tls listener check [puppet] - 10https://gerrit.wikimedia.org/r/976234 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[14:39:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1112.eqiad.wmnet
[14:39:45] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1112.eqiad.wmnet
[14:40:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] centralserver: probe syslog receiver with client auth [puppet] - 10https://gerrit.wikimedia.org/r/976236 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[14:40:24] <xSavitar>	 Lucas_WMDE, not that bad. Averagely 1s lag
[14:41:06] <wikibugs>	 (03PS1) 10Eevans: install_server: do fully unattended aqs installs [puppet] - 10https://gerrit.wikimedia.org/r/976238 (https://phabricator.wikimedia.org/T347738)
[14:41:24] <Lucas_WMDE>	 yeah, it’s staying stable between 0 and 1 s
[14:41:54] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED
[14:41:56] <fabfur>	 !log swapped cp1112 <-> cp1087 (T349244)
[14:42:00] <Lucas_WMDE>	 ok, it’s halfway done already \o/
[14:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:06] <stashbot>	 T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244
[14:42:14] <xSavitar>	 \o/
[14:43:05] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1113.eqiad.wmnet
[14:43:06] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1113.eqiad.wmnet
[14:43:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:43:14] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1015.eqiad.wmnet with reason: host reimage
[14:43:42] <Lucas_WMDE>	 heh, you can definitely see the rows written on https://grafana.wikimedia.org/d/000000278/mysql-aggregated?from=now-1h&to=now go up
[14:43:54] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED
[14:44:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED
[14:44:01] <xSavitar>	 oh my :D
[14:44:06] <kostajh>	 Lucas_WMDE: are you still deploying?
[14:44:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:44:14] <Lucas_WMDE>	 kostajh: still running a maintenance script
[14:44:19] <fabfur>	 !log swapped cp1113 <-> cp1088 (T349244)
[14:44:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[14:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:27] <Lucas_WMDE>	 should be done within 10 minutes I think
[14:44:28] <kostajh>	 ok
[14:44:35] <Lucas_WMDE>	 but it probably doesn’t have to block anything else, strictly speaking
[14:44:39] <kostajh>	 I have a patch from the morning window that didn't get through
[14:44:47] <xSavitar>	 s8 says > 66k wr/s :D, insane numbers
[14:44:49] <kostajh>	 it's beta only, so I'd like to sync it if possible
[14:44:52] * Lucas_WMDE looks
[14:44:58] <kostajh>	 https://gerrit.wikimedia.org/r/c/975270/
[14:45:05] <Lucas_WMDE>	 yeah I think that’s fine
[14:45:10] <Lucas_WMDE>	 should I `scap backport` it?
[14:45:13] <Lucas_WMDE>	 oh!
[14:45:19] <kostajh>	 Lucas_WMDE: yes please.
[14:45:27] <Lucas_WMDE>	 !log T350224 maintenance script finished (8m46s real time)
[14:45:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:32] <stashbot>	 T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224
[14:45:47] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): [betalabs] ReportIncident: Relax rate limiting for reportincident action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan)
[14:45:49] <kostajh>	 Lucas_WMDE: I'll move it into this calendar block on the Deployment page
[14:45:55] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1015.eqiad.wmnet with reason: host reimage
[14:45:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan)
[14:45:57] <Lucas_WMDE>	 ok, thanks
[14:46:07] <jinxer-wm>	 (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:46:26] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[14:46:33] <_joe_>	 !incidents
[14:46:33] <sirenbot>	 4274 (UNACKED)  [3x] ProbeDown sre (probes/service eqiad)
[14:46:38] <kamila_>	 wheeeeee
[14:46:41] <_joe_>	 !ack 4274
[14:46:42] <sirenbot>	 4274 (ACKED)  [3x] ProbeDown sre (probes/service eqiad)
[14:46:49] <wikibugs>	 (03Merged) 10jenkins-bot: [betalabs] ReportIncident: Relax rate limiting for reportincident action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan)
[14:47:03] <_joe_>	 the failing probe is eqiad/upload
[14:47:10] <wikibugs>	 10SRE, 10Observability-Logging: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10fgiunchedi) Yes I still see the errors:  ` Nov 21, 2023 @ 14:45:47.621 prometheus1005 target=[2620:0:861:102:10:64...
[14:47:31] <sukhe>	 hello
[14:47:38] <_joe_>	 and specifically ipv6?
[14:47:40] <Lucas_WMDE>	 kostajh: pulled to deploy2002, should show up in beta soon
[14:47:41] * Lucas_WMDE done
[14:48:00] <kostajh>	 Lucas_WMDE: thank you
[14:48:18] <_joe_>	 sukhe: do you see something in the eqiad upload dashboards that would justify the probe failure?
[14:48:22] <jinxer-wm>	 (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:27] <godog>	 _joe_: not only v6 afaics, also v4
[14:48:35] <godog>	 I'm looking at alerts.w.o
[14:49:00] <_joe_>	 godog: oh ok I was looking at prometheus
[14:49:00] <fabfur>	 I've just swapped a cp host in eqiad for upload but should have no impact 
[14:49:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:49:13] <sukhe>	 I am wondering if the recent cp host swap might have had something to do with it, but unlikely
[14:49:15] <vgutierrez>	 we had a nice spike on eqiad
[14:49:16] <sukhe>	 looking
[14:49:17] <_joe_>	 fabfur: uh I'd say it did 
[14:49:24] <_joe_>	 :P
[14:49:27] <vgutierrez>	 https://grafana.wikimedia.org/goto/pmHeCXIIz?orgId=1
[14:49:36] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: cluster::cloud_management
[14:49:56] <_joe_>	 ah so the lbs
[14:50:51] <_joe_>	 I was looking at https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?orgId=1&var-site=eqiad&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[14:50:58] <_joe_>	 and nothing really stood out immediately
[14:51:17] <sukhe>	 if it was the above, then the spike should have been obvious
[14:51:17] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[14:51:30] <sukhe>	 acked
[14:51:55] <fabfur>	 I can revert the change anyway
[14:52:03] <_joe_>	 fabfur: nah 
[14:52:03] <fabfur>	 cp1113 - cp1088
[14:52:04] <sukhe>	 fabfur: no
[14:52:06] <wikibugs>	 (03PS1) 10Jbond: cluster::cloud_management: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976239 (https://phabricator.wikimedia.org/T349619)
[14:52:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:52:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[14:52:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[14:52:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[14:52:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[14:52:50] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[14:52:50] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[14:53:01] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[14:53:01] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[14:53:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cluster::cloud_management: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976239 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[14:53:24] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:30] <wikibugs>	 (03PS3) 10Hnowlan: service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796)
[14:55:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975913 (https://phabricator.wikimedia.org/T349875) (owner: 10Eevans)
[14:56:18] <jinxer-wm>	 (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[14:57:06] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "very nice, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[14:57:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:57:15] <wikibugs>	 (03PS1) 10Elukey: Move kafka-main1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976241 (https://phabricator.wikimedia.org/T349619)
[14:57:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] install_server: do fully unattended aqs installs [puppet] - 10https://gerrit.wikimedia.org/r/976238 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[14:57:36] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cluster::cloud_management
[14:58:43] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:00:03] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dumps::distribution::server
[15:00:42] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:00:51] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10Vgutierrez) @fgiunchedi seems like a mismatch on configured curves between clients and servers, could I suggest providing a more detailed TLS configuration for both rsy...
[15:01:17] <jinxer-wm>	 (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:01:52] <wikibugs>	 (03PS1) 10Jbond: dumps::distribution::server: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976243 (https://phabricator.wikimedia.org/T349619)
[15:02:03] <sukhe>	 !incidents
[15:02:03] <sirenbot>	 4274 (ACKED)  [3x] ProbeDown sre (probes/service eqiad)
[15:02:03] <sirenbot>	 4275 (ACKED)  NELHigh sre (tcp.timed_out)
[15:02:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] dumps::distribution::server: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976243 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:02:54] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:04:46] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1015.eqiad.wmnet with OS bullseye
[15:04:53] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1015.eqiad.wmnet
[15:04:54] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1015.eqiad.wmnet
[15:06:17] <jinxer-wm>	 (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:06:50] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::distribution::server
[15:07:15] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::db::wikireplicas::web_multiinstance
[15:07:52] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install_server: do fully unattended aqs installs [puppet] - 10https://gerrit.wikimedia.org/r/976238 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[15:09:10] <wikibugs>	 (03PS1) 10Jbond: wmcs::db::wikireplicas::web_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976246 (https://phabricator.wikimedia.org/T349619)
[15:09:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::db::wikireplicas::web_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976246 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:12:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye
[15:13:28] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::db::wikireplicas::web_multiinstance
[15:13:31] <wikibugs>	 (03PS1) 10Ssingh: depool eqiad for upload-addrs [dns] - 10https://gerrit.wikimedia.org/r/976247
[15:14:57] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::db::wikireplicas::analytics_multiinstance
[15:15:51] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[15:18:23] <jinxer-wm>	 (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:18:57] <wikibugs>	 (03PS1) 10Jbond: wmcs::db::wikireplicas::analytics_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976250 (https://phabricator.wikimedia.org/T349619)
[15:19:40] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: prepend_as_out for eqiad/eqord [homer/public] - 10https://gerrit.wikimedia.org/r/976251
[15:20:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::db::wikireplicas::analytics_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976250 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:21:07] <jinxer-wm>	 (ProbeDown) resolved: (3) Service upload-https:443 has failed probes (http_upload-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:21:13] <_joe_>	 oh heh
[15:21:15] <_joe_>	 no need
[15:21:15] <fabfur>	 !log depooled cp1113 
[15:21:16] <sukhe>	 ha
[15:21:17] <jinxer-wm>	 (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:17] <wikibugs>	 (03Abandoned) 10Ssingh: sites.yaml: prepend_as_out for eqiad/eqord [homer/public] - 10https://gerrit.wikimedia.org/r/976251 (owner: 10Ssingh)
[15:22:38] <wikibugs>	 (03Abandoned) 10Ssingh: depool eqiad for upload-addrs [dns] - 10https://gerrit.wikimedia.org/r/976247 (owner: 10Ssingh)
[15:23:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[15:24:01] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::db::wikireplicas::analytics_multiinstance
[15:24:53] <wikibugs>	 (03CR) 10Jbond: "PCC: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971476/19" [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:25:16] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::cloudlb
[15:26:17] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[15:27:24] <wikibugs>	 (03PS1) 10Jbond: wmcs::cloudlb: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976253 (https://phabricator.wikimedia.org/T349619)
[15:27:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED
[15:29:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::cloudlb: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976253 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:32:51] <icinga-wm>	 RECOVERY - snapshot of s1 in eqiad on backupmon1001 is OK: Last snapshot for s1 at eqiad (db1140) taken on 2023-11-21 13:43:40 (1220 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:33:57] <wikibugs>	 (03PS3) 10Ssingh: P:dns::auth::update: add support for setting ferm rules via confd [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054)
[15:33:59] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054)
[15:34:37] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::cloudlb
[15:34:45] <icinga-wm>	 RECOVERY - snapshot of s1 in codfw on backupmon1001 is OK: Last snapshot for s1 at codfw (db2141) taken on 2023-11-21 13:38:03 (1205 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:35:15] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:37:01] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::cloudgw
[15:38:12] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:38:36] <wikibugs>	 (03PS1) 10Jbond: wmcs::cloudgw: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976256 (https://phabricator.wikimedia.org/T349619)
[15:39:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::cloudgw: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976256 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:43:30] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::cloudgw
[15:44:28] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: insetup::wmcs
[15:46:29] <wikibugs>	 (03PS1) 10Jbond: insetup::wmcs: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976258 (https://phabricator.wikimedia.org/T349619)
[15:46:45] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:03] <fabfur>	 !log repooled cp1088
[15:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] insetup::wmcs: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976258 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:48:01] <jynus>	 Lucas_WMDE: backups are running today, during the day
[15:48:24] <Lucas_WMDE>	 ok
[15:48:29] <Lucas_WMDE>	 and replication is stopped while the backup runs?
[15:48:32] <jynus>	 those are not mediawiki servers, they are backup sources dbs
[15:48:40] <jynus>	 on some hosts yes, to speed up backup
[15:49:13] <jynus>	 they would alert otherwise, if they are not alerting through icinga, it means stopping them is a normal operation
[15:50:12] <Lucas_WMDE>	 alright, thanks
[15:50:53] <jynus>	 please note that mw is only about 2/3 of dbs; there are often dbs that are depooled or used for other functions (wikireplicas, analytics, backups, even of s* sections)
[15:51:30] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::wmcs
[15:53:23] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:25] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:53:51] <icinga-wm>	 RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1150) taken on 2023-11-21 13:50:42 (1749 GiB, -0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:55:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED
[15:55:50] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[15:55:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:55:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:56:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:56:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:56:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[15:57:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:35] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:58:46] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Move kafka-main1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976241 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey)
[15:59:09] <wikibugs>	 10Puppet, 10Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (10Jdlrobson)
[15:59:32] <wikibugs>	 (03PS2) 10Jdlrobson: Filter translation service errors [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T351738)
[15:59:48] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[16:00:04] <jouncebot>	 eoghan, jelto, and arnoldokoth: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1600).
[16:00:28] <wikibugs>	 (03PS1) 10Jbond: vrts: use correct variable [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008)
[16:02:41] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate cloud-private svc ips to wiki replicas - taavi@cumin1001"
[16:03:31] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate cloud-private svc ips to wiki replicas - taavi@cumin1001"
[16:03:31] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:04:36] <wikibugs>	 (03PS2) 10Jbond: vrts: use correct variable [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008)
[16:05:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp1113
[16:05:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1113
[16:05:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] vrts: use correct variable [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[16:05:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/618/console" [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[16:06:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Move kafka-main1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976241 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey)
[16:06:59] <icinga-wm>	 RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2097) taken on 2023-11-21 15:06:15 (948 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:07:35] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) Thank you @Vgutierrez for the suggestion, I've dug a little bit into the situation and the code and I believe the message is a red-herring, in the sense tha...
[16:07:44] <wikibugs>	 (03PS21) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[16:07:47] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.migrate-host for host kafka-main1001.eqiad.wmnet
[16:08:55] <wikibugs>	 10SRE, 10Observability-Logging, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10fgiunchedi)
[16:11:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host kafka-main1001.eqiad.wmnet
[16:11:59] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10Vgutierrez) nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites
[16:13:20] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10hashar) >>! In T349619#9348943, @Jelto wrote: > @hashar `gerrit2002` was migrated to puppet7. I restarted gerrit and apache processes and the instance looks...
[16:16:45] <icinga-wm>	 RECOVERY - snapshot of s8 in eqiad on backupmon1001 is OK: Last snapshot for s8 at eqiad (db1171) taken on 2023-11-21 14:22:51 (1495 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:19:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:21:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:22:04] <wikibugs>	 (03PS1) 10Jbond: pki: add mtls profile [puppet] - 10https://gerrit.wikimedia.org/r/976267 (https://phabricator.wikimedia.org/T351624)
[16:22:35] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:44] <wikibugs>	 (03PS1) 10Majavah: wikimedia.cloud: include zone file for svc records [dns] - 10https://gerrit.wikimedia.org/r/976268
[16:24:19] <icinga-wm>	 RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2098) taken on 2023-11-21 14:28:29 (1536 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:24:29] <wikibugs>	 (03CR) 10Reedy: [C: 03+1] mediawiki::packages: Drop python-pil [puppet] - 10https://gerrit.wikimedia.org/r/976202 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff)
[16:24:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/621/con" [puppet] - 10https://gerrit.wikimedia.org/r/976267 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond)
[16:24:53] <wikibugs>	 (03CR) 10Majavah: "Not sure if there's a better way to do this? The file names feel a bit odd." [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah)
[16:26:27] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah)
[16:28:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] pki: add mtls profile [puppet] - 10https://gerrit.wikimedia.org/r/976267 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond)
[16:28:15] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:28:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:29:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:58] <wikibugs>	 (03PS25) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[16:30:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[16:30:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:31:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] wikimedia.cloud: include zone file for svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah)
[16:33:04] <vgutierrez>	 !log updating pybal to 1.5.14 on eqiad - T351069
[16:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:17] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[16:33:35] <wikibugs>	 (03CR) 10Dwisehaupt: "A couple more minor changes to make this production ready:" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[16:34:15] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020.eqiad.wmnet} and A:lvs (T351069)
[16:34:44] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020.eqiad.wmnet} and A:lvs (T351069)
[16:34:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:34:59] <wikibugs>	 (03PS26) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[16:35:11] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:02] <wikibugs>	 (03PS6) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427)
[16:36:04] <wikibugs>	 (03PS12) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427)
[16:36:14] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1017-1019].eqiad.wmnet} and A:lvs (T351069)
[16:36:25] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:36:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:36:54] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] wikimedia.cloud: include zone file for svc records [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah)
[16:37:14] <wikibugs>	 (03PS1) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624)
[16:37:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[16:38:33] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:39:07] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[16:41:04] <wikibugs>	 (03PS2) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624)
[16:41:12] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[16:41:45] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[1017-1019].eqiad.wmnet} and A:lvs (T351069)
[16:41:50] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[16:41:59] <wikibugs>	 (03PS7) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427)
[16:42:01] <wikibugs>	 (03PS13) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427)
[16:43:15] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:43:26] <wikibugs>	 (03CR) 10Majavah: Add wiki replicas to cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[16:44:21] <icinga-wm>	 RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2023-11-21 15:19:40 (1150 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:44:39] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[16:44:40] <wikibugs>	 (03PS3) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624)
[16:46:36] <wikibugs>	 (03CR) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[16:47:34] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:48:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/626/con" [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond)
[16:49:45] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "reasoning looks good but please get rzl to check it as well :)" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[16:50:25] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) @fgiunchedi [[ https://gerrit.wikimedia.org/r/c/operations/puppe...
[16:51:29] <icinga-wm>	 RECOVERY - snapshot of s4 in codfw on backupmon1001 is OK: Last snapshot for s4 at codfw (db2099) taken on 2023-11-21 14:01:28 (1797 GiB, -0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:53:37] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) p:05Triage→03Medium
[16:54:35] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:14] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976276
[16:55:27] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976276 (owner: 10Ebernhardson)
[16:55:59] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:19] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976276 (owner: 10Ebernhardson)
[16:57:19] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:00:04] <jouncebot>	 jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1700)
[17:00:04] <jouncebot>	 phuedx: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:08] <wikibugs>	 (03PS1) 10Volans: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278
[17:00:10] <wikibugs>	 (03PS1) 10Volans: tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279
[17:00:38] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[17:00:43] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[17:00:59] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:01:40] <rzl>	 phuedx: are you here under some other name? :)
[17:03:26] <wikibugs>	 (03CR) 10BCornwall: "We're going to wait until there's been more data collection so we have a more complete picture of the SLO." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[17:03:34] <wikibugs>	 (03CR) 10BCornwall: "We're going to wait until there's been more data collection so we have a more complete picture of the SLO." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[17:03:55] <wikibugs>	 (03PS2) 10Volans: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278
[17:04:18] <wikibugs>	 (03PS3) 10Volans: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278
[17:04:20] <wikibugs>	 (03PS2) 10Volans: tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279
[17:04:32] <wikibugs>	 (03CR) 10BCornwall: acme-chief: Remove acmechief2002 passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:04:38] <wikibugs>	 (03Abandoned) 10BCornwall: acme-chief: Remove acmechief2002 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:06:53] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:06:59] <wikibugs>	 (03CR) 10Jbond: "one more nit about file locations i missed. ultimately this is down to how and where you want files so feel free to just close it down" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[17:07:13] <phuedx>	 o/
[17:07:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 (owner: 10Volans)
[17:08:04] <RhinosF1>	 rzl: 
[17:08:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 (owner: 10Volans)
[17:08:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:09:59] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:10:41] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) 05Open→03Resolved a:03Dzahn I am going to be bold and call it resolved. Based on my previous comments. We created a Hiera k...
[17:10:51] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn)
[17:12:00] <rzl>	 phuedx: hello! these look good on their face but I don't know this system well -- are you able to test them post-merge and make sure everything's in good shape?
[17:12:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[17:13:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1168']
[17:13:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1169']
[17:13:31] <wikibugs>	 (03PS1) 10JHathaway: rsync: ensure daemon is started after config [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830)
[17:13:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1170']
[17:13:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171']
[17:13:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172']
[17:13:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1173']
[17:14:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1174']
[17:14:10] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[17:14:25] <phuedx>	 rzl: I guess the good news and the bad news is that they _should_ be no-ops. I can check in #wikimedia-analytics that the legacy EventLogging refinement systems are still functioning
[17:14:26] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171']
[17:14:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172']
[17:14:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171']
[17:14:54] <rzl>	 haha sure -- I just want to have good confidence that there isn't some other unintended effect somewhere
[17:14:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172']
[17:14:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171']
[17:15:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172']
[17:15:15] <rzl>	 do they need to be merged in any particular order or should I just go for it?
[17:15:43] <phuedx>	 Just go for it :) Any particular order should be fine
[17:15:55] <rzl>	 👍
[17:16:29] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: bugfix mysql_upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674)
[17:17:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rsync: ensure daemon is started after config [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[17:17:49] <wikibugs>	 (03CR) 10Volans: [C: 03+2] requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 (owner: 10Volans)
[17:17:53] <wikibugs>	 (03CR) 10Volans: [C: 03+2] tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 (owner: 10Volans)
[17:18:52] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] eventlogging: Remove obsolete FeaturePolicyViolation schema [puppet] - 10https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572) (owner: 10Krinkle)
[17:18:55] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:19:09] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Stop refining SpecialMuteSubmit events [puppet] - 10https://gerrit.wikimedia.org/r/894000 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx)
[17:19:15] <Amir1>	 Lucas_WMDE: I don't know if you got an answer, but if you see two replicas not replicating (they are red, not yellow) and there is one per DC, it usually means a backup is running and is nothing to worry about (double-check whether they are pooled in https://noc.wikimedia.org/dbconfig/eqiad.json)
[17:19:35] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:19:40] <Amir1>	 sorry for a really long response
[17:19:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:00] <wikibugs>	 (03CR) 10Volans: mariadb: bugfix mysql_upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[17:20:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1168']
[17:20:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1169']
[17:20:24] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1170']
[17:20:34] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] mariadb: bugfix mysql_upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[17:21:17] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1173']
[17:21:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[17:21:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1159']
[17:21:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1160']
[17:21:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1159']
[17:22:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1160']
[17:22:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1160']
[17:22:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1160']
[17:22:39] <wikibugs>	 (03Merged) 10jenkins-bot: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 (owner: 10Volans)
[17:23:01] <phuedx>	 rzl: Confirmed in #wikimedia-analytics that there's alerting set up for the affected system. We'll get emails shortly if something breaks as a result of these changes
[17:23:10] <rzl>	 okay great
[17:23:16] <rzl>	 puppet's just finishing up
[17:23:37] <rzl>	 there it goes! both patches merged, and I ran puppet on an-launcher1002,eventlog1003 -- did I miss any? and how's everything looking?
[17:23:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED
[17:23:45] <wikibugs>	 (03PS10) 10Jbond: C:rsync::server:  convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618)
[17:23:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:19] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:24:38] <wikibugs>	 (03CR) 10Jbond: C:rsync::server:  convert to concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond)
[17:24:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:01] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: bugfix mysql_upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[17:25:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:07] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:10] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:30] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+2] mariadb: bugfix mysql_upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[17:25:32] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+2 C: 03+2] mariadb: bugfix mysql_upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[17:25:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED
[17:26:03] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:26:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED
[17:26:06] <wikibugs>	 (03CR) 10Jbond: "hi i think i have pretty much the exact same change :) but also with the spec tests fixed (i think, just rebased so will see)" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[17:26:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED
[17:26:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED
[17:26:13] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:26:21] <wikibugs>	 (03Merged) 10jenkins-bot: tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 (owner: 10Volans)
[17:26:55] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED
[17:26:59] <wikibugs>	 (03PS11) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172
[17:27:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED
[17:27:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED
[17:27:16] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED
[17:27:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED
[17:27:57] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED
[17:29:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED
[17:29:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED
[17:29:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED
[17:29:46] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED
[17:29:51] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED
[17:30:09] <wikibugs>	 (03CR) 10JHathaway: rsync: ensure daemon is started after config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[17:30:25] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:30:33] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) (owner: 10Ilias Sarantopoulos)
[17:30:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED
[17:30:41] <icinga-wm>	 RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2097) taken on 2023-11-21 16:57:23 (443 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[17:31:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[17:31:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1159']
[17:31:44] <phuedx>	 rzl: I don't think that there are any others. I'll keep an eye out for alerts about the legacy EventLogging refinement :)
[17:31:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1160']
[17:31:57] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:31:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1161']
[17:32:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1162']
[17:32:08] <rzl>	 phuedx: okay sounds good! I'll be around if you need any followups merged
[17:32:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1163']
[17:32:15] <phuedx>	 Thanks <3 
[17:32:19] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Increase the amount of maximum volumes for regular backups to 140 [puppet] - 10https://gerrit.wikimedia.org/r/976286 (https://phabricator.wikimedia.org/T351725)
[17:32:22] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1161']
[17:32:23] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1162']
[17:32:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1164']
[17:33:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1165']
[17:33:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1166']
[17:35:03] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:35:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1155 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:46] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/output/971476/619/" [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[17:36:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:36:15] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:52] <wikibugs>	 (03PS9) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173
[17:37:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1163']
[17:37:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 (owner: 10Jbond)
[17:37:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:37:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1159']
[17:38:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1160']
[17:38:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1167']
[17:38:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1168']
[17:38:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1169']
[17:39:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1165']
[17:39:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1164']
[17:40:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1170']
[17:40:03] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:40:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 1.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:40:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171']
[17:40:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171']
[17:40:17] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:40:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171']
[17:40:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171']
[17:40:32] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[17:40:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1166']
[17:40:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172']
[17:40:54] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172']
[17:42:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171']
[17:42:37] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171']
[17:42:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172']
[17:42:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1173']
[17:43:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172']
[17:43:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1174']
[17:44:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond)
[17:44:38] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1167']
[17:44:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1168']
[17:44:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1169']
[17:44:57] <icinga-wm>	 RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1225) taken on 2023-11-21 17:10:07 (382 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[17:46:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:46:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1170']
[17:48:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:48:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:48:29] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:49:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1173']
[17:50:24] <Amir1>	 jouncebot: nowandnext
[17:50:24] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1700)
[17:50:24] <jouncebot>	 In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1800)
[17:52:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Undeploy DoubleWiki, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[17:52:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[17:53:22] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy DoubleWiki, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[17:53:37] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:975912|Undeploy DoubleWiki, Part I (T351675)]]
[17:53:42] <stashbot>	 T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675
[17:53:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1174']
[17:54:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1174']
[17:54:24] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Increase the amount of maximum volumes for regular backups to 140 [puppet] - 10https://gerrit.wikimedia.org/r/976286 (https://phabricator.wikimedia.org/T351725) (owner: 10Jcrespo)
[17:54:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr)
[17:54:56] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:975912|Undeploy DoubleWiki, Part I (T351675)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:56:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[17:56:17] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[17:56:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:56:37] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:58:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating restbase servers in codfw - jhancock@cumin2002"
[17:59:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:59:13] <wikibugs>	 (03PS1) 10Ladsgroup: Undeploy DoubleWiki, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976288 (https://phabricator.wikimedia.org/T351675)
[17:59:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating restbase servers in codfw - jhancock@cumin2002"
[17:59:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:59:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1800)
[18:02:05] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:975912|Undeploy DoubleWiki, Part I (T351675)]] (duration: 08m 27s)
[18:02:24] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[18:02:28] <stashbot>	 T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675
[18:02:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976288 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[18:03:12] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy DoubleWiki, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976288 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[18:03:28] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:976288|Undeploy DoubleWiki, Part II (T351675)]]
[18:03:46] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:04:44] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:976288|Undeploy DoubleWiki, Part II (T351675)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:06:09] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[18:10:44] <wikibugs>	 (03PS1) 10Jclark-ctr: Add an-worker1157-75.yaml file T349936 [puppet] - 10https://gerrit.wikimedia.org/r/976289 (https://phabricator.wikimedia.org/T349936)
[18:11:29] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] Add an-worker1157-75.yaml file T349936 [puppet] - 10https://gerrit.wikimedia.org/r/976289 (https://phabricator.wikimedia.org/T349936) (owner: 10Jclark-ctr)
[18:11:52] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:976288|Undeploy DoubleWiki, Part II (T351675)]] (duration: 08m 24s)
[18:12:13] <stashbot>	 T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675
[18:13:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1158']
[18:13:52] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1158']
[18:14:34] <wikibugs>	 (03PS1) 10Ladsgroup: Undeploy DoubleWiki, Part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976291 (https://phabricator.wikimedia.org/T351675)
[18:15:15] <icinga-wm>	 RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1216) taken on 2023-11-21 17:03:48 (558 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[18:15:19] <jynus>	 !log restart of bacula-sd on backup1009 T351725
[18:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:37] <stashbot>	 T351725: Daily backup job not running for gerrit1003 - https://phabricator.wikimedia.org/T351725
[18:15:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976291 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[18:16:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:16:27] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy DoubleWiki, Part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976291 (https://phabricator.wikimedia.org/T351675) (owner: 10Ladsgroup)
[18:16:43] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:976291|Undeploy DoubleWiki, Part III (T351675)]]
[18:16:47] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:53] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:18:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye
[18:18:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[18:25:37] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:25:41] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:29:48] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:976291|Undeploy DoubleWiki, Part III (T351675)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:29:53] <stashbot>	 T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675
[18:30:46] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[18:31:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:31:51] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:13] <icinga-wm>	 RECOVERY - Check systemd state on ganeti5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:38:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye
[18:38:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye
[18:38:41] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:40:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:40:53] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:41:23] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:41:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:42:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:42:24] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:976291|Undeploy DoubleWiki, Part III (T351675)]] (duration: 25m 41s)
[18:42:28] <stashbot>	 T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675
[18:42:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:44:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:44:57] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:05] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:45:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:45:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:45:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:23] <icinga-wm>	 PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:46:27] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1090 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:46:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:49:39] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:49:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:50:31] <icinga-wm>	 RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:56:45] <icinga-wm>	 RECOVERY - snapshot of s5 in codfw on backupmon1001 is OK: Last snapshot for s5 at codfw (db2101) taken on 2023-11-21 17:45:41 (679 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[18:57:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:58:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:58:03] <wikibugs>	 (03PS1) 10Jbond: facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299
[18:58:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:00:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299 (owner: 10Jbond)
[19:01:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:01:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:40] <wikibugs>	 (03PS1) 10Jbond: java: update certificate name [puppet] - 10https://gerrit.wikimedia.org/r/976300
[19:03:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:03:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:04:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:04:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:53] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:05:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:05:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] java: update certificate name [puppet] - 10https://gerrit.wikimedia.org/r/976300 (owner: 10Jbond)
[19:08:01] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS bullseye
[19:08:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:09:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:09:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:09:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:31] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:10:40] <wikibugs>	 (PS1) Ottomata: refine - Put SpecialMuteSubmit and FeaturePolicyViolation in eventlogging analytics exclude list [puppet] - https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572)
[19:10:50] <wikibugs>	 (CR) Dzahn: "thanks, seems reasonable to me" [puppet] - https://gerrit.wikimedia.org/r/976286 (https://phabricator.wikimedia.org/T351725) (owner: Jcrespo)
[19:11:29] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS bullseye
[19:12:18] <wikibugs>	 (CR) Ottomata: [C: +2] refine - Put SpecialMuteSubmit and FeaturePolicyViolation in eventlogging analytics exclude list [puppet] - https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572) (owner: Ottomata)
[19:12:42] <wikibugs>	 (CR) Jbond: "i wonder if the test failures relate to the puppet version" [puppet] - https://gerrit.wikimedia.org/r/976299 (owner: Jbond)
[19:13:06] <wikibugs>	 (CR) CI reject: [V: -1] refine - Put SpecialMuteSubmit and FeaturePolicyViolation in eventlogging analytics exclude list [puppet] - https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572) (owner: Ottomata)
[19:13:46] <wikibugs>	 (PS2) Ottomata: refine - Put SpecialMuteSubmit and FeaturePolicyViolation in exclude list [puppet] - https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572)
[19:14:11] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:14:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:19] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:14:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:19:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:19:43] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:19:49] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:25:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:25:29] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:26:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[19:26:32] <wikibugs>	 (PS2) Jbond: facter: add new wmflib.site fact [puppet] - https://gerrit.wikimedia.org/r/976299
[19:26:34] <wikibugs>	 (PS1) Jbond: Gemfile: update to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976304
[19:26:39] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.635 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:26:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:28:09] <wikibugs>	 (CR) CI reject: [V: -1] Gemfile: update to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976304 (owner: Jbond)
[19:28:12] <wikibugs>	 (CR) CI reject: [V: -1] facter: add new wmflib.site fact [puppet] - https://gerrit.wikimedia.org/r/976299 (owner: Jbond)
[19:28:47] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[19:29:26] <wikibugs>	 (PS3) Jbond: facter: add new wmflib.site fact [puppet] - https://gerrit.wikimedia.org/r/976299
[19:33:04] <wikibugs>	 (CR) CI reject: [V: -1] facter: add new wmflib.site fact [puppet] - https://gerrit.wikimedia.org/r/976299 (owner: Jbond)
[19:33:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:34:40] <wikibugs>	 (CR) Ebernhardson: [C: +1] Add alert for CirrusSearch reported memory issues [puppet] - https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: Ebernhardson)
[19:35:39] <wikibugs>	 (CR) DCausse: [C: +1] query_service: add monitoring for ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:36:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:38:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1157.eqiad.wmnet with OS bullseye
[19:38:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[19:41:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:41:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:41:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:42:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:44:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:44:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:46:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1137 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:46:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:46:53] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:48:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:49:35] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1017.eqiad.wmnet with OS bullseye
[19:49:59] <icinga-wm>	 RECOVERY - snapshot of s3 in eqiad on backupmon1001 is OK: Last snapshot for s3 at eqiad (db1150) taken on 2023-11-21 17:44:38 (1318 GiB, +0.5 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:50:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:50:19] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:55] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS bullseye
[19:54:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:54:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:58:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1158.eqiad.wmnet with OS bullseye
[19:58:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[19:59:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T348183)', diff saved to https://phabricator.wikimedia.org/P53679 and previous config saved to /var/cache/conftool/dbconfig/20231121-195920-arnaudb.json
[19:59:27] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:01:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:02:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:02:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[20:03:10] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:04:50] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:05:46] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[20:06:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:06:51] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:08:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:08:13] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:12:59] <wikibugs>	 (PS1) Subramanya Sastry: ParserOutputPostCacheTransform: Don't reprocess content [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461)
[20:14:02] <wikibugs>	 (PS1) Subramanya Sastry: [parser] Broaden TOC placeholder regular expression [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976331
[20:14:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P53680 and previous config saved to /var/cache/conftool/dbconfig/20231121-201427-arnaudb.json
[20:16:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:16:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:18:36] <wikibugs>	 (CR) C. Scott Ananian: [C: +1] "backporting to unblock visual diff testing of parsoid read views" [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976331 (owner: Subramanya Sastry)
[20:18:53] <wikibugs>	 (CR) C. Scott Ananian: [C: +1] "backporting to unblock visual diff testing of parsoid read views" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) (owner: Subramanya Sastry)
[20:19:19] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[20:19:36] <cscott>	 RoanKattouw: CTT added a pair of late backport patches to the window in ~50 minutes
[20:20:07] <cscott>	 I'll be here for the first part of the window but I'll need to leave to bring a kid to his clarinet lesson; subbu will be the primary on-call for the backport.
[20:20:30] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Add alert for CirrusSearch reported memory issues [puppet] - https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: Ebernhardson)
[20:24:29] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:27:07] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1018.eqiad.wmnet with OS bullseye
[20:29:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P53681 and previous config saved to /var/cache/conftool/dbconfig/20231121-202933-arnaudb.json
[20:31:18] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1018.eqiad.wmnet
[20:31:19] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1018.eqiad.wmnet
[20:32:34] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS bullseye
[20:32:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:34:26] <wikibugs>	 (PS1) Ryan Kemper: wdqs: remove old nginx-level bans [puppet] - https://gerrit.wikimedia.org/r/976308
[20:37:04] <wikibugs>	 (PS2) Ryan Kemper: wdqs: remove old nginx-level bans [puppet] - https://gerrit.wikimedia.org/r/976308
[20:39:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:41:28] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1019.eqiad.wmnet with OS bullseye
[20:41:42] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS bullseye
[20:44:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T348183)', diff saved to https://phabricator.wikimedia.org/P53682 and previous config saved to /var/cache/conftool/dbconfig/20231121-204440-arnaudb.json
[20:44:42] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[20:44:47] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:44:55] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[20:45:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53683 and previous config saved to /var/cache/conftool/dbconfig/20231121-204501-arnaudb.json
[20:45:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:46:16] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:22] <icinga-wm>	 RECOVERY - snapshot of s7 in eqiad on backupmon1001 is OK: Last snapshot for s7 at eqiad (db1171) taken on 2023-11-21 18:16:13 (1105 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[20:52:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:52:23] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage
[20:53:00] <mutante>	 !log gerrit1003 - deleted /root/backup_of_srv_gerrit_plugins - disk usage down to 56% (T351658)
[20:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:02] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage
[20:55:48] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:55:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:59:42] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T2100). nyaa~
[21:00:05] <jouncebot>	 subbu and subbu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:00:22] <subbu>	 o/
[21:01:34] <RoanKattouw>	 I can deploy
[21:02:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976331 (owner: Subramanya Sastry)
[21:03:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:03:44] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:04:26] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:04:28] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:05:22] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:30] <icinga-wm>	 RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2098) taken on 2023-11-21 18:36:28 (1269 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[21:14:20] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1019.eqiad.wmnet with OS bullseye
[21:15:59] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye
[21:16:24] <wikibugs>	 (03Merged) 10jenkins-bot: [parser] Broaden TOC placeholder regular expression [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/976331 (owner: 10Subramanya Sastry)
[21:16:39] <logmsgbot>	 !log catrope@deploy2002 Started scap: Backport for [[gerrit:976331|[parser] Broaden TOC placeholder regular expression]]
[21:16:42] <wikibugs>	 (PS1) Ssingh: pybal: do not install from component [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837)
[21:18:02] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/627/con" [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[21:18:03] <logmsgbot>	 !log catrope@deploy2002 catrope and ssastry: Backport for [[gerrit:976331|[parser] Broaden TOC placeholder regular expression]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:18:22] <subbu>	 ready to test on mwdebug?
[21:18:28] <RoanKattouw>	 subbu: Your first patch (Broaden TOC placeholder regular expression) is ready for testing
[21:18:33] <subbu>	 ok
[21:18:39] <RoanKattouw>	 Yup you beat me to it :)
[21:18:55] <subbu>	 is it on 2001.codfw?
[21:18:56] <wikibugs>	 (CR) Catrope: [C: +2] ParserOutputPostCacheTransform: Don't reprocess content [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) (owner: Subramanya Sastry)
[21:19:09] <wikibugs>	 (CR) Ssingh: [V: +1] "sukhe@apt1001:~$ sudo -i reprepro lsbycomponent pybal" [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[21:19:15] <RoanKattouw>	 It should be on all the test servers
[21:21:51] <subbu>	 ok, lgtm tested a few different ways.
[21:22:28] <icinga-wm>	 RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2139) taken on 2023-11-21 17:48:43 (1291 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[21:23:27] <subbu>	 RoanKattouw, ok to sync the core one.
[21:23:28] <logmsgbot>	 !log catrope@deploy2002 catrope and ssastry: Continuing with sync
[21:24:51] <wikibugs>	 (Merged) jenkins-bot: ParserOutputPostCacheTransform: Don't reprocess content [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) (owner: Subramanya Sastry)
[21:25:43] <RoanKattouw>	 Oh wow that DT patch merged quickly! I +2ed it earlier to speed things up since I expected it to take 15 mins like the core one
[21:26:16] <subbu>	 ya .. :) looks like extension patches have fewer gate jobs.
[21:26:17] <wikibugs>	 (CR) Ssingh: [V: +1] "There is also python-prometheus-client in the Pybal component that we are installing but I think we will leave that. (Moritz, I think you " [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[21:29:19] <logmsgbot>	 !log catrope@deploy2002 Finished scap: Backport for [[gerrit:976331|[parser] Broaden TOC placeholder regular expression]] (duration: 12m 40s)
[21:29:58] <logmsgbot>	 !log catrope@deploy2002 Started scap: Backport for [[gerrit:976330|ParserOutputPostCacheTransform: Don't reprocess content (T351461)]]
[21:30:12] <stashbot>	 T351461: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads - https://phabricator.wikimedia.org/T351461
[21:30:18] <RoanKattouw>	 Alright moving on to the DiscussionTools patch
[21:31:18] <logmsgbot>	 !log catrope@deploy2002 ssastry and catrope: Backport for [[gerrit:976330|ParserOutputPostCacheTransform: Don't reprocess content (T351461)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:31:30] <RoanKattouw>	 subbu: The DT patch is ready to test on the mwdebug servers
[21:31:39] <subbu>	 ty . .testing ..
[21:34:24] <subbu>	 hmm .. not sure it fixed anything .. testing some other pages.
[21:35:49] <subbu>	 nah ... it actually made things slightly worse for Parsoid & DT ... :)  .. let's skip this one.
[21:35:57] <RoanKattouw>	 OK we'll roll this one back
[21:35:59] <logmsgbot>	 !log catrope@deploy2002 Sync cancelled.
[21:36:02] <subbu>	 thanks.
[21:36:14] <subbu>	 i'll have to go digging elsewhere for this.
[21:36:28] <RoanKattouw>	 subbu: Please submit a revert of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/976330/ with a brief explanation of what went wrong
[21:36:41] <RoanKattouw>	 (even just one sentence after "Reason for revert:" is fine)
[21:36:49] <subbu>	 will do.
[21:37:53] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage
[21:37:56] <wikibugs>	 (PS1) Subramanya Sastry: Revert "ParserOutputPostCacheTransform: Don't reprocess content" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976332
[21:38:03] <icinga-wm>	 RECOVERY - Check systemd state on mw2442 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:11] <subbu>	 RoanKattouw, will you +2 that or should I?
[21:38:16] <RoanKattouw>	 I will
[21:38:27] <wikibugs>	 (CR) Catrope: [C: +2] Revert "ParserOutputPostCacheTransform: Don't reprocess content" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976332 (owner: Subramanya Sastry)
[21:38:56] <subbu>	 good we tried to backport it today .. i have some time to fix it tomorrow / monday. anyway, have a good evening all!
[21:40:55] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage
[21:44:15] <wikibugs>	 (Merged) jenkins-bot: Revert "ParserOutputPostCacheTransform: Don't reprocess content" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976332 (owner: Subramanya Sastry)
[21:45:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53684 and previous config saved to /var/cache/conftool/dbconfig/20231121-214534-arnaudb.json
[21:45:40] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:00:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53685 and previous config saved to /var/cache/conftool/dbconfig/20231121-220040-arnaudb.json
[22:02:25] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1020.eqiad.wmnet with OS bullseye
[22:06:07] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS bullseye
[22:07:59] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[22:10:45] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:11:08] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED
[22:12:48] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED
[22:15:19] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[22:15:45] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:15:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53686 and previous config saved to /var/cache/conftool/dbconfig/20231121-221547-arnaudb.json
[22:15:50] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED
[22:17:55] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host aqs1021.eqiad.wmnet with OS bullseye
[22:18:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS bullseye
[22:18:58] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED
[22:20:09] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED
[22:23:10] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED
[22:27:37] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:29:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage
[22:30:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53687 and previous config saved to /var/cache/conftool/dbconfig/20231121-223053-arnaudb.json
[22:31:05] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:32:09] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage
[22:34:50] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:35:12] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:43:02] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:46:16] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1035
[22:47:32] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1035
[22:48:33] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.dns.netbox
[22:48:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:49:16] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:00] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:20] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:53:22] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1021.eqiad.wmnet with OS bullseye
[22:54:34] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1021.eqiad.wmnet
[22:54:35] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1021.eqiad.wmnet
[22:55:08] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:20] <wikibugs>	 (PS2) JHathaway: rsync: ensure daemon is started after config [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830)
[22:55:41] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway)
[22:55:58] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:56:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:58:31] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001"
[22:58:34] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:59:23] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001"
[22:59:23] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:00:04] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1036
[23:00:23] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1037
[23:00:37] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1038
[23:00:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:01:08] <icinga-wm>	 PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:48] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1037
[23:01:49] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1038
[23:02:00] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1036
[23:02:40] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.dns.netbox
[23:04:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:04:43] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001"
[23:04:50] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:05:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:36] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001"
[23:05:36] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:05:52] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:08:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:10:59] <wikibugs>	 (PS1) DDesouza: Undeploy Reader Demographics 2 survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393)
[23:13:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:13:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1148 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:38] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:58] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:14:24] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:15:04] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:15:50] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:08] <wikibugs>	 (CR) Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[23:17:45] <wikibugs>	 (CR) Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[23:21:42] <icinga-wm>	 RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:22:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:23:10] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:31:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:31:26] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:54] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:32:00] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:57] <hashar>	 pff
[23:40:49] <hashar>	 so I think I might have the start of a fix for the old T282893 (which really most probably has always been there):  https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363/files
[23:40:50] <stashbot>	 T282893: Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T282893
[23:54:27] <wikibugs>	 (PS1) Brennen Bearnes: allow all images from docker-registry.tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/976355 (https://phabricator.wikimedia.org/T334512)
[23:57:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:57:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:58:01] <wikibugs>	 (CR) Brennen Bearnes: "Someone should definitely check me on this; also I'm trying to remember where else a list of allowed images lives at this point." [puppet] - https://gerrit.wikimedia.org/r/976355 (https://phabricator.wikimedia.org/T334512) (owner: Brennen Bearnes)