[00:13:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:31:10] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/974645 (owner: 10TrainBranchBot) [00:38:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/975926 [00:39:01] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/975926 (owner: 10TrainBranchBot) [01:00:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/975926 (owner: 10TrainBranchBot) [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351683 (10phaultfinder) [01:12:49] (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:34:44] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [01:35:08] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [01:35:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53646 and previous config saved to /var/cache/conftool/dbconfig/20231121-013514-arnaudb.json [01:35:20] T348183: Apply schema change for changing img_size, oi_size, us_size, and 
fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [02:03:38] (03PS2) 10MPGuy2824: Disable PageTriage's extended features on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) [02:04:10] (03CR) 10Ejegg: "Hi folks, would releng be able to deploy this some time this week? The new entry is only needed for donatewiki, but it seems there is only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: 10Ejegg) [02:05:22] (03CR) 10MPGuy2824: Disable PageTriage's extended features on beta testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [02:32:46] (03CR) 10Jsn.sherman: [C: 03+1] "looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [02:38:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0300) [03:08:23] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:19:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:20:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:21:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:22:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:24:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:25:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:33:22] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0400) [04:13:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure 
[04:35:54] (03PS1) 10KartikMistry: Enable Content/Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267) [05:12:25] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:12:49] (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:15:53] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:28:12] (03PS1) 10Marostegui: pc1014: Move to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/975946 (https://phabricator.wikimedia.org/T351285) [06:29:21] (03CR) 10Marostegui: [C: 03+2] pc1014: Move to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/975946 (https://phabricator.wikimedia.org/T351285) (owner: 10Marostegui) [06:31:33] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:33:03] jouncebot: next [06:33:03] In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700) [06:33:03] In 0 hour(s) and 26 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700) [06:37:09] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:38:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53647 and previous config saved to /var/cache/conftool/dbconfig/20231121-063827-arnaudb.json [06:38:32] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [06:48:22] (03PS2) 10KartikMistry: cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) [06:52:24] (03CR) 10Stevemunene: [C: 03+1] Remove the oozie integration from hue [puppet] - 10https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [06:53:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P53648 and previous config saved to /var/cache/conftool/dbconfig/20231121-065333-arnaudb.json [06:58:52] (03CR) 10Santhosh: [C: 03+1] cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700) [07:00:05] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0700). 
[07:05:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:05:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 33452 [07:06:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 33452 [07:08:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P53649 and previous config saved to /var/cache/conftool/dbconfig/20231121-070840-arnaudb.json [07:09:50] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:10:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:12:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS bookworm [07:22:17] (03PS1) 10Ayounsi: Don't alert for v6 AAAA for logstash and kafla-logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 [07:23:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T348183)', diff saved to 
https://phabricator.wikimedia.org/P53650 and previous config saved to /var/cache/conftool/dbconfig/20231121-072346-arnaudb.json [07:23:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:23:56] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [07:24:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:24:04] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [07:24:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [07:24:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T348183)', diff saved to https://phabricator.wikimedia.org/P53651 and previous config saved to /var/cache/conftool/dbconfig/20231121-072424-arnaudb.json [07:25:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [07:25:39] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1011.eqiad.wmnet with OS bullseye [07:27:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [07:28:40] (03CR) 10Stevemunene: [C: 03+2] switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [07:29:54] (03PS1) 10Marostegui: db2132: Remove 10.6 declaration [puppet] - 10https://gerrit.wikimedia.org/r/976135 [07:30:35] (03CR) 10Marostegui: [C: 03+2] db2132: Remove 10.6 declaration [puppet] - 10https://gerrit.wikimedia.org/r/976135 (owner: 10Marostegui) [07:33:23] (ProbeDown) firing: (4) 
Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:46:13] (03CR) 10Elukey: [C: 03+1] changeprop - fixes for beta values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata) [07:46:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS bookworm [07:47:31] (03CR) 10Vgutierrez: [C: 03+2] Release 1.15.14 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [07:47:41] (03CR) 10CI reject: [V: 04-1] Release 1.15.14 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [07:47:55] :? [07:48:18] https://integration.wikimedia.org/ci/job/fail-archived-repositories/291/console : This repository has been archived and new patches are not being accepted. 
[07:48:24] * vgutierrez feeling old this morning lol [07:50:16] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage [07:52:53] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage [07:54:26] (03Abandoned) 10Vgutierrez: Release 1.15.14 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [07:59:48] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/1 Release 1.15.14 [08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T0800). [08:00:05] awight, kart_, and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:40] hi [08:00:43] * kart_ is here [08:02:29] :wave: I'm happy to deploy unless Amir1 or urbanecm are already buckling in? [08:03:12] thanks awight [08:03:38] mine is beta only, and no need for verification so you can just `scap backport` it and move on from that one :) [08:04:02] kk let's have some fun! [08:04:28] awight: let me know when you're done. 
[08:04:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53652 and previous config saved to /var/cache/conftool/dbconfig/20231121-080455-arnaudb.json [08:05:12] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/1 Release 1.15.14 [08:05:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53653 and previous config saved to /var/cache/conftool/dbconfig/20231121-080527-arnaudb.json [08:06:50] (03PS1) 10Arnaudb: mariadb: repool db2178 [puppet] - 10https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) [08:07:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [08:07:46] (03PS2) 10Awight: Enable Reference Previews on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [08:07:52] (03CR) 10TrainBranchBot: "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [08:08:43] (03Merged) 10jenkins-bot: Enable Reference Previews on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [08:09:37] !log awight@deploy2002 Started scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] [08:09:42] T282999: Enable Reference Previews on all wikis using the Popups extension, on Nov 21 - 
https://phabricator.wikimedia.org/T282999 [08:10:27] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/2 Release 1.15.14 [08:11:02] !log awight@deploy2002 awight and wmde-fisch: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:11:29] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/pybal/-/merge_requests/3 Release 1.15.14 [08:13:48] 10SRE, 10Patch-Needs-Improvement: Install private instance of gnomon for greater SRE team - https://phabricator.wikimedia.org/T246062 (10Aklapper) a:05CDanis→03None @CDanis: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of t... 
[08:14:50] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:15:42] (03CR) 10Marostegui: mariadb: repool db2178 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:16:07] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1011.eqiad.wmnet with OS bullseye [08:16:57] (03CR) 10Arnaudb: mariadb: repool db2178 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:18:15] (03CR) 10Marostegui: [C: 03+1] mariadb: repool db2178 [puppet] - 10https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:18:43] !log awight@deploy2002 awight and wmde-fisch: Continuing with sync [08:18:47] (03CR) 10Arnaudb: [C: 03+2] mariadb: repool db2178 [puppet] - 10https://gerrit.wikimedia.org/r/975927 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:18:49] 10SRE, 10serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261 (10Aklapper) a:05jijiki→03None @jijiki: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of... [08:19:19] 10SRE, 10serviceops, 10User-jbond, 10User-jijiki: Refactor memcached modules - https://phabricator.wikimedia.org/T284454 (10Aklapper) a:05jijiki→03None @jijiki: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task... 
[08:20:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53654 and previous config saved to /var/cache/conftool/dbconfig/20231121-082000-arnaudb.json [08:20:23] 10SRE, 10serviceops, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10Aklapper) a:05Legoktm→03None @Legoktm: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am... [08:20:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53655 and previous config saved to /var/cache/conftool/dbconfig/20231121-082032-arnaudb.json [08:21:25] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10Aklapper) a:05MoritzMuehlenhoff→03None @MoritzMuehlenhoff: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/... [08:23:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Aklapper) a:05Paladox→03None @Paladox: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cl... 
[08:24:39] !log awight@deploy2002 Finished scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] (duration: 15m 02s) [08:24:44] T282999: Enable Reference Previews on all wikis using the Popups extension, on Nov 21 - https://phabricator.wikimedia.org/T282999 [08:26:05] Just had an interesting deployment failure: [08:26:06] 08:10:18 1 K8s nodes failed to pull the multiversion image [08:26:18] connect to host kubernetes2041.codfw.wmnet port 22: No route to host [08:26:24] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff There is progress, the last change only happened on October 26. This is a long standing task with low... [08:26:49] I need to try again, I guess? [08:27:08] !log awight@deploy2002 Started scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] [08:27:20] (03PS1) 10Muehlenhoff: Switch ncredir to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976149 (https://phabricator.wikimedia.org/T349619) [08:27:50] (03PS4) 10Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [08:27:58] Same error. [08:28:14] :/ [08:28:17] It's not clear to me whether a rollback will even help? [08:28:27] !log awight@deploy2002 wmde-fisch and awight: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:28:33] !log awight@deploy2002 Sync cancelled. 
[08:28:46] (03CR) 10Muehlenhoff: [C: 03+2] Switch ncredir to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976149 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:28:59] 10SRE, 10serviceops, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10Joe) 05Open→03Resolved a:03Joe Given we now have switchovers at regular intervals, we can resolve this task. There is no need to do a lot of c... [08:29:07] (03PS1) 10TrainBranchBot: Revert "Enable Reference Previews on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976151 [08:29:09] (03CR) 10TrainBranchBot: "awight@deploy2002 created a revert of this change as Ib1779fea4fb782eb61b6c84e1b01b6d6c9a3b166" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [08:29:17] (03PS5) 10Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [08:29:31] arnaudb: FYI, please see the k8s deployment failure ^ [08:29:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976151 (owner: 10TrainBranchBot) [08:29:57] seen [08:30:23] (03Merged) 10jenkins-bot: Revert "Enable Reference Previews on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976151 (owner: 10TrainBranchBot) [08:30:35] !log awight@deploy2002 Started scap: Backport for [[gerrit:976151|Revert "Enable Reference Previews on all wikis"]] [08:31:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: ncredir [08:31:51] kart_: kostajh: sorry, I think scap is broken at the moment. I'm finishing rollback and then you're free to do whatever is right for your deployments. 
[08:31:58] !log awight@deploy2002 awight and trainbranchbot: Backport for [[gerrit:976151|Revert "Enable Reference Previews on all wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:32:04] !log awight@deploy2002 awight and trainbranchbot: Continuing with sync [08:32:28] awight: no problem. I'll reschedule my deployment. [08:33:19] kostajh: Would you like the beta-only patch to go out still? I'm not sure whether scap is smart enough to skip k8s for that? [08:34:10] awight: I can manage it later, thanks [08:35:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53656 and previous config saved to /var/cache/conftool/dbconfig/20231121-083504-arnaudb.json [08:35:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53657 and previous config saved to /var/cache/conftool/dbconfig/20231121-083537-arnaudb.json [08:37:10] !log upload pybal 1.15.14 to apt.wm.o (bullseye-wikimedia) - T348837 [08:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:15] T348837: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 [08:37:43] !log awight@deploy2002 Finished scap: Backport for [[gerrit:976151|Revert "Enable Reference Previews on all wikis"]] (duration: 07m 08s) [08:38:26] !log scap window cancelled due to k8s error [08:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:09] !log updating pybal to 1.15.14 on lvs4010 - T351069 [08:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:14] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [08:49:26] (03CR) 10Ayounsi: [C: 03+1] Reset spine switch BGP to CR if max prefix tripped after 30 mins [homer/public] - 
10https://gerrit.wikimedia.org/r/975799 (https://phabricator.wikimedia.org/T349116) (owner: 10Cathal Mooney) [08:50:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53658 and previous config saved to /var/cache/conftool/dbconfig/20231121-085011-arnaudb.json [08:50:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53659 and previous config saved to /var/cache/conftool/dbconfig/20231121-085042-arnaudb.json [09:05:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53660 and previous config saved to /var/cache/conftool/dbconfig/20231121-090516-arnaudb.json [09:05:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53661 and previous config saved to /var/cache/conftool/dbconfig/20231121-090547-arnaudb.json [09:07:28] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi) [09:09:39] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[4008-4009].ulsfo.wmnet} and A:lvs (T351069) [09:09:44] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [09:10:20] !log updating pybal to 1.15.14 on ulsfo - T351069 [09:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[4008-4009].ulsfo.wmnet} and A:lvs (T351069) [09:12:49] (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:14:23] !log updating pybal to 1.15.14 on eqsin - T351069 [09:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:13] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs5006.eqsin.wmnet} and A:lvs (T351069) [09:15:18] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [09:15:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs5006.eqsin.wmnet} and A:lvs (T351069) [09:15:45] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:16:12] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[5004-5005].eqsin.wmnet} and A:lvs (T351069) [09:17:13] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[5004-5005].eqsin.wmnet} and A:lvs (T351069) [09:17:53] !log updating pybal to 1.15.14 on codfw - T351069 [09:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [09:18:08] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2014.codfw.wmnet} and A:lvs (T351069) [09:18:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2014.codfw.wmnet} and A:lvs (T351069) [09:19:13] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal 
rolling-restart of pybal on P{lvs[2011-2013].codfw.wmnet} and A:lvs (T351069) [09:20:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53662 and previous config saved to /var/cache/conftool/dbconfig/20231121-092021-arnaudb.json [09:20:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53663 and previous config saved to /var/cache/conftool/dbconfig/20231121-092052-arnaudb.json [09:22:02] <_joe_> jouncebot: nowandnext [09:22:02] No deployments scheduled for the next 1 hour(s) and 37 minute(s) [09:22:02] In 1 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1100) [09:24:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[2011-2013].codfw.wmnet} and A:lvs (T351069) [09:24:53] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [09:26:51] (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog: update tls_netstream_driver to use ossl [puppet] - 10https://gerrit.wikimedia.org/r/975861 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond) [09:27:13] I'll continue with the lvs updates later today (codfw/ulsfo/eqsin) done, (eqiad/esams/drmrs) to go [09:27:48] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [09:29:21] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::pyrra::filesystem: new Lift Wing pilot candidate [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [09:29:34] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: improve istio sli recording rule 
[puppet] - 10https://gerrit.wikimedia.org/r/974486 (owner: 10Elukey) [09:35:05] (03PS5) 10Tim Starling: Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) [09:35:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53664 and previous config saved to /var/cache/conftool/dbconfig/20231121-093526-arnaudb.json [09:35:36] (03CR) 10Tim Starling: [C: 03+2] Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [09:35:42] (03PS6) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [09:35:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53665 and previous config saved to /var/cache/conftool/dbconfig/20231121-093557-arnaudb.json [09:41:05] (03CR) 10Elukey: [C: 04-1] "Still not ok, but I'll keep working on it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [09:41:10] awight: just catching up now, did you report the error somewhere? 
[09:47:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: switch to use the traffic percentage split endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/975816 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [09:47:49] (03Merged) 10jenkins-bot: mobileapps: switch to use the traffic percentage split endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/975816 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [09:49:35] (03PS7) 10Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) [09:50:00] (03CR) 10Jbond: [C: 04-1] "this host is currently the live host for all systems running puppet7. Furthermore it is already running bookworm" [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [09:50:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53666 and previous config saved to /var/cache/conftool/dbconfig/20231121-095031-arnaudb.json [09:51:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53667 and previous config saved to /var/cache/conftool/dbconfig/20231121-095102-arnaudb.json [09:51:58] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:52:59] (03CR) 10Btullis: [C: 03+2] Remove the oozie integration from hue [puppet] - 10https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [09:53:12] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:00:14] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:00:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host 
gitlab-runner1002.eqiad.wmnet [10:01:36] (03PS1) 10Muehlenhoff: Switch gitlab-runner1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976155 (https://phabricator.wikimedia.org/T349619) [10:01:40] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:02:20] (03PS1) 10Majavah: prometheus: node_puppet_agent: improve debugging abilities [puppet] - 10https://gerrit.wikimedia.org/r/976156 [10:02:32] (03CR) 10Jbond: [C: 04-1] acme-chief: Remove acmechief2002 passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [10:02:45] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:02:47] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:02:49] (03CR) 10Majavah: [C: 03+1] prometheus: node_puppet_agent: improve debugging abilities [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah) [10:03:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] base: switch rsyslog tls_netstream_driver to ossl [puppet] - 10https://gerrit.wikimedia.org/r/975791 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond) [10:03:07] (03CR) 10Jbond: [V: 03+1 C: 03+2] centrallog: update tls_netstream_driver to use ossl [puppet] - 10https://gerrit.wikimedia.org/r/975861 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond) [10:03:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab-runner1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976155 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:03:30] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:03:33] (03CR) 10Jelto: [C: 03+1] Switch gitlab-runner1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976155 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:03:36] jbond: I'll puppet-merge your rsyslog patches along, ok? 
[10:05:01] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/601/console" [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah) [10:05:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53669 and previous config saved to /var/cache/conftool/dbconfig/20231121-100536-arnaudb.json [10:06:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53670 and previous config saved to /var/cache/conftool/dbconfig/20231121-100607-arnaudb.json [10:06:23] moritzm: yes please [10:07:33] ack, done [10:08:36] cheers [10:10:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host gitlab-runner1002.eqiad.wmnet [10:10:34] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:11:22] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:11:55] (03CR) 10Jbond: [V: 03+2 C: 03+2] Puppet_Internal_CA.pem: rename to Puppet5_Internal_CA.pem [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/975869 (https://phabricator.wikimedia.org/T351653) (owner: 10Jbond) [10:15:13] (03PS3) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) [10:16:03] (03CR) 10Hnowlan: [C: 03+1] cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry) [10:17:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah) [10:18:15] (03CR) 10Majavah: [V: 03+1 C: 03+2] prometheus: node_puppet_agent: improve 
debugging abilities (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976156 (owner: 10Majavah) [10:18:29] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [10:18:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10ops-monitoring-bot) Host rebooted by jelto@cumin1001 with reason: None [10:19:24] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [10:21:42] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:22:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host gerrit2002.wikimedia.org [10:23:28] (03PS1) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) [10:25:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet [10:26:27] (03CR) 10Arnaudb: [C: 03+1] dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo) [10:28:31] (03PS1) 10Muehlenhoff: Use a native package [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/976159 [10:28:37] kostajh: No I haven't done anything persistent--there's a "!-log" message and I pinged arnaudb who is on clinic duty [10:29:27] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Use a native package [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/976159 (owner: 10Muehlenhoff) [10:29:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/976159 (owner: 10Muehlenhoff) [10:29:40] let's maybe summon _joe_ awight, I think we need somebody 
with a k8s hat on :) [10:29:58] <_joe_> arnaudb: what's going on? [10:30:13] it seems that there are some issues on a deployment issued by awight [10:30:51] (03PS1) 10Muehlenhoff: Switch gerrit2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976160 (https://phabricator.wikimedia.org/T349619) [10:30:53] _joe_: Hi! yes the error is in the IRC history, "scap backport" was unable to connect to a k8s host and so I had to roll back the deployment to avoid being in an inconsistent state. [10:31:20] <_joe_> awight: at what time, sorry? [10:31:26] (03CR) 10Muehlenhoff: [C: 03+2] Switch gerrit2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976160 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:31:28] maybe we should create a phab task to handle this properly? [10:31:46] _joe_: root cause seems to be "kubernetes2041.codfw.wmnet port 22: No route to host" [10:32:16] (behind which I'm sure there's another deeper cause :-) ) [10:32:42] <_joe_> awight: ok, fwiw you shouldn't need to roll back if that happens [10:33:11] <_joe_> that k8s node is down apparently [10:33:16] <_joe_> but that's a non-fatal issue [10:33:43] _joe_: But if k8s hosts and legacy servers are inconsistent... how can I tell that the cluster is consistent, for example if that host suddenly starts up again or if it was just a network glitch... [10:33:55] <_joe_> awight: that's not an issue on k8s [10:33:57] Perhaps dead servers can be depooled [10:34:08] <_joe_> the host is down for k8s too so it won't be scheduled jobs [10:34:20] <_joe_> when it comes up again, if a pod is scheduled, it will pull the correct image [10:34:32] <_joe_> and yes ofc, we just didn't notice [10:34:56] So should the scap logic ignore such an error? 
[10:35:11] !log upload new wmf-certificates packages [10:35:11] <_joe_> it should report it but not suggest to rollback, yes [10:35:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host gerrit2002.wikimedia.org [10:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:31] <_joe_> awight: so if it automatically rolled back, please open a task about that [10:36:33] _joe_: interesting! So maybe at "warn" or "info" level. But this is very helpful information, thanks. We'll try the deployment again today. [10:36:49] <_joe_> awight: we're also pulling the node back up now ofc [10:36:51] _joe_: no, this was my noob reaction to seeing an error though [10:37:08] <_joe_> but please open a task, it should be clear the error is not fatal [10:38:10] Want to bless us sneaking a small deployment window now? Or is it better to wait for the official window... [10:40:11] <_joe_> jayme: when you've dealt with the dead node, can you give the go-ahead to awight ? [10:40:33] (03CR) 10Tim Starling: [C: 03+2] Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [10:40:42] I've cordoned it for now, so it should be out of the way [10:41:18] (03Merged) 10jenkins-bot: Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [10:42:01] (03CR) 10Btullis: [C: 03+1] wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah) [10:42:18] i think the list scap uses is pulled from puppetdb directly, so you will still see the same warning [10:43:27] ah, I probably did not have full context. 
So scap is failing because it can't connect to that node...hmpf [10:43:41] (03CR) 10Majavah: [C: 03+2] wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah) [10:43:47] (03PS2) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) [10:44:41] _joe_: Here's a skeletal bug report--it's missing the final text, I'm digging around to see where it landed. https://phabricator.wikimedia.org/T351701 [10:44:54] (03CR) 10Jcrespo: "This is a bit cleaner, stop harcoding the statistics (mysql) file, which was what caused the last issue to start with." [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo) [10:46:06] (03PS1) 10Ilias Sarantopoulos: ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 [10:48:02] (03Merged) 10jenkins-bot: wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah) [10:49:19] (03CR) 10Jcrespo: [C: 04-1] "This doesn't work: https://puppet-compiler.wmflabs.org/output/976158/602/" [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo) [10:50:41] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: gitlab_runner [10:51:37] (03PS2) 10Ilias Sarantopoulos: ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) [10:51:48] (03PS1) 10Muehlenhoff: Switch gitlab_runner to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976162 (https://phabricator.wikimedia.org/T349619) [10:55:19] (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab_runner to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976162 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:56:30] (03PS3) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) [10:56:46] (03CR) 10Jelto: [C: 03+1] "lgtm, dedicated hiera entry for gitlab-runner1002 in hieradata/hosts/gitlab-runner1002.yaml can be removed." [puppet] - 10https://gerrit.wikimedia.org/r/976162 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:57:03] (03CR) 10CI reject: [V: 04-1] dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo) [10:57:56] (03PS4) 10Jcrespo: dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1100) [11:00:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: gitlab_runner [11:01:35] (03CR) 10Jcrespo: "Looking good: https://puppet-compiler.wmflabs.org/output/976158/603/" [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo) [11:01:56] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update config for mysql backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/976158 (https://phabricator.wikimedia.org/T351617) (owner: 10Jcrespo) [11:02:10] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:02:16] (03PS1) 10Tim Starling: Revert "Enable LoginNotify seen subnets table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976176 [11:02:32] (03CR) 10Tim Starling: [C: 03+2] Revert "Enable LoginNotify seen subnets table" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/976176 (owner: 10Tim Starling) [11:02:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:03:18] (03Merged) 10jenkins-bot: Revert "Enable LoginNotify seen subnets table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976176 (owner: 10Tim Starling) [11:04:56] (03PS1) 10Tim Starling: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) [11:05:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mwlog2002.codfw.wmnet [11:06:08] (03PS1) 10Volans: sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 [11:06:10] (03PS1) 10Volans: sre.I/F cookbooks: use get_subset() [cookbooks] - 10https://gerrit.wikimedia.org/r/976164 [11:06:22] (03PS1) 10Muehlenhoff: Switch mwlog2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976165 (https://phabricator.wikimedia.org/T349619) [11:06:31] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch mwlog2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976165 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:09:03] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) @jbond thanks, that CR has fixed the sad services (and the openssl runes now work too). 
[11:09:12] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10jbond) 05Open→03Resolved I have rolled out a new wmf-certificates package which i believe has fixed this error. all swift se... [11:10:50] awight: taavi: _joe_: network link is still down on 2041 - this probably needs dcops interaction [11:11:02] jayme: No rush, thanks for the update! [11:13:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:13:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mwlog2002.codfw.wmnet [11:18:20] 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10JMeybohm) Hey DCOps, this looks suspiciously like a cable might have been pulled. Could you please take a look? 
[11:18:42] 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10JMeybohm) a:05JMeybohm→03Papaul [11:19:39] (03PS1) 10Muehlenhoff: Cleanup now obsolete Hiera entry, applied per role [puppet] - 10https://gerrit.wikimedia.org/r/976188 (https://phabricator.wikimedia.org/T349619) [11:20:10] (03CR) 10Volans: "For info about get_subset() see https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteHosts.get_subs" [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [11:20:34] !log depool ms-fe2014 to reimage with new envoy TLS setup T317616 [11:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:39] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [11:20:49] (03CR) 10MVernon: [C: 03+2] swift: migrate one node to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:21:11] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:21:36] (03CR) 10Muehlenhoff: [C: 03+2] Cleanup now obsolete Hiera entry, applied per role [puppet] - 10https://gerrit.wikimedia.org/r/976188 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:21:37] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:21:42] (03CR) 10Volans: "For info about get_subset() see https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteHosts.get_subs" [cookbooks] - 10https://gerrit.wikimedia.org/r/976164 (owner: 10Volans) [11:22:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2014.codfw.wmnet with OS bullseye [11:22:34] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye [11:23:21] (HelmReleaseBadStatus) resolved: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:24:10] <_joe_> jayme: we can just depool it and set to status=inactive for now [11:24:16] <_joe_> sorry I was in a call [11:28:17] _joe_: set/pooled=inactive you mean? 
[11:29:19] (03PS1) 10Filippo Giunchedi: centralserver: remove tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) [11:30:57] (03PS10) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) [11:31:00] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/604/con" [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi) [11:31:10] (03PS1) 10Hnowlan: service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) [11:32:08] (03CR) 10Filippo Giunchedi: [V: 03+1] "Will manually reset-failed the timer post-merge" [puppet] - 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi) [11:32:37] 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10JMeybohm) [11:34:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/976164 (owner: 10Volans) [11:34:49] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:35:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host titan2002.codfw.wmnet [11:35:28] (03PS2) 10Btullis: spark: add support for spark-history on the spark image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896363 (https://phabricator.wikimedia.org/T330176) (owner: 10Nicolas Fraison) [11:35:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] 
- 10https://gerrit.wikimedia.org/r/976190 (https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi) [11:36:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] sanitarium_multiinstance: over private_wiki and private_tables vars to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:37:11] (03PS1) 10Muehlenhoff: Switch titan2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976193 (https://phabricator.wikimedia.org/T349619) [11:37:24] moritzm: i merged your cleanup change [11:37:33] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes2041.codfw.wmnet with reason: NIC 1 Port 1 network link is down [11:37:58] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes2041.codfw.wmnet with reason: NIC 1 Port 1 network link is down [11:38:05] 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=771e4f70-9348-49e4-9f8a-1228c0c3d3dc) set by jayme@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their ser... [11:39:11] (03CR) 10Muehlenhoff: [C: 03+2] Switch titan2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976193 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:39:37] jbond: ack, thx [11:42:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host titan2002.codfw.wmnet [11:44:59] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The probe is correct, the problem is we don't have the realserver IP on the backends :)" [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:47:03] <_joe_> jayme: sorry I missed your message, yes [11:47:32] _joe_: np. 
was more meant as a clarification :) [11:48:49] (03PS13) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [11:48:51] (03PS8) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [11:48:53] (03PS13) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [11:48:55] (03PS8) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [11:48:57] (03PS8) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [11:48:59] (03PS13) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [11:49:01] (03PS13) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [11:49:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:51:25] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::cinder_backups [11:52:06] (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:52:08] (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:52:52] 
(03PS1) 10Jbond: wmcs::openstack::eqiad1::cinder_backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976197 (https://phabricator.wikimedia.org/T349619) [11:53:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2014.codfw.wmnet with reason: host reimage [11:53:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [11:53:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [11:54:11] (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:54:13] (03CR) 10CI reject: [V: 04-1] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:55:01] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::cinder_backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976197 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [11:56:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2014.codfw.wmnet with reason: host reimage [11:56:10] (03CR) 10Jbond: [C: 03+2] realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:59:14] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::cinder_backups [12:00:15] (03PS1) 10Jbond: docker::reports: change ownership of base rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/976198 [12:00:21] (03PS2) 10Hnowlan: service, kubernetes: 
mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) [12:01:40] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::control [12:03:11] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [12:03:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... [12:03:28] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [12:03:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [12:06:08] (03CR) 10Hnowlan: service, kubernetes: mw-jobrunner fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:09:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2014.codfw.wmnet with OS bullseye [12:09:28] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye completed: - ms-fe2014 (**PASS**) - Downtimed on Ici... 
[12:10:25] (03CR) 10Muehlenhoff: [C: 03+2] mediawiki::packages: Clean up absented packages [puppet] - 10https://gerrit.wikimedia.org/r/975451 (owner: 10Muehlenhoff) [12:11:55] (03PS9) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [12:11:57] (03PS14) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:11:59] (03PS9) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [12:12:01] (03PS9) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [12:12:03] (03PS14) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:12:05] (03PS14) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:12:53] (03PS1) 10Jbond: wmcs::openstack::eqiad1::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976201 (https://phabricator.wikimedia.org/T349619) [12:13:27] (03PS2) 10Jbond: wmcs::openstack::eqiad1::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976201 (https://phabricator.wikimedia.org/T349619) [12:14:32] (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:14:39] jayme: _joe_: Shall I go ahead and deploy nonetheless, or still better to wait for the official depooling? 
[12:14:55] (03PS1) 10Muehlenhoff: mediawiki::packages: Drop python-pil [puppet] - 10https://gerrit.wikimedia.org/r/976202 (https://phabricator.wikimedia.org/T268468) [12:14:57] (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:14:59] (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:15:06] awight: host is depooled and should no longer be used by scap, you may go ahead [12:15:18] (03PS10) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [12:15:20] (03PS15) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:15:22] (03PS10) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [12:15:24] (03PS10) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [12:15:24] <_joe_> even if it still is in the distribution list, just ignore the error [12:15:26] (03PS15) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:15:28] (03PS15) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:17:51] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976201 
(https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:18:32] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [12:21:29] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [12:22:38] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::control [12:24:52] (03PS11) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [12:24:54] (03PS16) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:24:56] (03PS11) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [12:24:57] jayme: ty! 
[12:24:58] (03PS11) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [12:25:00] (03PS16) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:25:02] (03PS16) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:25:41] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::net [12:26:07] (03PS1) 10Kevin Bazira: ml-services: add article-descriptions isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) [12:26:58] (03PS12) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [12:27:00] (03PS17) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:27:02] (03PS12) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [12:27:04] (03PS12) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [12:27:06] (03PS17) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:27:06] !log awight@deploy2002 Started scap: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] [12:27:08] (03PS17) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:27:11] T282999: Enable 
Reference Previews on all wikis using the Popups extension, on Nov 21 - https://phabricator.wikimedia.org/T282999 [12:28:30] !log awight@deploy2002 wmde-fisch and awight: Backport for [[gerrit:971882|Enable Reference Previews on all wikis (T282999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:29:09] (03PS1) 10Jbond: wmcs::openstack::eqiad1::net: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976203 (https://phabricator.wikimedia.org/T349619) [12:29:19] WMDE-Fisch: our Reference Previews config is on the test servers [12:29:35] awight: looking at it [12:31:05] !log awight@deploy2002 Sync cancelled. [12:31:35] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::net: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976203 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:32:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [12:32:58] (03PS1) 10Awight: Revert "Revert "Enable Reference Previews on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 [12:33:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 (owner: 10Awight) [12:33:19] (03CR) 10WMDE-Fisch: [C: 03+1] Revert "Revert "Enable Reference Previews on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 (owner: 10Awight) [12:34:01] (03Merged) 10jenkins-bot: Revert "Revert "Enable Reference Previews on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976182 (owner: 10Awight) [12:34:15] !log awight@deploy2002 Started scap: Backport for [[gerrit:976182|Revert "Revert "Enable Reference Previews on all wikis""]] [12:35:35] !log awight@deploy2002 awight: Backport for [[gerrit:976182|Revert "Revert "Enable Reference Previews on all wikis""]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [12:35:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::net [12:36:45] !log awight@deploy2002 awight: Continuing with sync [12:38:31] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye [12:38:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed: - cp1115 (**PASS**) - Remo... [12:39:24] (03PS13) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [12:39:26] (03PS18) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:39:28] (03PS13) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [12:39:30] (03PS13) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [12:39:32] (03PS18) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:39:34] (03PS18) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:40:34] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::rabbitmq [12:41:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/608/console" [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:42:18] (03PS1) 10Jbond: wmcs::openstack::eqiad1::rabbitmq: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976205 (https://phabricator.wikimedia.org/T349619) [12:42:28] !log awight@deploy2002 Finished scap: Backport for [[gerrit:976182|Revert "Revert "Enable Reference Previews on all wikis""]] (duration: 08m 12s) [12:42:42] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::rabbitmq: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976205 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:43:26] jayme: Deployment looks successful, but FWIW the same errors appeared. [12:44:50] ack, thanks [12:45:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/609/con" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:49:41] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::rabbitmq [12:52:20] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::services [12:52:35] (03PS14) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [12:52:37] (03PS19) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:52:39] (03PS14) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [12:52:41] 
(03PS14) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [12:52:43] (03PS19) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:52:45] (03PS19) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:52:47] (03PS1) 10Hnowlan: api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) [12:54:01] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [12:54:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/610/console" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:54:34] (03PS1) 10Jbond: wmcs::openstack::eqiad1::services: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976207 (https://phabricator.wikimedia.org/T349619) [12:56:37] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker11 - jclark@cumin1001" [12:57:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker11 - jclark@cumin1001" [12:57:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:19] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::services: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976207 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:59:42] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] centralserver: remove tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/976190 
(https://phabricator.wikimedia.org/T324623) (owner: 10Filippo Giunchedi) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1300) [13:00:32] jbond: feel free to merge my patch too if it came up [13:01:37] (03CR) 10Klausman: [C: 03+1] ml-services: add article-descriptions isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [13:05:58] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::services [13:06:30] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [13:13:21] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [13:14:21] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::virt [13:14:28] (03PS1) 10Ssingh: constants: update ns2 IP address [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) [13:14:50] (ProbeDown) firing: (6) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:17] (03PS1) 10DCausse: cirrus-streaming-updater: bump to v20231121104610-0b4bfdd [deployment-charts] - 10https://gerrit.wikimedia.org/r/976211 [13:15:57] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for spotting it" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:17:08] (03PS1) 10Jbond: wmcs::openstack::eqiad1::virt: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976214 (https://phabricator.wikimedia.org/T349619) [13:17:13] RECOVERY - snapshot of s6 in eqiad on 
backupmon1001 is OK: Last snapshot for s6 at eqiad (db1225) taken on 2023-11-21 12:20:50 (541 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:17:29] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::virt: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976214 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:18:35] (03CR) 10CI reject: [V: 04-1] constants: update ns2 IP address [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:20:10] (03CR) 10Ssingh: "CI error is: `wmflib/requests.py:6: error: Library stubs not installed for "requests.packages.urllib3.util.retry" [import-untyped]`" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:20:41] RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2097) taken on 2023-11-21 12:27:37 (578 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:20:41] (03PS1) 10Hnowlan: mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) [13:21:39] (03CR) 10Volans: [C: 03+1] "don't worry about CI failing, I'll look into it, types-requests is already part of the dependencies and is failing only on py38/39" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:22:07] (03CR) 10Ssingh: constants: update ns2 IP address (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:22:13] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::virt [13:22:24] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review Tobias :)" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [13:22:51] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::virt_ceph [13:23:14] (03Merged) 10jenkins-bot: ml-services: add article-descriptions isvc to experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [13:23:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) [13:23:46] (03CR) 10Volans: [V: 03+2 C: 03+2] "merging to bypass ci failure, I'll look into it later" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976210 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:23:48] 10SRE, 10Observability-Logging: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) @fgiunchedi Everything is using openssl now, do you still see the errors? 
[13:23:58] 10SRE, 10Observability-Logging: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) [13:24:04] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10jbond) 05Open→03Resolved a:03jbond All systems have now been migrated to ossl [13:24:12] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:24:43] (03PS1) 10Jbond: wmcs::openstack::eqiad1::virt_ceph: migrat to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976217 (https://phabricator.wikimedia.org/T349619) [13:25:00] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::eqiad1::virt_ceph: migrat to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976217 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:27:35] (03CR) 10Ladsgroup: [C: 03+1] Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [13:28:34] (03PS15) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [13:28:36] (03PS20) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [13:28:38] (03PS15) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [13:28:40] (03PS15) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [13:28:42] (03PS20) 10Jbond: realm: 
drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [13:28:44] (03PS20) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [13:30:06] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:09] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:10] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:30:11] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:13] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED [13:31:49] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [13:32:22] !log repool ms-fe2014 with new envoy TLS setup T317616 [13:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:27] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [13:32:43] (03CR) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 
(https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:32:52] 10SRE, 10Cloud-VPS, 10observability: ossl rsyslog post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) [13:33:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [13:34:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/613/con" [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:35:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [13:35:40] 10SRE, 10Cloud-VPS, 10observability: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) [13:36:02] (03CR) 10Elukey: [C: 04-2] "Going to wait a little on this one, the current version seems to work fine, I need to understand why :D" [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:38:46] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::virt_ceph [13:38:47] (03PS3) 10Elukey: profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) [13:38:57] (03CR) 10DCausse: [C: 03+2] cirrus-streaming-updater: bump to v20231121104610-0b4bfdd [deployment-charts] - 10https://gerrit.wikimedia.org/r/976211 (owner: 10DCausse) [13:39:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Jelto) @hashar `gerrit2002` was migrated to puppet7. I restarted gerrit and apache processes and the instance looks fine so far. 
Could you double check `ger... [13:39:16] (03CR) 10Elukey: "The change is the same, I just re-added the Pyrra recording rule since it seems to work." [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:39:59] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump to v20231121104610-0b4bfdd [deployment-charts] - 10https://gerrit.wikimedia.org/r/976211 (owner: 10DCausse) [13:41:38] (03CR) 10Elukey: ml-services: add article-descriptions isvc to experimental namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975929 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [13:43:12] (03PS1) 10Giuseppe Lavagetto: mobileapps: switch 15% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976218 (https://phabricator.wikimedia.org/T350846) [13:43:13] (03PS1) 10Giuseppe Lavagetto: mobileapps: 20% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976219 (https://phabricator.wikimedia.org/T350846) [13:43:15] (03PS1) 10Giuseppe Lavagetto: mobileapps: 30% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976220 (https://phabricator.wikimedia.org/T350846) [13:43:17] (03PS1) 10Giuseppe Lavagetto: mobileapps: 45% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976221 (https://phabricator.wikimedia.org/T350846) [13:43:19] (03PS1) 10Giuseppe Lavagetto: mobileapps: 60% to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/976222 (https://phabricator.wikimedia.org/T350846) [13:43:22] (03PS1) 10Giuseppe Lavagetto: mobileapps: 75% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976223 (https://phabricator.wikimedia.org/T350846) [13:43:23] (03PS1) 10Giuseppe Lavagetto: mobileapps: 90% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976224 (https://phabricator.wikimedia.org/T350846) [13:43:56] 10SRE, 10Cloud-VPS, 10observability: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) On 
the rsyslog side these are the errors: ` Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0] Nov 21 13:42... [13:45:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [13:46:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I didn't check the IPs tbh 😊" [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [13:49:49] !log test upgrade rsyslog on centrallog2002 - T351710 [13:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:56] T351710: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 [13:51:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T348183)', diff saved to https://phabricator.wikimedia.org/P53672 and previous config saved to /var/cache/conftool/dbconfig/20231121-135059-arnaudb.json [13:51:28] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:51:38] jouncebot: nowandnext [13:51:38] For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1300) [13:51:38] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1400) [13:52:27] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://wikitech.wikimedia.org/wiki/Logs [13:52:37] that's me ^ [13:56:59] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:57:44] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:58:08] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1400). [14:00:04] xSavitar and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:06] o/ [14:00:15] xSavitar: are you going to self-service? [14:00:34] o/ [14:00:44] otherwise I can deploy as well [14:01:00] Lucas_WMDE, you're here already :) [14:01:10] You can go ahead [14:01:41] I’m not sure what you mean ^^ go ahead with your change or with my maintenance script? [14:01:57] I would wait with my script until yours is done, I have no idea how long it’ll take [14:02:08] I mean my config patch [14:02:18] Mine doesn't need testing [14:02:21] ok [14:02:28] (03PS11) 10Lucas Werkmeister (WMDE): mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [14:02:34] But once it's live, I'll signal ServiceOps to make use of it, it's their thing. 
[14:02:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [14:02:49] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://wikitech.wikimedia.org/wiki/Logs [14:03:01] regarding the change in PS9..10 – this means the default server won’t be used if the env var is set to empty [14:03:08] Yes [14:03:25] (I had first looked at PS9 where it was different, which got me thinking a bit about whether eliminating that temporary variable was worth it) [14:03:26] ok ^^ [14:03:33] (03CR) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:04:11] (03Merged) 10jenkins-bot: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [14:04:24] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] [14:04:37] T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 [14:05:07] Lucas_WMDE, yeah the one linear seems nicer and straight to the point. So right now, we'll keep using the default until the custom env variable is set. 
[14:05:24] and thank you very much for deploying [14:05:26] kubernetes2041 is still down it seems [14:05:30] effie ^^ [14:05:35] (03CR) 10Ottomata: [C: 03+2] changeprop - fixes for beta values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata) [14:05:37] (known, T351704) [14:05:38] T351704: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 [14:05:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and d3r1ck01: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:05:46] Roger that! [14:05:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and d3r1ck01: Continuing with sync [14:05:56] 10SRE, 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T351663 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cable on for psu1 on both ends. alert cleared. [14:06:01] just mentioning it since it showed up in the scap output [14:06:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [14:06:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P53673 and previous config saved to /var/cache/conftool/dbconfig/20231121-140606-arnaudb.json [14:06:08] Okay! [14:06:11] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2028-11-11 12:37:08 +0000 (expires in 1816 days) https://wikitech.wikimedia.org/wiki/Logs [14:06:30] Lucas_WMDE read a bit about the script you want to run and all the magic happening there. 
Too big for my tiny brain :D [14:06:38] !log updating pybal to 1.5.14 on drmrs - T351069 [14:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:44] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [14:07:17] !log revert rsyslog upgrade on centrallog2002 - T351710 [14:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:30] T351710: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 [14:07:40] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs6003.drmrs.wmnet} and A:lvs (T351069) [14:07:42] :D [14:07:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs6003.drmrs.wmnet} and A:lvs (T351069) [14:08:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [14:08:19] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[6001-6002].drmrs.wmnet} and A:lvs (T351069) [14:08:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED [14:09:41] RECOVERY - Host kubernetes2041 is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [14:10:51] (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) [14:10:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[6001-6002].drmrs.wmnet} and A:lvs (T351069) [14:11:04] Lucas_WMDE, thanks for deploying [14:11:12] np :) [14:11:15] it’s almost done [14:11:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 177, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:34] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] (duration: 07m 09s) [14:11:38] T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 [14:11:58] alright, I’ll do the maintenance script then [14:12:07] (03CR) 10Elukey: [C: 03+1] sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [14:12:09] uhm [14:12:17] although https://orchestrator.wikimedia.org/web/cluster/alias/s8 says two servers aren’t replicating [14:12:47] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [14:13:41] 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10Jhancock.wm) a:05Papaul→03Jhancock.wm the network cable was still attached but loose. I reseated it and pulled on it to make sire it wouldn't come loose again. it did... [14:13:55] any DBAs around? (maybe Amir1 or jynus?) Orchestrator says two servers in s8 aren’t replicating (ca. 3h lag); known issue? [14:14:07] (I’m guessing I shouldn’t run my maintenance script on s8 while that’s unclear) [14:14:34] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351683 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. 
should go away when the row is converted to spine/leaf [14:15:12] (03CR) 10Jbond: [C: 03+1] Don't alert for v6 AAAA for logstash and kafla-logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi) [14:18:03] the two servers in question aren’t shown on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-site=eqiad&var-section=s8&from=now-3h&to=now&refresh=1m at all, no idea what that means… [14:18:04] (03CR) 10Elukey: Don't alert for v6 AAAA for logstash and kafla-logging (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi) [14:18:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [14:19:46] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED [14:19:47] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED [14:20:05] (03PS2) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) [14:20:56] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) 
(owner: 10MVernon) [14:21:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P53674 and previous config saved to /var/cache/conftool/dbconfig/20231121-142112-arnaudb.json [14:22:54] Lucas_WMDE, not sure but could it be that the servers in question are down? I don't see any task on phab related to no replication happening there. [14:23:23] A DBA would have the answer [14:23:50] if I read Orchestrator correctly, they’re up (“last seen 3s ago”), but not replicating for whatever reason [14:23:55] I don’t see any recent SAL entries for them [14:24:20] aha, one is showing up on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-site=eqiad&var-section=s8&from=now-3h&to=now&refresh=1m now at least [14:24:28] (db1171) [14:24:51] ok, orchestrator now says about db1171 that it’s replicating but has lag [14:24:54] so I guess it’s catching up? [14:25:12] Probably catching up. [14:25:14] that still leaves one other server not replicating for unknown-to-me reasons [14:25:22] I see db1171 on grafana too [14:25:39] (db2098 is the other server I’m looking at; I guess posting the number here can’t hurt) [14:25:54] :) [14:25:58] yeah db1171 is going down on grafana pretty quickly now [14:26:01] That's what I was looking at too [14:26:42] Maybe once that catches up, then the other will pick up from there :) [14:27:35] orchestrator says 20mins 7s (as at now) [14:27:39] That was quick [14:27:47] 1m48s for me [14:27:54] done [14:27:56] that’s remarkably quick indeed [14:28:50] The other is catching up to [14:28:52] *too [14:29:15] oh nice [14:29:38] though I still don’t see it in grafana [14:29:41] 2hrs 39mins from my view [14:29:42] maybe that’s just a bit behind [14:29:55] Yeah, maybe grafana inherited the lag :D [14:29:57] the replication lag in orhcestrator also seemed to be ahead of grafana by a minute or so [14:30:12] Makes sense [14:30:30] !log jclark@cumin1001 START - 
Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [14:30:39] Lucas_WMDE: is there a specific reason why you're looking at the dashboard? [14:31:41] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1015.eqiad.wmnet [14:31:41] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1015.eqiad.wmnet [14:31:51] taavi: I want to run a maintenance script on s8 that’ll do a bunch of database writes [14:31:59] so I want s8 to be healthy first [14:32:00] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1015.eqiad.wmnet with OS bullseye [14:32:34] and Grafana initially seemed like the more useful resource, although at the moment it seems like Orchestrator might be better [14:32:51] !log updating pybal to 1.5.14 on esams - T351069 [14:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:57] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [14:33:08] (03PS1) 10Filippo Giunchedi: centralserver: remove icinga tls listener check [puppet] - 10https://gerrit.wikimedia.org/r/976234 (https://phabricator.wikimedia.org/T351710) [14:33:22] oh 🤦 [14:33:28] it would of course help if I selected codfw in grafana [14:33:37] given that db2098 is a codfw server [14:33:37] Lucas_WMDE, I think the other is done [14:33:45] it shows up at https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-site=codfw&var-section=s8&from=now-3h&to=now&refresh=1m just fine [14:33:50] Everything looks healthy from orchestrator's pov [14:33:58] and db2098 is a backup source so it occasionally being replagged is normal [14:34:22] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs3010.esams.wmnet} and A:lvs (T351069) [14:34:32] is there any way I could’ve known that? [14:34:32] So it seems first server is in eqiad and the second is in codfw? 
wow :D [14:34:38] plus dbs can get schema changes done which causes lag, etc, I wouldn't worry about a couple of hosts on an unfamiliar dashboard [14:34:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs3010.esams.wmnet} and A:lvs (T351069) [14:35:10] taavi :D [14:35:15] taavi: I’d expect to see schema changes in the SAL though [14:35:29] like https://sal.toolforge.org/log/DbiVzosBGiVuUzOdZubS last week [14:35:43] anyway… it sounds like I can run my maintenance script now [14:35:50] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[3008-3009].esams.wmnet} and A:lvs (T351069) [14:35:56] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [14:36:19] !log START [in tmux] lucaswerkmeister-wmde@mwmaint2002:~$ mwscript Wikibase.Lexeme.Maintenance.FixPagePropsSortkey wikidatawiki --batch-size=1000 # T350224 [14:36:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T348183)', diff saved to https://phabricator.wikimedia.org/P53675 and previous config saved to /var/cache/conftool/dbconfig/20231121-143619-arnaudb.json [14:36:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [14:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:23] T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224 [14:36:28] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:36:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: 
Maintenance [14:36:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T348183)', diff saved to https://phabricator.wikimedia.org/P53676 and previous config saved to /var/cache/conftool/dbconfig/20231121-143640-arnaudb.json [14:36:43] (03Merged) 10jenkins-bot: mw-jobrunner: use proxy in socket definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/976216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [14:36:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED [14:37:00] (03CR) 10Arnaudb: [C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [14:37:34] (03PS1) 10Filippo Giunchedi: centralserver: probe syslog receiver with client auth [puppet] - 10https://gerrit.wikimedia.org/r/976236 (https://phabricator.wikimedia.org/T351710) [14:37:37] ok lag is going up a bit (1s, 3s for some clouddb), nothing bad yet I’d say [14:37:56] (also the script waits for replication of course. 
I’m just double-checking to be extra safe ^^) [14:38:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:38:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[3008-3009].esams.wmnet} and A:lvs (T351069) [14:38:26] yeah I think this is looking healthy so far… might even finish before the window is over [14:38:32] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [14:39:05] (03CR) 10Filippo Giunchedi: [C: 03+2] centralserver: remove icinga tls listener check [puppet] - 10https://gerrit.wikimedia.org/r/976234 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [14:39:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1112.eqiad.wmnet [14:39:45] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1112.eqiad.wmnet [14:40:02] (03CR) 10Filippo Giunchedi: [C: 03+2] centralserver: probe syslog receiver with client auth [puppet] - 10https://gerrit.wikimedia.org/r/976236 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [14:40:24] Lucas_WMDE, not that bad. 
Averagely 1s lag [14:41:06] (03PS1) 10Eevans: install_server: do fully unattended aqs installs [puppet] - 10https://gerrit.wikimedia.org/r/976238 (https://phabricator.wikimedia.org/T347738) [14:41:24] yeah, it’s staying stable between 0 and 1 s [14:41:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [14:41:56] !log swapped cp1112 <-> cp1087 (T349244) [14:42:00] ok, it’s halfway done already \o/ [14:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:06] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [14:42:14] \o/ [14:43:05] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1113.eqiad.wmnet [14:43:06] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1113.eqiad.wmnet [14:43:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:43:14] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1015.eqiad.wmnet with reason: host reimage [14:43:42] heh, you can definitely see the rows written on https://grafana.wikimedia.org/d/000000278/mysql-aggregated?from=now-1h&to=now go up [14:43:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED [14:44:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED [14:44:01] oh my :D [14:44:06] Lucas_WMDE: are you still deploying? 
[14:44:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:44:14] kostajh: still running a maintenance script [14:44:19] !log swapped cp1113 <-> cp1088 (T349244) [14:44:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:27] should be done within 10 minutes I think [14:44:28] ok [14:44:35] but it probably doesn’t have to block anything else, strictly speaking [14:44:39] I have a patch from the morning window that didn't get through [14:44:47] s8 says > 66k wr/s :D, insane numbers [14:44:49] it's beta only, so I'd like to sync it possible [14:44:52] * Lucas_WMDE looks [14:44:58] https://gerrit.wikimedia.org/r/c/975270/ [14:45:05] yeah I think that’s fine [14:45:10] should I `scap backport` it? [14:45:13] oh! [14:45:19] Lucas_WMDE: yes please. 
[14:45:27] !log T350224 maintenance script finished (8m46s real time) [14:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:32] T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224 [14:45:47] (03PS2) 10Lucas Werkmeister (WMDE): [betalabs] ReportIncident: Relax rate limiting for reportincident action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan) [14:45:49] Lucas_WMDE: I'll move it into this calendar block on the Deployment page [14:45:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1015.eqiad.wmnet with reason: host reimage [14:45:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan) [14:45:57] ok, thanks [14:46:07] (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:26] (03CR) 10Eevans: [C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [14:46:33] <_joe_> !incidents [14:46:33] 4274 (UNACKED) [3x] ProbeDown sre (probes/service eqiad) [14:46:38] wheeeeee [14:46:41] <_joe_> !ack 4274 [14:46:42] 4274 (ACKED) [3x] ProbeDown sre (probes/service eqiad) [14:46:49] (03Merged) 10jenkins-bot: [betalabs] ReportIncident: Relax rate limiting for reportincident action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan) [14:47:03] <_joe_> the failing 
probe is eqiad/upload [14:47:10] 10SRE, 10Observability-Logging: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10fgiunchedi) Yes I still see the errors: ` Nov 21, 2023 @ 14:45:47.621 prometheus1005 target=[2620:0:861:102:10:64... [14:47:31] hello [14:47:38] <_joe_> and specifically ipv6? [14:47:40] kostajh: pulled to deploy2002, should show up in beta soon [14:47:41] * Lucas_WMDE done [14:48:00] Lucas_WMDE: thank you [14:48:18] <_joe_> sukhe: do you see something in the eqiad upload dashboards that would justify the probe failure? [14:48:22] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:27] _joe_: not only v6 afaics, also v4 [14:48:35] I'm looking at alerts.w.o [14:49:00] <_joe_> godog: oh ok I was looking at prometheus [14:49:00] I've just swapped a cp host in eqiad for upload but should have no impact [14:49:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:49:13] I am wondering if the recent cp host swap should have had something to do with it, but unlikely [14:49:15] we had a nice spike on eqiad [14:49:16] looking [14:49:17] <_joe_> fabfur: uh I'd say it did [14:49:24] <_joe_> :P [14:49:27] https://grafana.wikimedia.org/goto/pmHeCXIIz?orgId=1 [14:49:36] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: cluster::cloud_management [14:49:56] <_joe_> ah so the lbs [14:50:51] <_joe_> I was looking at 
https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?orgId=1&var-site=eqiad&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 [14:50:58] <_joe_> and nothing really stood out immediately [14:51:17] if it was the above, then the spike should have been obvious [14:51:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:51:30] acked [14:51:55] I can revert the change anyway [14:52:03] <_joe_> fabfur: nah [14:52:03] cp1113 - cp1088 [14:52:04] fabfur: no [14:52:06] (03PS1) 10Jbond: cluster::cloud_management: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976239 (https://phabricator.wikimedia.org/T349619) [14:52:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:52:16] !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [14:52:16] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [14:52:29] !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:52:29] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:52:50] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [14:52:50] !log hnowlan@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [14:53:01] !log hnowlan@deploy2002 helmfile [codfw] [canary] DONE 
helmfile.d/services/mw-jobrunner : sync [14:53:01] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:53:12] (03CR) 10Jbond: [C: 03+2] cluster::cloud_management: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976239 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:53:24] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:30] (03PS3) 10Hnowlan: service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) [14:55:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975913 (https://phabricator.wikimedia.org/T349875) (owner: 10Eevans) [14:56:18] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:57:06] (03CR) 10JHathaway: [C: 03+1] "very nice, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:57:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:57:15] (03PS1) 10Elukey: Move kafka-main1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976241 (https://phabricator.wikimedia.org/T349619) [14:57:23] (03CR) 10Jbond: [C: 03+1] 
install_server: do fully unattended aqs installs [puppet] - 10https://gerrit.wikimedia.org/r/976238 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [14:57:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cluster::cloud_management [14:58:43] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:00:03] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dumps::distribution::server [15:00:42] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:00:51] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10Vgutierrez) @fgiunchedi seems like a mismatch on configured curves between clients and servers, could I suggest providing a more detailed TLS configuration for both rsy... 
[15:01:17] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:01:52] (03PS1) 10Jbond: dumps::distribution::server: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976243 (https://phabricator.wikimedia.org/T349619) [15:02:03] !incidents [15:02:03] 4274 (ACKED) [3x] ProbeDown sre (probes/service eqiad) [15:02:03] 4275 (ACKED) NELHigh sre (tcp.timed_out) [15:02:42] (03CR) 10Jbond: [C: 03+2] dumps::distribution::server: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976243 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:02:54] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:04:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1015.eqiad.wmnet with OS bullseye [15:04:53] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1015.eqiad.wmnet [15:04:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1015.eqiad.wmnet [15:06:17] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:06:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::distribution::server [15:07:15] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::db::wikireplicas::web_multiinstance [15:07:52] (03CR) 10Eevans: [C: 03+2] install_server: do fully unattended aqs installs [puppet] - 
10https://gerrit.wikimedia.org/r/976238 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [15:09:10] (03PS1) 10Jbond: wmcs::db::wikireplicas::web_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976246 (https://phabricator.wikimedia.org/T349619) [15:09:44] (03CR) 10Jbond: [C: 03+2] wmcs::db::wikireplicas::web_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976246 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:12:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye [15:13:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::db::wikireplicas::web_multiinstance [15:13:31] (03PS1) 10Ssingh: depool eqiad for upload-addrs [dns] - 10https://gerrit.wikimedia.org/r/976247 [15:14:57] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::db::wikireplicas::analytics_multiinstance [15:15:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [15:18:23] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:57] (03PS1) 10Jbond: wmcs::db::wikireplicas::analytics_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976250 (https://phabricator.wikimedia.org/T349619) [15:19:40] (03PS1) 10Ssingh: sites.yaml: prepend_as_out for eqiad/eqord [homer/public] - 10https://gerrit.wikimedia.org/r/976251 [15:20:19] (03CR) 10Jbond: [C: 03+2] wmcs::db::wikireplicas::analytics_multiinstance: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976250 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:21:07] (ProbeDown) resolved: (3) Service upload-https:443 has failed probes (http_upload-https_ip6) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:13] <_joe_> oh heh [15:21:15] <_joe_> no need [15:21:15] !log depooled cp1113 [15:21:16] ha [15:21:17] (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:17] (03Abandoned) 10Ssingh: sites.yaml: prepend_as_out for eqiad/eqord [homer/public] - 10https://gerrit.wikimedia.org/r/976251 (owner: 10Ssingh) [15:22:38] (03Abandoned) 10Ssingh: depool eqiad for upload-addrs [dns] - 10https://gerrit.wikimedia.org/r/976247 (owner: 10Ssingh) [15:23:45] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage [15:24:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::db::wikireplicas::analytics_multiinstance [15:24:53] (03CR) 10Jbond: "PCC: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971476/19" [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:25:16] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::cloudlb [15:26:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage [15:27:24] (03PS1) 10Jbond: wmcs::cloudlb: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976253 (https://phabricator.wikimedia.org/T349619) [15:27:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [15:29:05] (03CR) 10Jbond: [C: 03+2] 
wmcs::cloudlb: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976253 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:32:51] RECOVERY - snapshot of s1 in eqiad on backupmon1001 is OK: Last snapshot for s1 at eqiad (db1140) taken on 2023-11-21 13:43:40 (1220 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:33:57] (03PS3) 10Ssingh: P:dns::auth::update: add support for setting ferm rules via confd [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) [15:33:59] (03PS1) 10Ssingh: P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) [15:34:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::cloudlb [15:34:45] RECOVERY - snapshot of s1 in codfw on backupmon1001 is OK: Last snapshot for s1 at codfw (db2141) taken on 2023-11-21 13:38:03 (1205 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:35:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:37:01] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::cloudgw [15:38:12] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:38:36] (03PS1) 10Jbond: wmcs::cloudgw: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976256 (https://phabricator.wikimedia.org/T349619) [15:39:40] (03CR) 10Jbond: [C: 03+2] wmcs::cloudgw: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976256 (https://phabricator.wikimedia.org/T349619) (owner: 
10Jbond) [15:43:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::cloudgw [15:44:28] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: insetup::wmcs [15:46:29] (03PS1) 10Jbond: insetup::wmcs: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976258 (https://phabricator.wikimedia.org/T349619) [15:46:45] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:03] !log repooled cp1088 [15:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:28] (03CR) 10Jbond: [C: 03+2] insetup::wmcs: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976258 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:48:01] Lucas_WMDE: backups are running today by day [15:48:24] ok [15:48:29] and replication is stopped while the backup runs? 
[15:48:32] those are not mediawiki servers, they are backup source dbs [15:48:40] on some hosts yes, to speed up the backup [15:49:13] they would alert otherwise; if they are not alerting through icinga, it means stopping replication there is a normal operation [15:50:12] alright, thanks [15:50:53] please note that mw is only like 2/3 of dbs; many dbs are depooled or used for other functions (wikireplicas, analytics, backups, even of s* sections) [15:51:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::wmcs [15:53:23] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:25] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:53:51] RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1150) taken on 2023-11-21 13:50:42 (1749 GiB, -0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:55:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [15:55:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [15:55:56] (03CR) 10Jbond: [C: 03+2] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:55:56] (03CR) 10Jbond: [C: 03+2] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 
(https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:56:02] (03CR) 10Jbond: [C: 03+2] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:56:05] (03CR) 10Jbond: [C: 03+2] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:56:14] (03CR) 10Jbond: [C: 03+2] realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:57:51] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:35] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:58:46] (03CR) 10JMeybohm: [C: 03+1] Move kafka-main1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976241 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey) [15:59:09] 10Puppet, 10Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (10Jdlrobson) [15:59:32] (03PS2) 10Jdlrobson: Filter translation service errors [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T351738) [15:59:48] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [16:00:04] eoghan, jelto, and arnoldokoth: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1600). 
[16:00:28] (03PS1) 10Jbond: vrts: use correct variable [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008) [16:02:41] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate cloud-private svc ips to wiki replicas - taavi@cumin1001" [16:03:31] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate cloud-private svc ips to wiki replicas - taavi@cumin1001" [16:03:31] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:36] (03PS2) 10Jbond: vrts: use correct variable [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008) [16:05:02] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp1113 [16:05:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1113 [16:05:48] (03CR) 10Jbond: [C: 03+2] vrts: use correct variable [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:05:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/618/console" [puppet] - 10https://gerrit.wikimedia.org/r/976261 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:06:32] (03CR) 10Elukey: [C: 03+2] Move kafka-main1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976241 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey) [16:06:59] RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2097) taken on 2023-11-21 15:06:15 (948 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:07:35] 10SRE, 10Cloud-VPS, 10observability, 
10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) Thank you @Vgutierrez for the suggestion, I've dug a little bit into the situation and the code and I believe the message is a red-herring, in the sense tha... [16:07:44] (03PS21) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [16:07:47] !log elukey@cumin1001 START - Cookbook sre.puppet.migrate-host for host kafka-main1001.eqiad.wmnet [16:08:55] 10SRE, 10Observability-Logging, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10fgiunchedi) [16:11:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host kafka-main1001.eqiad.wmnet [16:11:59] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10Vgutierrez) nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites [16:13:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10hashar) >>! In T349619#9348943, @Jelto wrote: > @hashar `gerrit2002` was migrated to puppet7. I restarted gerrit and apache processes and the instance looks... 
[16:16:45] RECOVERY - snapshot of s8 in eqiad on backupmon1001 is OK: Last snapshot for s8 at eqiad (db1171) taken on 2023-11-21 14:22:51 (1495 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:19:33] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:35] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:21:55] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:22:04] (03PS1) 10Jbond: pki: add mtls profile [puppet] - 10https://gerrit.wikimedia.org/r/976267 (https://phabricator.wikimedia.org/T351624) [16:22:35] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:44] (03PS1) 10Majavah: wikimedia.cloud: include zone file for svc records [dns] - 10https://gerrit.wikimedia.org/r/976268 [16:24:19] RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2098) taken on 2023-11-21 14:28:29 (1536 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:24:29] (03CR) 10Reedy: [C: 03+1] mediawiki::packages: Drop python-pil [puppet] - 
10https://gerrit.wikimedia.org/r/976202 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff) [16:24:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/621/con" [puppet] - 10https://gerrit.wikimedia.org/r/976267 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond) [16:24:53] (03CR) 10Majavah: "Not sure if there's a better way to do this? The file names feel a bit odd." [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah) [16:26:27] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah) [16:28:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki: add mtls profile [puppet] - 10https://gerrit.wikimedia.org/r/976267 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond) [16:28:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:28:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:41] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:58] (03PS25) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [16:30:43] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:30:43] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:30] (03CR) 10Cathal Mooney: [C: 03+1] wikimedia.cloud: include zone file for svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah) [16:33:04] !log updating pybal to 1.5.14 on eqiad - T351069 [16:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:17] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [16:33:35] (03CR) 10Dwisehaupt: "A couple more minor changes to make this production ready:" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:34:15] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020.eqiad.wmnet} and A:lvs (T351069) [16:34:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020.eqiad.wmnet} and A:lvs (T351069) [16:34:51] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:59] (03PS26) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [16:35:11] RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:02] (03PS6) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [16:36:04] (03PS12) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 
(https://phabricator.wikimedia.org/T300427) [16:36:14] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1017-1019].eqiad.wmnet} and A:lvs (T351069) [16:36:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:36:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:36:54] (03CR) 10Majavah: [C: 03+2] wikimedia.cloud: include zone file for svc records [dns] - 10https://gerrit.wikimedia.org/r/976268 (owner: 10Majavah) [16:37:14] (03PS1) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) [16:37:28] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:38:33] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[16:39:07] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [16:41:04] (03PS2) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) [16:41:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [16:41:45] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[1017-1019].eqiad.wmnet} and A:lvs (T351069) [16:41:50] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [16:41:59] (03PS7) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [16:42:01] (03PS13) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [16:43:15] RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:26] (03CR) 10Majavah: Add wiki replicas to cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [16:44:21] RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2023-11-21 15:19:40 (1150 GiB, +0.2 %) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:44:39] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [16:44:40] (03PS3) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) [16:46:36] (03CR) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [16:47:34] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:48:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/626/con" [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond) [16:49:45] (03CR) 10Vgutierrez: [C: 03+1] "reasoning looks good but please get rzl to check it as well :)" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [16:50:25] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) @fgiunchedi [[ https://gerrit.wikimedia.org/r/c/operations/puppe... 
[16:51:29] RECOVERY - snapshot of s4 in codfw on backupmon1001 is OK: Last snapshot for s4 at codfw (db2099) taken on 2023-11-21 14:01:28 (1797 GiB, -0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:53:37] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) p:05Triage→03Medium [16:54:35] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:14] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976276 [16:55:27] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976276 (owner: 10Ebernhardson) [16:55:59] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:19] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976276 (owner: 10Ebernhardson) [16:57:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:00:04] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1700) [17:00:04] phuedx: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:08] (03PS1) 10Volans: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 [17:00:10] (03PS1) 10Volans: tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 [17:00:38] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:00:43] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:00:59] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:01:40] phuedx: are you here under some other name? :) [17:03:26] (03CR) 10BCornwall: "We're going to wait until there's been more data collection so we have a more complete picture of the SLO." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [17:03:34] (03CR) 10BCornwall: "We're going to wait until there's been more data collection so we have a more complete picture of the SLO." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [17:03:55] (03PS2) 10Volans: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 [17:04:18] (03PS3) 10Volans: requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 [17:04:20] (03PS2) 10Volans: tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 [17:04:32] (03CR) 10BCornwall: acme-chief: Remove acmechief2002 passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:04:38] (03Abandoned) 10BCornwall: acme-chief: Remove acmechief2002 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:06:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:06:59] (03CR) 10Jbond: "one more nit about file locations I missed. 
ultimately this is down to how and where you want files so feel free to just close it down" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:07:13] o/ [17:07:59] (03CR) 10Jbond: [C: 03+1] tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 (owner: 10Volans) [17:08:04] rzl: [17:08:17] (03CR) 10Jbond: [C: 03+1] requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 (owner: 10Volans) [17:08:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:59] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) 05Open→03Resolved a:03Dzahn I am going to be bold and call it resolved. Based on my previous comments. We created a Hiera k... [17:10:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) [17:12:00] phuedx: hello! these look good on their face but I don't know this system well -- are you able to test them post-merge and make sure everything's in good shape? 
[17:12:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158'] [17:13:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1168'] [17:13:28] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1169'] [17:13:31] (03PS1) 10JHathaway: rsync: ensure daemon is started after config [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) [17:13:35] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1170'] [17:13:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] [17:13:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172'] [17:13:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1173'] [17:14:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1174'] [17:14:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [17:14:25] rzl: I guess the good news and the bad news is that they _should_ be no ops. 
I can check in #wikimedia-analytics that the legacy EventLogging refinement systems are still functioning [17:14:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171'] [17:14:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172'] [17:14:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] [17:14:54] haha sure -- I just want to have good confidence that there isn't some other unintended effect somewhere [17:14:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172'] [17:14:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171'] [17:15:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172'] [17:15:15] do they need to be merged in any particular order or should I just go for it? 
[17:15:43] Just go for it :) Any particular order should be fine [17:15:55] 👍 [17:16:29] (03PS1) 10Arnaudb: mariadb: bugfix mysql_upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) [17:17:33] (03CR) 10CI reject: [V: 04-1] rsync: ensure daemon is started after config [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [17:17:49] (03CR) 10Volans: [C: 03+2] requests: fix import of urllib Retry [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976278 (owner: 10Volans) [17:17:53] (03CR) 10Volans: [C: 03+2] tox.ini: remove optimization for tox <4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/976279 (owner: 10Volans) [17:18:52] (03CR) 10RLazarus: [C: 03+2] eventlogging: Remove obsolete FeaturePolicyViolation schema [puppet] - 10https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572) (owner: 10Krinkle) [17:18:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:19:09] (03CR) 10RLazarus: [C: 03+2] Stop refining SpecialMuteSubmit events [puppet] - 10https://gerrit.wikimedia.org/r/894000 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [17:19:15] Lucas_WMDE: I don't know if you got the answer, but if you see two replicas not getting replication (they are red, not yellow) and there is one per dc, it usually means a backup is running (double-check whether they are pooled in https://noc.wikimedia.org/dbconfig/eqiad.json) and there is nothing to be worried about [17:19:35] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:19:40] sorry for a really long response [17:19:51] 
PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:00] (CR) Volans: mariadb: bugfix mysql_upgrade (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [17:20:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1168'] [17:20:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1169'] [17:20:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1170'] [17:20:34] (CR) Ladsgroup: [C: +1] mariadb: bugfix mysql_upgrade [cookbooks] - https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [17:21:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1173'] [17:21:34] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158'] [17:21:40] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1159'] [17:21:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1160'] [17:21:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1159'] [17:22:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1160'] [17:22:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1160'] [17:22:26] !log jclark@cumin1001 END (FAIL) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1160'] [17:22:39] (Merged) jenkins-bot: requests: fix import of urllib Retry [software/pywmflib] - https://gerrit.wikimedia.org/r/976278 (owner: Volans) [17:23:01] rzl: Confirmed in #wikimedia-analytics that there's alerting set up for the affected system. We'll get emails shortly if something breaks as a result of these changes [17:23:10] okay great [17:23:16] puppet's just finishing up [17:23:37] there it goes! both patches merged, and I ran puppet on an-launcher1002,eventlog1003 -- did I miss any? and how's everything looking [17:23:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED [17:23:45] (PS10) Jbond: C:rsync::server: convert to concat [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [17:23:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:03] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:10] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:19] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:24:38] (CR) Jbond: C:rsync::server: convert to concat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: Jbond) [17:24:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:01] (CR) Arnaudb: [C: +2] mariadb: bugfix mysql_upgrade (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [17:25:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:07] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:30] (CR) Arnaudb: [V: +1 C: +2] mariadb: bugfix mysql_upgrade [cookbooks] - https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [17:25:32] (CR) Arnaudb: [V: +2 C: +2] mariadb: bugfix mysql_upgrade [cookbooks] - https://gerrit.wikimedia.org/r/975931 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [17:25:53] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:56] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host 
an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:06] (CR) Jbond: "hi i think i have pretty much the exact same change :) but also with the spec tests fixed (i think, just rebased so will see)" [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway) [17:26:08] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:09] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:21] (Merged) jenkins-bot: tox.ini: remove optimization for tox <4 [software/pywmflib] - https://gerrit.wikimedia.org/r/976279 (owner: Volans) [17:26:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:59] (PS11) Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - https://gerrit.wikimedia.org/r/850172 [17:27:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED [17:27:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED 
[17:27:16] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [17:27:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [17:27:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:52] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:53] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:55] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:56] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED [17:29:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED [17:29:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED [17:29:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED [17:29:46] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED [17:29:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) 
for host an-worker1174.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:09] (CR) JHathaway: rsync: ensure daemon is started after config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway) [17:30:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:30:33] (CR) Ladsgroup: [C: +1] ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos) [17:30:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:41] RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2097) taken on 2023-11-21 16:57:23 (443 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [17:31:37] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158'] [17:31:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1159'] [17:31:44] rzl: I don't think that there are any others. 
I'll keep an eye out for alerts about the legacy EventLogging refinement :) [17:31:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1160'] [17:31:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1161'] [17:32:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1162'] [17:32:08] phuedx: okay sounds good! I'll be around if you need any followups merged [17:32:08] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1163'] [17:32:15] Thanks <3 [17:32:19] (PS1) Jcrespo: bacula: Increase the amount of maximum volumes for regular backups to 140 [puppet] - https://gerrit.wikimedia.org/r/976286 (https://phabricator.wikimedia.org/T351725) [17:32:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1161'] [17:32:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1162'] [17:32:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1164'] [17:33:20] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1165'] [17:33:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1166'] [17:35:03] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:03] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command 
name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:35:11] PROBLEM - Check systemd state on an-worker1155 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:46] (CR) Jbond: "PCC: https://puppet-compiler.wmflabs.org/output/971476/619/" [puppet] - https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:36:09] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:36:15] PROBLEM - Check systemd state on ganeti5005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:52] (PS9) Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/850173 [17:37:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1163'] [17:37:41] (CR) CI reject: [V: -1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/850173 (owner: Jbond) [17:37:49] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:37:57] RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:04] !log jclark@cumin1001 
END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1159'] [17:38:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1160'] [17:38:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1167'] [17:38:27] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1168'] [17:38:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1169'] [17:39:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1165'] [17:39:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1164'] [17:40:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1170'] [17:40:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:40:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 1.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:40:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] [17:40:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171'] [17:40:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:40:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] 
[17:40:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171'] [17:40:32] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:40:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1166'] [17:40:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172'] [17:40:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172'] [17:42:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] [17:42:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171'] [17:42:48] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172'] [17:42:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1173'] [17:43:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172'] [17:43:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1174'] [17:44:18] (CR) Jbond: [C: +2] realm.pp: drop ntp_peers [puppet] - https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:44:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1167'] [17:44:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1168'] [17:44:41] !log 
jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1169'] [17:44:57] RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1225) taken on 2023-11-21 17:10:07 (382 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [17:46:15] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:46:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1170'] [17:48:09] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:11] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:48:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:48:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:49:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1173'] [17:50:24] jouncebot: nowandnext [17:50:24] For the next 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1700) [17:50:24] In 0 hour(s) and 9 minute(s): MediaWiki 
infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1800) [17:52:31] (CR) Ladsgroup: [C: +2] Undeploy DoubleWiki, Part I [mediawiki-config] - https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [17:52:38] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [17:53:22] (Merged) jenkins-bot: Undeploy DoubleWiki, Part I [mediawiki-config] - https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [17:53:37] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:975912|Undeploy DoubleWiki, Part I (T351675)]] [17:53:42] T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675 [17:53:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1174'] [17:54:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1174'] [17:54:24] (CR) Jcrespo: [C: +2] bacula: Increase the amount of maximum volumes for regular backups to 140 [puppet] - https://gerrit.wikimedia.org/r/976286 (https://phabricator.wikimedia.org/T351725) (owner: Jcrespo) [17:54:35] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (Jclark-ctr) [17:54:56] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:975912|Undeploy DoubleWiki, Part I (T351675)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:56:01] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:56:17] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:56:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: 
HTTP/1.1 200 OK - 8571 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:58:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating restbase servers in codfw - jhancock@cumin2002" [17:59:07] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:09] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:59:13] (PS1) Ladsgroup: Undeploy DoubleWiki, Part II [mediawiki-config] - https://gerrit.wikimedia.org/r/976288 (https://phabricator.wikimedia.org/T351675) [17:59:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating restbase servers in codfw - jhancock@cumin2002" [17:59:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:59:59] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T1800) [18:02:05] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:975912|Undeploy DoubleWiki, Part I (T351675)]] (duration: 08m 27s) [18:02:24] !log 
robh@cumin1001 START - Cookbook sre.dns.netbox [18:02:28] T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675 [18:02:30] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/976288 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [18:03:12] (Merged) jenkins-bot: Undeploy DoubleWiki, Part II [mediawiki-config] - https://gerrit.wikimedia.org/r/976288 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [18:03:28] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:976288|Undeploy DoubleWiki, Part II (T351675)]] [18:03:46] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:04:44] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:976288|Undeploy DoubleWiki, Part II (T351675)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:06:09] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:10:44] (PS1) Jclark-ctr: Add an-worker1157-75.yaml file T349936 [puppet] - https://gerrit.wikimedia.org/r/976289 (https://phabricator.wikimedia.org/T349936) [18:11:29] (CR) Jclark-ctr: [C: +2] Add an-worker1157-75.yaml file T349936 [puppet] - https://gerrit.wikimedia.org/r/976289 (https://phabricator.wikimedia.org/T349936) (owner: Jclark-ctr) [18:11:52] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:976288|Undeploy DoubleWiki, Part II (T351675)]] (duration: 08m 24s) [18:12:13] T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675 [18:13:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1158'] [18:13:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1158'] [18:14:34] (PS1) Ladsgroup: Undeploy DoubleWiki, Part III [mediawiki-config] - 
https://gerrit.wikimedia.org/r/976291 (https://phabricator.wikimedia.org/T351675) [18:15:15] RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1216) taken on 2023-11-21 17:03:48 (558 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [18:15:19] !log restart of bacula-sd on backup1009 T351725 [18:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:37] T351725: Daily backup job not running for gerrit1003 - https://phabricator.wikimedia.org/T351725 [18:15:45] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/976291 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [18:16:13] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:16:27] (Merged) jenkins-bot: Undeploy DoubleWiki, Part III [mediawiki-config] - https://gerrit.wikimedia.org/r/976291 (https://phabricator.wikimedia.org/T351675) (owner: Ladsgroup) [18:16:43] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:976291|Undeploy DoubleWiki, Part III (T351675)]] [18:16:47] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:18:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [18:18:13] SRE, ops-eqiad, DC-Ops, Data-Engineering, 
Patch-For-Review: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [18:25:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:25:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:29:48] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:976291|Undeploy DoubleWiki, Part III (T351675)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:29:53] T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675 [18:30:46] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:31:19] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:31:51] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:13] RECOVERY - Check systemd state on ganeti5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye [18:38:09] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye [18:38:41] PROBLEM - mailman list info ssl 
expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:40:21] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:53] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:41:23] PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:37] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:42:21] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:24] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:976291|Undeploy DoubleWiki, Part III (T351675)]] (duration: 25m 41s) [18:42:28] T351675: Undeploy DoubleWiki - https://phabricator.wikimedia.org/T351675 [18:42:59] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:44:29] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:44:57] PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:05] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:17] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:45:45] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:45:47] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:45:53] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:21] PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:23] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:25] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:46:27] PROBLEM - Check systemd state on an-worker1090 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:49] PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:49] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:46:55] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:49:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:49:57] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:50:31] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:23] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:56:45] RECOVERY - snapshot of s5 in codfw on backupmon1001 is OK: Last snapshot for s5 at codfw (db2101) taken on 2023-11-21 17:45:41 (679 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [18:57:29] RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:01] RECOVERY - Check systemd state on an-worker1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:03] (03PS1) 10Jbond: facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299 [18:58:17] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:00:42] (03CR) 10CI reject: [V: 04-1] facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299 (owner: 10Jbond) [19:01:09] RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, 
args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:01:37] RECOVERY - Check systemd state on an-worker1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:40] (03PS1) 10Jbond: java: update certificate name [puppet] - 10https://gerrit.wikimedia.org/r/976300 [19:03:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:03:43] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:04:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:04:37] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:49] RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:53] PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:57] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:05:01] 
PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:05:14] (03CR) 10Jbond: [C: 03+2] java: update certificate name [puppet] - 10https://gerrit.wikimedia.org/r/976300 (owner: 10Jbond) [19:08:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS bullseye [19:08:59] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:09:39] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:09:57] RECOVERY - Check systemd state on an-worker1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:59] PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:23] RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:31] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:10:40] (03PS1) 10Ottomata: refine - Put 
SpecialMuteSubmit and FeaturePolicyViolation in eventlogging analytics exclude list [puppet] - 10https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572) [19:10:50] (03CR) 10Dzahn: "thanks, seems reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/976286 (https://phabricator.wikimedia.org/T351725) (owner: 10Jcrespo) [19:11:29] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS bullseye [19:12:18] (03CR) 10Ottomata: [C: 03+2] refine - Put SpecialMuteSubmit and FeaturePolicyViolation in eventlogging analytics exclude list [puppet] - 10https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572) (owner: 10Ottomata) [19:12:42] (03CR) 10Jbond: "i wonder if the test failures relate to the puppet version" [puppet] - 10https://gerrit.wikimedia.org/r/976299 (owner: 10Jbond) [19:13:06] (03CR) 10CI reject: [V: 04-1] refine - Put SpecialMuteSubmit and FeaturePolicyViolation in eventlogging analytics exclude list [puppet] - 10https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572) (owner: 10Ottomata) [19:13:46] (03PS2) 10Ottomata: refine - Put SpecialMuteSubmit and FeaturePolicyViolation in exclude list [puppet] - 10https://gerrit.wikimedia.org/r/976303 (https://phabricator.wikimedia.org/T209572) [19:14:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:17] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:45] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:19:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:19:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:19:49] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:09] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage [19:26:32] (03PS2) 10Jbond: facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299 [19:26:34] (03PS1) 10Jbond: Gemfile: update to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976304 [19:26:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.635 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:28:09] (03CR) 10CI reject: [V: 04-1] Gemfile: update to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976304 (owner: 10Jbond) [19:28:12] (03CR) 10CI reject: [V: 04-1] facter: add new wmflib.site fact [puppet] - 
10https://gerrit.wikimedia.org/r/976299 (owner: 10Jbond) [19:28:47] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage [19:29:26] (03PS3) 10Jbond: facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299 [19:33:04] (03CR) 10CI reject: [V: 04-1] facter: add new wmflib.site fact [puppet] - 10https://gerrit.wikimedia.org/r/976299 (owner: 10Jbond) [19:33:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:34:40] (03CR) 10Ebernhardson: [C: 03+1] Add alert for CirrusSearch reported memory issues [puppet] - 10https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: 10Ebernhardson) [19:35:39] (03CR) 10DCausse: [C: 03+1] query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:36:23] RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:25] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:38:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1157.eqiad.wmnet with OS bullseye [19:38:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed 
with errors: - an-worke... [19:41:43] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:41:51] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:57] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:42:01] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:41] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:44:55] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:27] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:46:07] PROBLEM - Check systemd state on an-worker1137 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:29] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:46:53] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:49] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:48:09] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:35] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1017.eqiad.wmnet with OS bullseye [19:49:59] RECOVERY - snapshot of s3 in eqiad on backupmon1001 is OK: Last snapshot for s3 at eqiad (db1150) taken on 2023-11-21 17:44:38 (1318 GiB, +0.5 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:50:15] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:50:19] RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:55] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS 
bullseye [19:54:07] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:41] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:58:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1158.eqiad.wmnet with OS bullseye [19:58:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [19:59:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T348183)', diff saved to https://phabricator.wikimedia.org/P53679 and previous config saved to /var/cache/conftool/dbconfig/20231121-195920-arnaudb.json [19:59:27] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:01:10] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:34] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:02:34] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:02:45] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage [20:03:10] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:34] RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:36] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:04:50] RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage [20:06:39] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:06:51] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:55] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:08:11] RECOVERY - Hadoop 
NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:08:13] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:17] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:33] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:12:59] (03PS1) 10Subramanya Sastry: ParserOutputPostCacheTransform: Don't reprocess content [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) [20:14:02] (03PS1) 10Subramanya Sastry: [parser] Broaden TOC placeholder regular expression [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/976331 [20:14:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P53680 and previous config saved to /var/cache/conftool/dbconfig/20231121-201427-arnaudb.json [20:16:35] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:16:49] RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:23] (PuppetFailure) firing: Puppet has 
failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:18:36] (03CR) 10C. Scott Ananian: [C: 03+1] "backporting to unblock visual diff testing of parsoid read views" [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/976331 (owner: 10Subramanya Sastry) [20:18:53] (03CR) 10C. Scott Ananian: [C: 03+1] "backporting to unblock visual diff testing of parsoid read views" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) (owner: 10Subramanya Sastry) [20:19:19] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [20:19:36] RoanKattouw: CTT added a pair of late backport patches to the window in ~50 minutes [20:20:07] I'll be here for the first part of the window but I'll need to leave to bring a kid to his clarinet lesson; subbu will be the primary on-call for the backport. 
[20:20:30] (03CR) 10Ryan Kemper: [C: 03+2] Add alert for CirrusSearch reported memory issues [puppet] - 10https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: 10Ebernhardson) [20:24:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:27:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1018.eqiad.wmnet with OS bullseye [20:29:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P53681 and previous config saved to /var/cache/conftool/dbconfig/20231121-202933-arnaudb.json [20:31:18] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1018.eqiad.wmnet [20:31:19] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1018.eqiad.wmnet [20:32:34] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS bullseye [20:32:34] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:34:26] (03PS1) 10Ryan Kemper: wdqs: remove old nginx-level bans [puppet] - 10https://gerrit.wikimedia.org/r/976308 [20:37:04] (03PS2) 10Ryan Kemper: wdqs: remove old nginx-level bans [puppet] - 10https://gerrit.wikimedia.org/r/976308 [20:39:10] PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:38] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command 
name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:41:28] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1019.eqiad.wmnet with OS bullseye [20:41:42] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS bullseye [20:44:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T348183)', diff saved to https://phabricator.wikimedia.org/P53682 and previous config saved to /var/cache/conftool/dbconfig/20231121-204440-arnaudb.json [20:44:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [20:44:47] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:44:55] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [20:45:02] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53683 and previous config saved to /var/cache/conftool/dbconfig/20231121-204501-arnaudb.json [20:45:36] PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:02] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:46:16] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:22] RECOVERY - snapshot of s7 in eqiad on backupmon1001 is OK: Last snapshot for s7 at eqiad (db1171) taken on 2023-11-21 18:16:13 (1105 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [20:52:12] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:52:23] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage [20:53:00] !log gerrit1003 - deleted /root/backup_of_srv_gerrit_plugins - disk usage down to 56% (T351658) [20:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage [20:55:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:59:42] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231121T2100). 
nyaa~ [21:00:05] subbu and subbu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:00:22] o/ [21:01:34] I can deploy [21:02:26] (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976331 (owner: Subramanya Sastry) [21:03:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:03:44] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:04:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:04:28] RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:05:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:10:30] RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2098) taken on 2023-11-21 18:36:28 (1269 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [21:14:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1019.eqiad.wmnet with OS bullseye [21:15:59] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye [21:16:24] (Merged) jenkins-bot: [parser] Broaden TOC placeholder regular expression [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976331 (owner: Subramanya Sastry) [21:16:39] !log catrope@deploy2002 Started scap: Backport for [[gerrit:976331|[parser] Broaden TOC placeholder regular expression]] [21:16:42] (PS1) Ssingh: pybal: do not install from component [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) [21:18:02] (CR) Ssingh: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/627/con" [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh) [21:18:03] !log catrope@deploy2002 catrope and ssastry: Backport for [[gerrit:976331|[parser] Broaden TOC placeholder regular expression]] synced to 
the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:18:22] ready to test on mwdebug? [21:18:28] subbu: Your first patch (Broaden TOC placeholder regular expression) is ready for testing [21:18:33] ok [21:18:39] Yup you beat me to it :) [21:18:55] is it on 2001.codfw? [21:18:56] (CR) Catrope: [C: +2] ParserOutputPostCacheTransform: Don't reprocess content [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) (owner: Subramanya Sastry) [21:19:09] (CR) Ssingh: [V: +1] "sukhe@apt1001:~$ sudo -i reprepro lsbycomponent pybal" [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh) [21:19:15] It should be on all the test servers [21:21:51] ok, lgtm tested a few different ways. [21:22:28] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2139) taken on 2023-11-21 17:48:43 (1291 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [21:23:27] RoanKattouw, ok to sync the core one. [21:23:28] !log catrope@deploy2002 catrope and ssastry: Continuing with sync [21:24:51] (Merged) jenkins-bot: ParserOutputPostCacheTransform: Don't reprocess content [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976330 (https://phabricator.wikimedia.org/T351461) (owner: Subramanya Sastry) [21:25:43] Oh wow that DT patch merged quickly! I +2ed it earlier to speed things up since I expected it to take 15 mins like the core one [21:26:16] ya .. :) looks like extension patches have fewer gate jobs. [21:26:17] (CR) Ssingh: [V: +1] "There is also python-prometheus-client in the Pybal component that we are installing but I think we will leave that. 
(Moritz, I think you " [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh) [21:29:19] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:976331|[parser] Broaden TOC placeholder regular expression]] (duration: 12m 40s) [21:29:58] !log catrope@deploy2002 Started scap: Backport for [[gerrit:976330|ParserOutputPostCacheTransform: Don't reprocess content (T351461)]] [21:30:12] T351461: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads - https://phabricator.wikimedia.org/T351461 [21:30:18] Alright moving on to the DiscussionTools patch [21:31:18] !log catrope@deploy2002 ssastry and catrope: Backport for [[gerrit:976330|ParserOutputPostCacheTransform: Don't reprocess content (T351461)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:31:30] subbu: The DT patch is ready to test on the mwdebug servers [21:31:39] ty . .testing .. [21:34:24] hmm .. not sure it fixed anything .. testing some other pages. [21:35:49] nah ... it actually made things slightly worse for Parsoid & DT ... :) .. let's skip this one. [21:35:57] OK we'll roll this one back [21:35:59] !log catrope@deploy2002 Sync cancelled. [21:36:02] thanks. [21:36:14] i'll have to go digging elsewhere for this. [21:36:28] subbu: Please submit a revert of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/976330/ with a brief explanation of what went wrong [21:36:41] (even just one sentence after "Reason for revert:" is fine) [21:36:49] will do. 
[21:37:53] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage [21:37:56] (PS1) Subramanya Sastry: Revert "ParserOutputPostCacheTransform: Don't reprocess content" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976332 [21:38:03] RECOVERY - Check systemd state on mw2442 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:11] RoanKattouw, will you +2 that or should I? [21:38:16] I will [21:38:27] (CR) Catrope: [C: +2] Revert "ParserOutputPostCacheTransform: Don't reprocess content" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976332 (owner: Subramanya Sastry) [21:38:56] good we tried to backport it today .. i have some time to fix it tomorrow / monday. anyway, have a good evening all! [21:40:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage [21:44:15] (Merged) jenkins-bot: Revert "ParserOutputPostCacheTransform: Don't reprocess content" [extensions/DiscussionTools] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/976332 (owner: Subramanya Sastry) [21:45:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53684 and previous config saved to /var/cache/conftool/dbconfig/20231121-214534-arnaudb.json [21:45:40] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:00:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53685 and previous config saved to /var/cache/conftool/dbconfig/20231121-220040-arnaudb.json [22:02:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host aqs1020.eqiad.wmnet with OS bullseye [22:06:07] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS bullseye [22:07:59] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [22:10:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:08] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED [22:12:48] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:19] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:15:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53686 and previous config saved to /var/cache/conftool/dbconfig/20231121-221547-arnaudb.json [22:15:50] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED [22:17:55] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host aqs1021.eqiad.wmnet with OS bullseye [22:18:19] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS bullseye [22:18:58] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED [22:20:09] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED [22:23:10] !log 
vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED [22:27:37] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:29:06] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage [22:30:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53687 and previous config saved to /var/cache/conftool/dbconfig/20231121-223053-arnaudb.json [22:31:05] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:32:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage [22:34:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:43:02] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:16] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1035 [22:47:32] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1035 [22:48:33] !log vriley@cumin1001 START - Cookbook sre.dns.netbox [22:48:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:53:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1021.eqiad.wmnet with OS bullseye [22:54:34] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1021.eqiad.wmnet [22:54:35] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1021.eqiad.wmnet [22:55:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:55:20] (PS2) JHathaway: rsync: ensure daemon is started after config [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) [22:55:41] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway) [22:55:58] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: 
The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:14] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:56:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:02] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:58:31] !log vriley@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001" [22:58:34] PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:23] !log vriley@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001" [22:59:23] !log vriley@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:00:04] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1036 [23:00:23] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1037 [23:00:37] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1038 [23:00:54] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:01:08] PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:48] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1037 [23:01:49] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1038 [23:02:00] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1036 [23:02:40] !log vriley@cumin1001 START - Cookbook sre.dns.netbox [23:04:36] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:04:43] !log vriley@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001" [23:04:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:05:18] PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:36] !log vriley@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: message - vriley@cumin1001" [23:05:36] !log vriley@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:05:52] RECOVERY - mailman 
list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:08:36] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:52] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:10:59] (PS1) DDesouza: Undeploy Reader Demographics 2 survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393) [23:13:00] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:13:02] PROBLEM - Check systemd state on an-worker1148 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:38] RECOVERY - Check systemd state on an-worker1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:58] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:14:24] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:15:04] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:46] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:15:50] RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:08] (CR) Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01) [23:17:45] (CR) Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01) [23:21:42] RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:28] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:22:50] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:23:10] RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:31:16] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:31:26] PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:54] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:32:00] RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:57] pff [23:40:49] so I think I might have the start of a fix for the old T282893 (which really most probably has always been there): https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363/files [23:40:50] T282893: Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T282893 [23:54:27] (PS1) Brennen Bearnes: allow all images from docker-registry.tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/976355 (https://phabricator.wikimedia.org/T334512) [23:57:22] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:57:34] RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:01] (CR) Brennen Bearnes: "Someone should definitely check me on this; also I'm trying to remember where else a list of allowed images lives at this point." [puppet] - https://gerrit.wikimedia.org/r/976355 (https://phabricator.wikimedia.org/T334512) (owner: Brennen Bearnes)