[00:04:35] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:12:17] (CR) Ssingh: [C: +1] hiera: added new cp hosts in eqiad [puppet] - https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[00:17:25] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 8568 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[00:17:55] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:29] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:34] (CR) Tim Starling: Enable LoginNotify seen subnets table (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:38:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967919
[00:39:01] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967919 (owner: TrainBranchBot)
[00:56:05] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967919 (owner: TrainBranchBot)
[01:05:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:07:05] (PS14) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[01:44:45] marostegui (since on call SRE is not here): https://phabricator.wikimedia.org/T349671
[01:48:01] does uploading other files work?
[01:49:45] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Peachey88)
[02:04:27] p858snake: let me test
[02:06:49] p858snake: negative
[02:06:54] same error
[02:07:01] going to take the XID and paste it there
[02:08:32] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Jasper) Same problem at https://commons.wikimedia.org/wiki/File:Tammy_2023_path.png (new image) XID 718538244
[02:28:33] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:40] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:31] (PS1) BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731)
[02:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:03:40] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:48:16] (PS1) KartikMistry: Update MinT to 2023-10-25-032936-production [deployment-charts] - https://gerrit.wikimedia.org/r/968388 (https://phabricator.wikimedia.org/T349079)
[03:53:58] (PS15) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[03:55:31] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[03:56:17] (PS16) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[03:59:25] (PS17) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:01:08] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:10:33] (PS18) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:12:06] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:14:26] (PS19) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:16:00] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:19:21] (PS20) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:20:56] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:21:35] (PS21) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:22:01] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:22:55] (PS22) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:24:29] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:24:57] (PS23) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:26:30] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:28:06] (PS24) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:31:11] (PS25) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:32:44] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:33:43] (PS26) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:35:16] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:36:14] (PS27) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:37:47] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:38:51] (PS28) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:42:13] (PS29) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:43:53] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:47:29] (PS30) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:47:56] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:53:42] (PS31) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[04:55:15] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[04:55:41] (PS32) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[05:01:48] (PS33) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[05:04:37] (PS34) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[05:05:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:07:58] (PS35) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[05:09:31] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[05:09:57] (PS36) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[05:25:26] (PS1) MPGuy2824: [DNM] InitialiseSettings-labs: Remove values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/968397 (https://phabricator.wikimedia.org/T331595)
[05:29:18] (PS37) Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[05:31:13] (CR) CI reject: [V: -1] prometheus: Fail Puppet execution for unlisted sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[05:57:28] (PS12) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T0600)
[06:12:11] (PS13) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[06:40:09] (PS14) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[06:42:22] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Joe) p:Unbreak!→Medium Looking at the [[ https://commons.wikimedia.org/wiki/Special:NewFiles | new file stream ]] it looks like uploads work in general, so this is not an UBN! bug as far as SREs are concerned. I'd...
[06:46:55] (CR) Dwisehaupt: "Chunk of fixes and comments. Will follow up with finishing up the rest tomorrow." [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[06:50:54] !log repooling db1231
[06:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53044 and previous config saved to /var/cache/conftool/dbconfig/20231025-065103-arnaudb.json
[06:52:10] (PS1) Arnaudb: mariadb: alarming db1231 [puppet] - https://gerrit.wikimedia.org/r/967920 (https://phabricator.wikimedia.org/T344036)
[06:52:19] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Jasper) I am blocked by this. I have not been able to upload for quite a few hours. This appears to be a malfunction of the Varnish instance I'm being routed towards.
[06:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:00:05] Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:06:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 20%: Maint over', diff saved to https://phabricator.wikimedia.org/P53045 and previous config saved to /var/cache/conftool/dbconfig/20231025-070608-arnaudb.json
[07:21:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 30%: Maint over', diff saved to https://phabricator.wikimedia.org/P53046 and previous config saved to /var/cache/conftool/dbconfig/20231025-072113-arnaudb.json
[07:36:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 40%: Maint over', diff saved to https://phabricator.wikimedia.org/P53047 and previous config saved to /var/cache/conftool/dbconfig/20231025-073618-arnaudb.json
[07:39:37] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (ayounsi) Update on the JTAC case, they're still working on it, they tested it without SSL and it worked fine, but: > Hope you are doi...
[07:45:36] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/168/console" [puppet] - https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: Jbond)
[07:49:53] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/169/console" [puppet] - https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: Jbond)
[07:51:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P53048 and previous config saved to /var/cache/conftool/dbconfig/20231025-075123-arnaudb.json
[08:02:19] (CR) Marostegui: [C: +1] mariadb: alarming db1231 [puppet] - https://gerrit.wikimedia.org/r/967920 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[08:03:23] (CR) Giuseppe Lavagetto: [V: +1 C: +1] "LGTM, good job!" [puppet] - https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: Jbond)
[08:06:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 60%: Maint over', diff saved to https://phabricator.wikimedia.org/P53049 and previous config saved to /var/cache/conftool/dbconfig/20231025-080628-arnaudb.json
[08:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:11:45] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/967870 (https://phabricator.wikimedia.org/T344428) (owner: Giuseppe Lavagetto)
[08:12:56] (CR) Ladsgroup: [C: -1] Enable LoginNotify seen subnets table (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[08:16:53] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (MatthewVernon) FWIW, there's no evidence of Tammy_2023_path.png in the swift frontend logs.
[08:17:17] (CR) Filippo Giunchedi: [C: +2] alertmanager: ship amtool.yml for AM api access [puppet] - https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[08:18:29] (CR) Slavina Stefanova: [C: +1] harbor: upgrade from 2.5 to 2.9 [puppet] - https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: Slavina Stefanova)
[08:21:00] (CR) David Caro: [C: +2] harbor: upgrade from 2.5 to 2.9 [puppet] - https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: Slavina Stefanova)
[08:21:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 70%: Maint over', diff saved to https://phabricator.wikimedia.org/P53050 and previous config saved to /var/cache/conftool/dbconfig/20231025-082133-arnaudb.json
[08:24:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:29:11] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:32] (CR) Filippo Giunchedi: [C: +2] alertmanager: let karma use apache to access AM [puppet] - https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[08:31:22] (CR) Giuseppe Lavagetto: jobrunner: increase open files limit [puppet] - https://gerrit.wikimedia.org/r/967870 (https://phabricator.wikimedia.org/T344428) (owner: Giuseppe Lavagetto)
[08:35:23] SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) They replied back saying that the staging env is ready. Following their instructions I get a 401, to be investigated.
[08:35:43] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 80%: Maint over', diff saved to https://phabricator.wikimedia.org/P53051 and previous config saved to /var/cache/conftool/dbconfig/20231025-083638-arnaudb.json
[08:41:59] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Jasper) Then that would be evidence that my request never reached the Swift frontend. Also, the problem is more than just uploading; it extends to editing too, particularly for any large article. Overall, this seems to b...
[08:49:52] (CR) JMeybohm: [C: +1] "If you want to stick with the bool toggle, this LGTM" [puppet] - https://gerrit.wikimedia.org/r/967870 (https://phabricator.wikimedia.org/T344428) (owner: Giuseppe Lavagetto)
[08:51:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 90%: Maint over', diff saved to https://phabricator.wikimedia.org/P53052 and previous config saved to /var/cache/conftool/dbconfig/20231025-085143-arnaudb.json
[08:53:47] (CR) Btullis: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:57:00] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: increase open files limit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967870 (https://phabricator.wikimedia.org/T344428) (owner: Giuseppe Lavagetto)
[08:58:26] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 8 CORE_DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:58:51] (PS1) Ilias Sarantopoulos: ml-services: upgrade llm image to allow local runs of llms [deployment-charts] - https://gerrit.wikimedia.org/r/968614 (https://phabricator.wikimedia.org/T349371)
[09:02:58] (CR) Ilias Sarantopoulos: [C: +2] ml-services: upgrade llm image to allow local runs of llms [deployment-charts] - https://gerrit.wikimedia.org/r/968614 (https://phabricator.wikimedia.org/T349371) (owner: Ilias Sarantopoulos)
[09:03:50] (Merged) jenkins-bot: ml-services: upgrade llm image to allow local runs of llms [deployment-charts] - https://gerrit.wikimedia.org/r/968614 (https://phabricator.wikimedia.org/T349371) (owner: Ilias Sarantopoulos)
[09:04:20] (PS1) Filippo Giunchedi: alertmanager: sanitise silence audit log [puppet] - https://gerrit.wikimedia.org/r/968615 (https://phabricator.wikimedia.org/T321579)
[09:05:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:06:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53053 and previous config saved to /var/cache/conftool/dbconfig/20231025-090648-arnaudb.json
[09:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:13:21] (PS14) Btullis: Deploy multiple spark shufflers for yarn to production [puppet] - https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910)
[09:15:34] (PS1) JMeybohm: Remove deprecated hiera keys from icu63 upgrade [puppet] - https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561)
[09:16:16] (PS2) Brouberol: Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout [puppet] - https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495)
[09:17:18] (CR) Brouberol: Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) (owner: Brouberol)
[09:23:57] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:15] (CR) Btullis: [C: +1] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout [puppet] - https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) (owner: Brouberol)
[09:25:13] (CR) Btullis: [C: +1] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) (owner: Brouberol)
[09:25:15] (CR) Brouberol: [C: +2] Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout [puppet] - https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) (owner: Brouberol)
[09:25:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:15] SRE, Infrastructure-Foundations, netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[09:26:32] SRE, ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (ayounsi) Resolved→Open Can you confirm on which switch port this was connected? Dell numbering is annoyingly different from Juniper. I see a transceiver (SFP+ 10GBASE-CR-DAC-3.0M) on port 2 (bottom left), but...
[09:34:11] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:35:20] (PS1) Hnowlan: service_proxy: add rest-gateway to listeners [puppet] - https://gerrit.wikimedia.org/r/968617 (https://phabricator.wikimedia.org/T348731)
[09:35:57] (PS2) Majavah: aptrepo: Import kubeadm 1.23 for bookworm [puppet] - https://gerrit.wikimedia.org/r/967875 (https://phabricator.wikimedia.org/T284656)
[09:35:58] (PS1) Majavah: P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656)
[09:38:41] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:38:55] (PS1) Ayounsi: Add support for SONiC EthernetX named interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/968619 (https://phabricator.wikimedia.org/T335028)
[09:39:55] (PS2) Majavah: P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656)
[09:40:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:33] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:42:33] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:34] (PS1) Giuseppe Lavagetto: Update httpd images to pick up the change in glogger [docker-images/production-images] - https://gerrit.wikimedia.org/r/968620
[09:44:25] (CR) FNegri: "In other places we're always using debian::codename::eq('buster'), not sure if "==" is working as well, probably yes but I would double ch" [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:46:59] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/178/console" [puppet] - https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561) (owner: JMeybohm)
[09:47:08] (PS3) Majavah: P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656)
[09:47:27] (CR) Majavah: P:wmcs::kubeadm: rely on iptables-nft on bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:48:12] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:48:20] (CR) FNegri: aptrepo: Import kubeadm 1.23 for bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967875 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:49:02] (CR) FNegri: [C: +1] P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:49:52] (CR) Majavah: aptrepo: Import kubeadm 1.23 for bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967875 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:50:32] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:50:59] (CR) FNegri: [C: +1] aptrepo: Import kubeadm 1.23 for bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967875 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:53:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1007.eqiad.wmnet with OS bullseye
[09:55:19] (CR) David Caro: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[09:55:21] (CR) JMeybohm: [C: +1] miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[09:59:06] (CR) Majavah: [C: +2] aptrepo: Import kubeadm 1.23 for bookworm [puppet] - https://gerrit.wikimedia.org/r/967875 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1000)
[10:02:03] !log import kubernetes 1.23 packages for debian bookworm T284656
[10:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:21] T284656: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656
[10:05:08] (CR) Arnaudb: [C: +2] mariadb: alarming db1231 [puppet] - https://gerrit.wikimedia.org/r/967920 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[10:06:20] (PS4) Majavah: P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656)
[10:06:22] (PS1) Majavah: P:wmcs::kubeadm: install containerd on bookworm [puppet] - https://gerrit.wikimedia.org/r/968623 (https://phabricator.wikimedia.org/T284656)
[10:08:48] (PS5) Majavah: P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656)
[10:08:50] (PS2) Majavah: P:wmcs::kubeadm: install containerd on bookworm [puppet] - https://gerrit.wikimedia.org/r/968623 (https://phabricator.wikimedia.org/T284656)
[10:14:13] (CR) Jbond: "LGTm see inline for minor improvment" [puppet] - https://gerrit.wikimedia.org/r/968360 (https://phabricator.wikimedia.org/T349166) (owner: EoghanGaffney)
[10:15:51] (CR) Jbond: prometheus: realise blackbox::check's instantly on prometheus hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968250 (owner: Jbond)
[10:19:03] (CR) Jbond: "see inline for suggested improvment" [puppet] - https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: EoghanGaffney)
[10:21:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:22:00] (CR) Jbond: [C: +1] Add support for SONiC EthernetX named interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/968619 (https://phabricator.wikimedia.org/T335028) (owner: Ayounsi)
[10:24:17] !log mwmaint2002: foreachwikiindblist /srv/mediawiki/dblists/growth-biggest.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue (T344428; with higher file limit)
[10:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:22] T344428: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428
[10:25:21] (CR) Majavah: [C: +2] P:wmcs::kubeadm: rely on iptables-nft on bookworm [puppet] - https://gerrit.wikimedia.org/r/968618 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[10:25:33] (CR) Majavah: [C: +2] P:wmcs::kubeadm: install containerd on bookworm [puppet] - https://gerrit.wikimedia.org/r/968623 (https://phabricator.wikimedia.org/T284656) (owner: Majavah)
[10:35:14] (CR) Jbond: [C: +2] wmflib::compile_redirects: convert to modern API [puppet] - https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: Jbond)
[10:43:54] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (jbond)
[10:44:10] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): re-write compile_redirects function - https://phabricator.wikimedia.org/T348883 (jbond) Stalled→Resolved a:jbond
[10:45:25] (PS2) Jbond: Fix HTML index title and make titles concises [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967506 (owner: Hashar)
[10:51:23] (PS1) Jbond: puppet_compier: its buster that needs the backport [puppet] - https://gerrit.wikimedia.org/r/968631
[10:51:37] (CR) Jbond: [C: +2] puppet_compier: its buster that needs the backport [puppet] - https://gerrit.wikimedia.org/r/968631 (owner: Jbond)
[10:53:18] (NodeTextfileStale) firing: Stale textfile for
puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:56:40] !log mwmaint2002: foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue (T344428; all wikis, higher file limit) [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:45] T344428: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 [11:01:36] (03CR) 10Jbond: [C: 03+2] Remove `defaultbranch=master` from .gitreview [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967440 (https://phabricator.wikimedia.org/T146293) (owner: 10Hashar) [11:01:38] (03CR) 10Jbond: [C: 03+2] tox: add HTML and branch coverage to pytest [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967441 (owner: 10Hashar) [11:01:41] (03CR) 10Jbond: [C: 03+2] tox: flake8 exclude build and venv directories [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967483 (owner: 10Hashar) [11:01:43] (03CR) 10Jbond: [C: 03+2] Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [11:01:46] (03CR) 10Jbond: [C: 03+2] Add a json representation for each host [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 (owner: 10Hashar) [11:01:48] (03CR) 10Jbond: [C: 03+2] Fix HTML index title and make titles concises [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967506 (owner: 10Hashar) [11:03:13] (DruidWebrequestSampledNoEvents) firing: Zero webrequest_sampled events received by druid_analytics over the last 30 minutes. 
... [11:03:13] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_webrequest_sampled_live_Supervisor - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1&var-druid_datasource=webrequest_sampled_live - https://alerts.wikimedia.org/?q=alertname%3DDruidWebrequestSampledNoEvents [11:04:44] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [11:05:06] (03Merged) 10jenkins-bot: Remove `defaultbranch=master` from .gitreview [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967440 (https://phabricator.wikimedia.org/T146293) (owner: 10Hashar) [11:05:10] (03Merged) 10jenkins-bot: tox: add HTML and branch coverage to pytest [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967441 (owner: 10Hashar) [11:05:13] (03Merged) 10jenkins-bot: tox: flake8 exclude build and venv directories [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967483 (owner: 10Hashar) [11:05:16] (03Merged) 10jenkins-bot: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [11:05:18] (03Merged) 10jenkins-bot: Add a json representation for each host [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 (owner: 10Hashar) [11:05:56] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Docker [11:08:21] (03Merged) 10jenkins-bot: Fix HTML index title and make titles concises [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967506 (owner: 10Hashar) [11:24:21] (03PS1) 10Majavah: kubeadm: only install containerd.io with docker [puppet] - 10https://gerrit.wikimedia.org/r/968634 (https://phabricator.wikimedia.org/T284656) [11:24:23] (03PS1) 
10Majavah: kubeadm: containerd: install br_netfilter kmod [puppet] - 10https://gerrit.wikimedia.org/r/968635 (https://phabricator.wikimedia.org/T284656) [11:26:50] (03PS1) 10Urbanecm: changeprop: Increase refreshUserImpactJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/968636 (https://phabricator.wikimedia.org/T344428) [11:27:11] 10SRE, 10Math, 10RESTBase-API, 10Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (10Physikerwelt) 05Resolved→03Open Ok, I'll add that. [11:27:34] (03PS1) 10Btullis: Change the reuse-parts recipe for kafka-jumbo slightly. [puppet] - 10https://gerrit.wikimedia.org/r/968637 (https://phabricator.wikimedia.org/T348495) [11:35:17] (03CR) 10Brouberol: [C: 03+1] Change the reuse-parts recipe for kafka-jumbo slightly. [puppet] - 10https://gerrit.wikimedia.org/r/968637 (https://phabricator.wikimedia.org/T348495) (owner: 10Btullis) [11:35:35] (03PS1) 10Jbond: 2.7.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/968640 [11:35:48] (03CR) 10Jbond: [C: 03+2] 2.7.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/968640 (owner: 10Jbond) [11:36:30] (03CR) 10Btullis: [C: 03+2] Change the reuse-parts recipe for kafka-jumbo slightly. 
[puppet] - 10https://gerrit.wikimedia.org/r/968637 (https://phabricator.wikimedia.org/T348495) (owner: 10Btullis) [11:37:54] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/968642 [11:39:52] (03Merged) 10jenkins-bot: 2.7.0: prepare release [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/968640 (owner: 10Jbond) [11:43:06] (03PS1) 10Jbond: Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/968643 [11:43:20] (03CR) 10Jbond: [C: 03+2] Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/968643 (owner: 10Jbond) [11:46:53] (03Merged) 10jenkins-bot: Merge branch '2.x' [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/968643 (owner: 10Jbond) [11:48:48] (03PS2) 10Majavah: kubeadm: containerd: add kernel modules and config [puppet] - 10https://gerrit.wikimedia.org/r/968635 (https://phabricator.wikimedia.org/T284656) [11:49:18] (03CR) 10CI reject: [V: 04-1] kubeadm: containerd: add kernel modules and config [puppet] - 10https://gerrit.wikimedia.org/r/968635 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [12:04:32] (03PS2) 10Majavah: kubeadm: only install containerd.io with docker [puppet] - 10https://gerrit.wikimedia.org/r/968634 (https://phabricator.wikimedia.org/T284656) [12:04:34] (03PS3) 10Majavah: kubeadm: containerd: add kernel modules and config [puppet] - 10https://gerrit.wikimedia.org/r/968635 (https://phabricator.wikimedia.org/T284656) [12:04:36] (03PS1) 10Majavah: kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) [12:05:44] (03PS2) 10Majavah: kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) [12:06:12] (03PS3) 10Majavah: kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 
(https://phabricator.wikimedia.org/T284656) [12:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:07:22] (03CR) 10CI reject: [V: 04-1] kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [12:07:50] (03PS1) 10Jbond: setup.py: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/968648 [12:08:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] setup.py: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/968648 (owner: 10Jbond) [12:09:01] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/968642 (owner: 10Jbond) [12:15:55] (03PS2) 10Urbanecm: Growth: Enable new Impact backend everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) [12:17:30] (03PS4) 10Majavah: kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) [12:18:41] (03PS2) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [12:20:33] (03PS1) 10KartikMistry: testwiki: Enable Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968649 (https://phabricator.wikimedia.org/T345267) [12:21:03] 10sre-alert-triage: TEST Alert in need of triage: SystemdUnitFailed (instance mwmaint2002:9100) - https://phabricator.wikimedia.org/T349698 (10fgiunchedi) [12:22:49] (03PS5) 10Majavah: kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) [12:41:58] (03CR) 
10Jbond: Initial checkin of community_civicrm module (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [12:42:42] (03CR) 10Jforrester: Remove deprecated hiera keys from icu63 upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [12:43:29] (03PS1) 10Kevin Bazira: ml-services: add rest-gateway listener for rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/967922 (https://phabricator.wikimedia.org/T348607) [12:44:10] (03CR) 10Jbond: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [12:45:07] (03CR) 10Elukey: [C: 03+1] ml-services: add rest-gateway listener for rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/967922 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [12:47:13] (03PS3) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [12:48:30] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [12:57:47] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: add rest-gateway listener for rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/967922 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [12:58:45] (03Merged) 10jenkins-bot: ml-services: add rest-gateway listener for rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/967922 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [12:58:55] (03PS4) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [12:59:52] 
!log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1017.eqiad.wmnet [12:59:54] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 3 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) Thanks @joe for the FD limits change! All tests I did so far suggest that the errors tracke... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1300). [13:00:05] James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:24] * James_F waves. [13:00:29] I'm happy to deploy for myself. [13:00:34] * urbanecm waves too, but i assume James_F will self deploy [13:00:49] +1 [13:01:29] (03PS2) 10Jforrester: ExtensionDistributor: Add REL1_41 as the development snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964910 (https://phabricator.wikimedia.org/T346929) [13:01:33] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:01:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964910 (https://phabricator.wikimedia.org/T346929) (owner: 10Jforrester) [13:02:02] * Lucas_WMDE likewise [13:02:02] (03CR) 10Klausman: team-ml: add alert for Kafka consumer lag for ores extension (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:02:58] (03Merged) 10jenkins-bot: ExtensionDistributor: Add REL1_41 as the development snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964910 (https://phabricator.wikimedia.org/T346929) (owner: 10Jforrester) [13:03:33] Oy, loads of i18n backport noise. First deploy of the day fun. [13:04:23] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:964910|ExtensionDistributor: Add REL1_41 as the development snapshot (T346929)]] [13:04:28] T346929: Add REL1_41 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T346929 [13:05:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:05:50] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:964910|ExtensionDistributor: Add REL1_41 as the development snapshot (T346929)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:06:01] !log jforrester@deploy2002 jforrester: Continuing with sync [13:06:09] 10sre-alert-triage: TEST Alert in need of triage: SystemdUnitFailed (instance mwmaint2002:9100) - https://phabricator.wikimedia.org/T349698 (10fgiunchedi) 05Open→03Invalid Just a test [13:06:22] (03PS2) 10Jforrester: [wikifunctions] Allow logged-out users to run approved functions [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/968362 (https://phabricator.wikimedia.org/T349055) [13:06:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1017.eqiad.wmnet [13:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:08:24] Is ^ because deploy1002 was re-racked yesterday? [13:10:21] (03PS2) 10JMeybohm: Remove deprecated hiera keys from icu63 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561) [13:10:23] (03PS3) 10JMeybohm: Add a Hiera option to enable ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:10:25] (03PS3) 10JMeybohm: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:10:27] (03PS3) 10JMeybohm: Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:10:29] (03PS3) 10JMeybohm: Enable icu67 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:10:31] (03PS1) 10JMeybohm: Enable icu67 component on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) [13:10:54] (03Abandoned) 10JMeybohm: Enable icu67 component on appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/954666 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:10:56] (03PS25) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) 
[13:10:58] (03Abandoned) 10JMeybohm: Enable icu67 component on api hosts [puppet] - 10https://gerrit.wikimedia.org/r/954665 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:00] (03Abandoned) 10JMeybohm: Enable icu67 component on parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/954664 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:02] (03Abandoned) 10JMeybohm: Enable icu67 component on jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/954663 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:13] (03Abandoned) 10JMeybohm: Enable icu67 component on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954662 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:17] (03Abandoned) 10JMeybohm: Enable icu67 component on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/954660 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:21] (03Abandoned) 10JMeybohm: Enable icu67 component on appserver canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954658 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:25] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:964910|ExtensionDistributor: Add REL1_41 as the development snapshot (T346929)]] (duration: 07m 01s) [13:11:30] T346929: Add REL1_41 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T346929 [13:11:37] (03CR) 10JMeybohm: Add a Hiera option to enable ICU67 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:11:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968362 (https://phabricator.wikimedia.org/T349055) (owner: 10Jforrester) [13:11:43] (03CR) 10CI 
reject: [V: 04-1] Enable icu67 component on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [13:12:33] (03PS2) 10Jforrester: Remove no-op $wgHiddenPrefs[] = 'prefershttps' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967992 (owner: 10Bartosz Dziewoński) [13:12:35] (03PS2) 10JMeybohm: Enable icu67 component on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) [13:12:37] (03PS4) 10JMeybohm: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:12:39] (03PS4) 10JMeybohm: Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:12:41] (03PS4) 10JMeybohm: Enable icu67 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [13:12:56] (03Merged) 10jenkins-bot: [wikifunctions] Allow logged-out users to run approved functions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968362 (https://phabricator.wikimedia.org/T349055) (owner: 10Jforrester) [13:13:19] (03CR) 10JMeybohm: Remove deprecated hiera keys from icu63 upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [13:13:21] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:968362|[wikifunctions] Allow logged-out users to run approved functions (T349055)]] [13:13:26] T349055: User rights: Users can run functions - https://phabricator.wikimedia.org/T349055 [13:13:52] (03PS1) 10Elukey: install_server: fix reuse-parts-test.cfg [puppet] - 10https://gerrit.wikimedia.org/r/968659 [13:14:43] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:968362|[wikifunctions] Allow 
logged-out users to run approved functions (T349055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:15:06] (03CR) 10Elukey: [C: 03+2] install_server: fix reuse-parts-test.cfg [puppet] - 10https://gerrit.wikimedia.org/r/968659 (owner: 10Elukey) [13:16:02] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/186/console" [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [13:16:13] !log jforrester@deploy2002 jforrester: Continuing with sync [13:16:33] (03PS1) 10JMeybohm: Revert "Enable icu67 component on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/968660 (https://phabricator.wikimedia.org/T345561) [13:16:38] (03PS1) 10Jforrester: Allow logged out users to run FunctionEvaluator widget [extensions/WikiLambda] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968319 (https://phabricator.wikimedia.org/T301670) [13:17:26] (03CR) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:21:21] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:968362|[wikifunctions] Allow logged-out users to run approved functions (T349055)]] (duration: 07m 59s) [13:21:25] T349055: User rights: Users can run functions - https://phabricator.wikimedia.org/T349055 [13:21:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967992 (owner: 10Bartosz Dziewoński) [13:22:19] (03Merged) 10jenkins-bot: Remove no-op $wgHiddenPrefs[] = 'prefershttps' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967992 (owner: 10Bartosz Dziewoński) [13:22:41] !log jforrester@deploy2002 Started scap: Backport for 
[[gerrit:967992|Remove no-op $wgHiddenPrefs[] = 'prefershttps']] [13:22:51] (03CR) 10Btullis: [C: 03+1] "Thanks so much for the fix." [puppet] - 10https://gerrit.wikimedia.org/r/968659 (owner: 10Elukey) [13:24:02] !log jforrester@deploy2002 matmarex and jforrester: Backport for [[gerrit:967992|Remove no-op $wgHiddenPrefs[] = 'prefershttps']] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:19] !log jforrester@deploy2002 matmarex and jforrester: Continuing with sync [13:25:12] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on an-tool1010.eqiad.wmnet with reason: Moving an-tool1010 [13:25:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-tool1010.eqiad.wmnet with reason: Moving an-tool1010 [13:27:22] (03CR) 10Jforrester: [C: 03+1] Remove deprecated hiera keys from icu63 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [13:29:27] (03PS3) 10Jbond: mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [13:29:29] (03PS1) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) [13:29:31] (03PS1) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) [13:29:33] (03PS1) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [13:29:35] (03PS1) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [13:29:36] !log jforrester@deploy2002 Finished scap: Backport for 
[[gerrit:967992|Remove no-op $wgHiddenPrefs[] = 'prefershttps']] (duration: 06m 54s) [13:29:37] (03PS1) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) [13:29:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968319 (https://phabricator.wikimedia.org/T301670) (owner: 10Jforrester) [13:31:11] (03PS26) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:31:19] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968319 (https://phabricator.wikimedia.org/T301670) (owner: 10Jforrester) [13:31:37] (03PS1) 10Marostegui: install_server: Do not reimage db1229 [puppet] - 10https://gerrit.wikimedia.org/r/968671 [13:32:19] (03PS1) 10Jforrester: [Staging only] wikifunctions: Raise PyWASM memory limits by 2x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968672 [13:32:33] oh thanks James_F [13:33:41] MatmaRex: Happy to help. Thanks for finding that! 
[13:33:55] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Raise PyWASM memory limits by 2x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968672 (owner: 10Jforrester) [13:33:57] (03CR) 10Klausman: [C: 03+1] team-ml: add alert for Kafka consumer lag for ores extension (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:34:31] (03Merged) 10jenkins-bot: Allow logged out users to run FunctionEvaluator widget [extensions/WikiLambda] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968319 (https://phabricator.wikimedia.org/T301670) (owner: 10Jforrester) [13:34:38] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/968671 (owner: 10Marostegui) [13:34:52] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Raise PyWASM memory limits by 2x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968672 (owner: 10Jforrester) [13:34:59] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:968319|Allow logged out users to run FunctionEvaluator widget (T301670 T349055 T349057)]] [13:35:07] T349057: User rights: Users can run unsaved code on the implementation page - https://phabricator.wikimedia.org/T349057 [13:35:07] T301670: Create a placeholder Roles module in Vuex ahead of future role work - https://phabricator.wikimedia.org/T301670 [13:35:08] T349055: User rights: Users can run functions - https://phabricator.wikimedia.org/T349055 [13:35:51] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:36:06] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:36:16] (03CR) 10Fabfur: [C: 03+2] hiera: added new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [13:36:29] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1229 [puppet] - 
10https://gerrit.wikimedia.org/r/968671 (owner: 10Marostegui) [13:36:44] fabfur: you can go ahead and merge my change anytime [13:38:29] ah ok thanks [13:38:56] merged thanks [13:40:27] (03PS5) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [13:42:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye [13:42:53] (03PS1) 10Ayounsi: Allow mgmt to pki for CRL retrieval [homer/public] - 10https://gerrit.wikimedia.org/r/968676 [13:44:13] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [13:45:16] 10SRE, 10serviceops-radar, 10Patch-For-Review: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10JMeybohm) 05Open→03Declined Because of time constraints we're going to do the ICU upgrade the "old way" again. 
Closing this in favor of {T345561} [13:45:25] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline for naming" [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [13:45:52] (03PS1) 10Ayounsi: profile::pki::multirootca: allow port 80 from mgmt [puppet] - 10https://gerrit.wikimedia.org/r/968677 [13:46:27] (03PS2) 10Ayounsi: profile::pki::multirootca: allow port 80 from mgmt [puppet] - 10https://gerrit.wikimedia.org/r/968677 [13:46:28] 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [13:46:45] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/968677 (owner: 10Ayounsi) [13:47:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:01] (03CR) 10Filippo Giunchedi: "Also in other per-team bits we're using "team" not "owner" as the argument name, e.g. 
"prometheus::blackbox::check::http" which I think is" [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [13:51:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 15 hosts with reason: not pooled, reimaging in progress [13:51:49] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1007.eqiad.wmnet with reason: host reimage [13:52:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 15 hosts with reason: not pooled, reimaging in progress [13:52:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/968676 (owner: 10Ayounsi) [13:52:58] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-tool1010 [13:53:56] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:968319|Allow logged out users to run FunctionEvaluator widget (T301670 T349055 T349057)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:53:58] !log jforrester@deploy2002 jforrester: Continuing with sync [13:54:05] T349057: User rights: Users can run unsaved code on the implementation page - https://phabricator.wikimedia.org/T349057 [13:54:05] T301670: Create a placeholder Roles module in Vuex ahead of future role work - https://phabricator.wikimedia.org/T301670 [13:54:05] T349055: User rights: Users can run functions - https://phabricator.wikimedia.org/T349055 [13:54:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-tool1010 [13:55:03] 
!log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1007.eqiad.wmnet with reason: host reimage [13:57:56] (03CR) 10Ayounsi: [C: 03+2] Allow mgmt to pki for CRL retrieval [homer/public] - 10https://gerrit.wikimedia.org/r/968676 (owner: 10Ayounsi) [13:58:02] (03CR) 10Ayounsi: [C: 03+2] profile::pki::multirootca: allow port 80 from mgmt [puppet] - 10https://gerrit.wikimedia.org/r/968677 (owner: 10Ayounsi) [13:59:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [14:00:04] Deploy window Wikifunction Services UTC Afternoon¥ (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1400) [14:00:17] PROBLEM - Check systemd state on ml-serve1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:38] (03PS1) 10Ssingh: hiera: remove dns1004 for authdns_servers for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/968680 (https://phabricator.wikimedia.org/T342154) [14:01:59] PROBLEM - Check systemd state on kubernetes2035 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:02:10] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host deploy1002 [14:02:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host deploy1002 [14:02:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cp1100.eqiad.wmnet with reason: host reimage [14:03:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:38] (03CR) 10Ssingh: [C: 03+2] hiera: remove dns1004 for authdns_servers for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/968680 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [14:05:13] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [14:05:40] RECOVERY - Check systemd state on ml-serve1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:13] 10SRE, 10ops-eqiad, 10serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (10Jclark-ctr) a:03Jclark-ctr [14:07:30] 10SRE, 10ops-eqiad, 10serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (10Jclark-ctr) It is reachable, and Taavi took care of the switch interface. It was missed by Valery; I will work with her and remind her of it. Do you still see any other issue @ayounsi prior to me closing...
[14:09:01] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) [14:09:22] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) Relocated an-tool1010 to rack C3 [14:09:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bookworm [14:11:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:11:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1007.eqiad.wmnet with OS bullseye [14:11:56] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:01] ^ expected [14:13:13] (DruidWebrequestSampledNoEvents) resolved: Zero webrequest_sampled events received by druid_analytics over the last 30 minutes. ... 
[14:13:13] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_webrequest_sampled_live_Supervisor - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1&var-druid_datasource=webrequest_sampled_live - https://alerts.wikimedia.org/?q=alertname%3DDruidWebrequestSampledNoEvents [14:13:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10SRE Observability (FY2023/2024-Q2): librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10lmata) [14:13:54] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:42] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2035 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:16:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:28] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS bullseye [14:27:09] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns1004.wikimedia.org with OS bookworm [14:27:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bookworm [14:27:52] (03CR) 10Btullis: [C: 03+1] dse-k8s: don't watch 
rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:28:26] (03PS1) 10Jbond: profile::pki::multirootca: Add initial CRL [puppet] - 10https://gerrit.wikimedia.org/r/968696 (https://phabricator.wikimedia.org/T340543) [14:28:51] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) First host reimage is complete: ` Reimage completed: - cp1100 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Puppet - Removed from Puppet and PuppetDB if present and delet... [14:28:57] (03CR) 10CI reject: [V: 04-1] profile::pki::multirootca: Add initial CRL [puppet] - 10https://gerrit.wikimedia.org/r/968696 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [14:29:00] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:56] This sync has had a stuck host on the php-fpm-restart phase now for 30 minutes [14:30:10] !log jforrester@deploy2002 sync-world aborted: Backport for [[gerrit:968319|Allow logged out users to run FunctionEvaluator widget (T301670 T349055 T349057)]] (duration: 55m 10s) [14:30:18] T349057: User rights: Users can run unsaved code on the implementation page - https://phabricator.wikimedia.org/T349057 [14:30:18] T301670: Create a placeholder Roles module in Vuex ahead of future role work - https://phabricator.wikimedia.org/T301670 [14:30:18] T349055: User rights: Users can run functions - https://phabricator.wikimedia.org/T349055 [14:30:44] Anyway, 30 minutes late, done. [14:31:20] o_O [14:31:28] I don’t remember scap getting stuck that long in this particular phase before… [14:31:32] Oh, I do. [14:31:33] Sadly. 
[14:31:34] (the k8s rollouts sometimes take longer) [14:31:36] ok :/ [14:31:49] Sometimes one of the servers is just having a bad day, or there's a dropped packet, or… [14:32:18] I believe the php-fpm-restart command is a single outbound command with a single return response, and no ongoing check. [14:32:35] (03CR) 10Btullis: [C: 03+1] Switch druid1006 zookeeper node with druid1011 [puppet] - 10https://gerrit.wikimedia.org/r/965501 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:32:59] (03CR) 10Btullis: [C: 03+1] Switch druid1005 zookeeper node with druid1010 [puppet] - 10https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:33:13] (03CR) 10Btullis: [C: 03+1] Switch druid1004 zookeeper node with druid1009 [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:33:17] (03PS2) 10Jbond: profile::pki::multirootca: Add initial CRL [puppet] - 10https://gerrit.wikimedia.org/r/968696 (https://phabricator.wikimedia.org/T340543) [14:34:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/191/con" [puppet] - 10https://gerrit.wikimedia.org/r/968696 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [14:36:58] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::pki::multirootca: Add initial CRL [puppet] - 10https://gerrit.wikimedia.org/r/968696 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [14:37:59] (03CR) 10Btullis: [C: 03+1] "Looks good. Apologies for the delay in responding."
[puppet] - 10https://gerrit.wikimedia.org/r/955927 (owner: 10Muehlenhoff) [14:38:40] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:08] James_F: Looks like there's a lingering ssh connection to mw2332.codfw.wmnet. Was that the problem host? [14:39:21] (03CR) 10Btullis: [V: 03+1 C: 03+2] Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:39:22] dancy: No idea, no feedback from scap. Possibly. [14:39:29] (03PS1) 10Ssingh: Revert "hiera: remove dns1004 for authdns_servers for reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/968320 [14:39:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1004.wikimedia.org with reason: host reimage [14:40:08] (03PS1) 10Jbond: pki: remove leading whitespace [puppet] - 10https://gerrit.wikimedia.org/r/968700 [14:41:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/192/console" [puppet] - 10https://gerrit.wikimedia.org/r/968700 (owner: 10Jbond) [14:42:35] (03CR) 10CI reject: [V: 04-1] pki: remove leading whitespace [puppet] - 10https://gerrit.wikimedia.org/r/968700 (owner: 10Jbond) [14:43:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1004.wikimedia.org with reason: host reimage [14:43:18] (03PS2) 10Jbond: pki: remove leading whitespace [puppet] - 10https://gerrit.wikimedia.org/r/968700 (https://phabricator.wikimedia.org/T340543) [14:44:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/193/con" [puppet] - 10https://gerrit.wikimedia.org/r/968700 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [14:46:34] (03CR) 10Brouberol: [C: 03+1] Switch druid1006 zookeeper node with druid1011 [puppet] - 10https://gerrit.wikimedia.org/r/965501 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:46:45] (03CR) 10Brouberol: [C: 03+1] Switch druid1005 zookeeper node with druid1010 [puppet] - 10https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:47:04] (03CR) 10Brouberol: [C: 03+1] Switch druid1004 zookeeper node with druid1009 [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:47:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki: remove leading whitespace [puppet] - 10https://gerrit.wikimedia.org/r/968700 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [14:48:01] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:50:31] (03PS1) 10Jbond: pki: remove trailing quote [puppet] - 10https://gerrit.wikimedia.org/r/968708 (https://phabricator.wikimedia.org/T340543) [14:50:37] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/194/con" [puppet] - 10https://gerrit.wikimedia.org/r/968708 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [14:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:53:40] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:04:55] (03CR) 10Herron: [C: 03+1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:05:14] (03CR) 10Herron: [C: 03+1] alertmanager: sanitise silence audit log [puppet] - 10https://gerrit.wikimedia.org/r/968615 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [15:05:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki: remove trailing quote [puppet] - 10https://gerrit.wikimedia.org/r/968708 (https://phabricator.wikimedia.org/T340543) (owner: 10Jbond) [15:07:27] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:07:51] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1004.wikimedia.org with OS bookworm [15:09:01] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: remove dns1004 for authdns_servers for reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/968320 (owner: 10Ssingh) [15:10:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] acme_chief: add new puppet intermediate 
CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [15:15:08] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [15:16:27] (03PS1) 10Jbond: apereo_cas: remove recurse on log dir. [puppet] - 10https://gerrit.wikimedia.org/r/968711 [15:22:26] (03CR) 10Cwhite: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [15:24:58] (03CR) 10Cwhite: "Might be waiting on I5be6766bc351963a88fc01e3415598bc1b78945f for mwdebug hosts." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [15:26:46] (03CR) 10Jbond: [C: 03+2] apereo_cas: remove recurse on log dir. [puppet] - 10https://gerrit.wikimedia.org/r/968711 (owner: 10Jbond) [15:32:32] (03PS1) 10Zoranzoki21: Enable block feature for AbuseFilter on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968713 (https://phabricator.wikimedia.org/T349727) [15:34:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:53] (03PS38) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) [15:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:37] (03PS39) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) [15:50:16] (03PS1) 10AikoChou: ml-services: set OMP_NUM_THREADS and 
OMP_THREAD_LIMIT in rr-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/968715 (https://phabricator.wikimedia.org/T347551) [15:50:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:51:30] DannyS712: You could CC more on more patches. ;-) [15:52:01] (03PS1) 10Ayounsi: Add helper bash function to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [15:52:26] (03CR) 10CI reject: [V: 04-1] Add helper bash function to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [15:52:31] (03PS2) 10Ayounsi: Add helper bash function to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [15:52:59] (03CR) 10CI reject: [V: 04-1] Add helper bash function to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [15:53:01] (03PS3) 10Ayounsi: Add helper bash function to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [15:53:27] (03CR) 10CI reject: [V: 04-1] Add helper bash function to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [15:54:33] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/968715 (https://phabricator.wikimedia.org/T347551) (owner: 10AikoChou) [15:57:32] (03PS4) 10Ayounsi: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [15:57:54] (03CR) 10CI reject: [V: 04-1] Add helper 
functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [15:58:40] (03PS5) 10Ayounsi: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [15:58:59] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [15:59:24] (03PS2) 10AikoChou: ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/968715 (https://phabricator.wikimedia.org/T347551) [16:01:06] (03CR) 10CI reject: [V: 04-1] Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [16:02:08] (03PS3) 10AikoChou: ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/968715 (https://phabricator.wikimedia.org/T347551) [16:03:11] (03PS6) 10Ayounsi: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [16:03:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:56] (03CR) 10AikoChou: [C: 03+2] ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/968715 (https://phabricator.wikimedia.org/T347551) (owner: 10AikoChou) [16:04:21] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10ayounsi) Going through this task after discussing proxies in another group.
The side of this task about forcing sane no_proxy settings is done tha... [16:04:56] (03Merged) 10jenkins-bot: ml-services: set OMP_NUM_THREADS and OMP_THREAD_LIMIT in rr-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/968715 (https://phabricator.wikimedia.org/T347551) (owner: 10AikoChou) [16:05:46] (03CR) 10CI reject: [V: 04-1] Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [16:07:09] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [16:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:08:13] (03PS7) 10Ayounsi: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [16:08:21] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [16:11:59] (03CR) 10jenkins-bot: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [16:15:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:12] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) [16:16:36] 10SRE, 10ops-eqiad, 10serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (10ayounsi) 05Open→03Resolved All good! 
[16:19:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:22] (03PS1) 10BCornwall: hiera: remove dns1005 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968721 (https://phabricator.wikimedia.org/T342154) [16:34:25] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns1005 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968721 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:36:23] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns1005 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968721 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:38:54] (03PS1) 10Jforrester: wikifunctions: Switch JS evaluator to WASM, drop staging copy [deployment-charts] - 10https://gerrit.wikimedia.org/r/968722 (https://phabricator.wikimedia.org/T343829) [16:38:57] (03CR) 10Ottomata: [C: 03+1] "Nice! LGTM." 
[alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) (owner: 10Gmodena) [16:39:14] jouncebot: nowandnext [16:39:14] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [16:39:14] In 0 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1700) [16:41:24] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch JS evaluator to WASM, drop staging copy [deployment-charts] - 10https://gerrit.wikimedia.org/r/968722 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester) [16:42:10] (03Merged) 10jenkins-bot: wikifunctions: Switch JS evaluator to WASM, drop staging copy [deployment-charts] - 10https://gerrit.wikimedia.org/r/968722 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester) [16:44:02] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:44:48] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:45:07] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:45:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns1005.wikimedia.org with OS bookworm [16:45:26] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns1005.wikimedia.org with OS bookworm [16:45:30] (03CR) 10Hnowlan: [C: 03+1] Decommission restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349526) (owner: 10Eevans) [16:46:01] (03CR) 10Hnowlan: [C: 03+1] Decommission restbase1017 [puppet] - 
10https://gerrit.wikimedia.org/r/968340 (https://phabricator.wikimedia.org/T349526) (owner: 10Eevans) [16:46:20] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:46:23] (03CR) 10Hnowlan: [C: 03+1] Decommission restbase1016 [puppet] - 10https://gerrit.wikimedia.org/r/968339 (https://phabricator.wikimedia.org/T349526) (owner: 10Eevans) [16:46:24] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:47:12] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:47:15] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:47:45] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:47:58] ^Me [16:48:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [16:49:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:07] PROBLEM - Host 2620:0:861:2:208:80:154:153 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:47] PROBLEM - Recursive DNS on 208.80.154.153 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [16:56:32] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) @Jclark-ctr hi! Yes, we can schedule, though we can't take many hosts at the same time, so will have to be done little by litt... 
[16:57:43] RECOVERY - Host 2620:0:861:2:208:80:154:153 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:59:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1005.wikimedia.org with reason: host reimage [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1700) [17:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:19] !log temporarily increasing log level to trace for eventgate-logging-external in eqiad canary release only - T347477 [17:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:25] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:02:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1005.wikimedia.org with reason: host reimage [17:02:38] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:03:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:28] T347477: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 [17:04:32] 10SRE, 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10CodeReviewBot) brett merged https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/4 Implement Gitlab CI and Blubber config [17:05:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:06:25] PROBLEM - Recursive DNS on 208.80.154.153 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:09:52] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:10:10] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:10:33] RECOVERY - Recursive DNS on 208.80.154.153 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:15:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:23] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:15:35] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:18:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:54] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:21:07] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:23:15] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:55] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:28:51] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:28:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1005.wikimedia.org with OS bookworm [17:29:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns1005.wikimedia.org with OS bookworm completed: - dns1005 (**PASS**) - Downtimed on Icinga/Al... [17:30:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:34:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:13] (03PS1) 10BCornwall: Revert "hiera: remove dns1005 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968323 [17:38:03] (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns1005 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968323 (owner: 10BCornwall) [17:41:09] (03PS40) 10Andrea Denisse: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) [17:45:03] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns1005 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968323 (owner: 10BCornwall) [17:58:01] (03PS1) 10Jbond: 
puppet::agent: allow to disable certificate revocation [puppet] - 10https://gerrit.wikimedia.org/r/968731 [17:58:21] (03CR) 10Jbond: [C: 03+2] puppet::agent: allow to disable certificate revocation [puppet] - 10https://gerrit.wikimedia.org/r/968731 (owner: 10Jbond) [17:59:26] (03PS1) 10Eevans: install_server: configure for initial install of restbase20[28-35] [puppet] - 10https://gerrit.wikimedia.org/r/968732 (https://phabricator.wikimedia.org/T348474) [18:00:04] dancy and brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1800). [18:00:05] dancy and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T1800). [18:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:51] I will advance the train in about 15-20 minutes. 
[18:04:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:04:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:08] (03CR) 10Eevans: [C: 03+2] Decommission restbase1016 [puppet] - 10https://gerrit.wikimedia.org/r/968339 (https://phabricator.wikimedia.org/T349526) (owner: 10Eevans) [18:08:16] (03PS1) 10Jbond: puppet: allow to override certificate_revocation even on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/968733 [18:11:06] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase1016.eqiad.wmnet [18:11:19] (03PS15) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:11:47] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:13:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/195/console" [puppet] - 10https://gerrit.wikimedia.org/r/968733 (owner: 10Jbond) [18:13:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] sretest: remove old test [puppet] - 10https://gerrit.wikimedia.org/r/967467 (owner: 10Jbond) [18:14:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet: allow to override certificate_revocation even on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/968733 (owner: 10Jbond) [18:14:08] (dancy - just back at main keyboard, will keep one eye on logs.) 
[18:16:37] (03PS1) 10BCornwall: hiera: remove dns1006 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968735 (https://phabricator.wikimedia.org/T342154) [18:17:30] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [18:19:45] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:50] (03PS1) 10Jbond: puppet::agent: allow 'false' for certificate_revocation [puppet] - 10https://gerrit.wikimedia.org/r/968736 [18:19:55] (03PS1) 10Zoranzoki21: Add throttle rule for Edit-a-Thon on 2023-11-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) [18:20:15] Alright, let's do this thing. [18:20:35] (03CR) 10Jbond: [C: 03+2] puppet::agent: allow 'false' for certificate_revocation [puppet] - 10https://gerrit.wikimedia.org/r/968736 (owner: 10Jbond) [18:20:39] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968738 (https://phabricator.wikimedia.org/T348355) [18:20:42] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968738 (https://phabricator.wikimedia.org/T348355) (owner: 10TrainBranchBot) [18:21:05] (03PS2) 10Zoranzoki21: Add throttle rule for Edit-a-Thon on 2023-11-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) [18:21:33] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968738 (https://phabricator.wikimedia.org/T348355) (owner: 10TrainBranchBot) [18:22:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 42.59% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:23:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:25:24] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/968739 [18:28:02] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.2 refs T348355 [18:28:07] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355 [18:29:05] (03PS16) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:29:32] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:30:22] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/968739 (owner: 10Ebernhardson) [18:31:07] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/968739 (owner: 10Ebernhardson) [18:31:24] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns1006 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968735 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:32:39] (03PS17) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:32:42] !log ebernhardson@deploy2002 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [18:32:55] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:33:06] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:33:55] !log dancy@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.2 refs T348355 (duration: 05m 52s) [18:34:00] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355 [18:35:16] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns1006 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968735 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:35:42] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:39:22] (03PS18) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:42:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:43:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:46:41] (03CR) 10Dwisehaupt: "Just down to figuring out how we want to effectively check the schema load and 
grants. Everything else should be addressed at this point." [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:47:24] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 44.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:48:39] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 43.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:03:39] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:04:47] (03PS1) 10Ssingh: 10.in-addr.arpa: remove redundant entry for 159.64.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/968743 [19:05:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:05:48] (03CR) 10CI reject: [V: 04-1] 10.in-addr.arpa: remove redundant entry for 159.64.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/968743 (owner: 10Ssingh) [19:07:16] (03PS2) 10Ssingh: 10.in-addr.arpa: remove redundant entry for 159.64.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/968743 [19:10:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:12:53] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:14:43] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: check if it makes vlan1054 records - cmooney@cumin1001" [19:15:08] (03CR) 10Gmodena: [C: 03+1] "+1." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [19:16:07] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: check if it makes vlan1054 records - cmooney@cumin1001" [19:16:08] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:16:12] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:17:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:40] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [19:18:50] (03Abandoned) 10Ssingh: 10.in-addr.arpa: remove redundant entry for 159.64.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/968743 (owner: 10Ssingh) [19:19:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:19:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [19:19:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase1016.eqiad.wmnet [19:20:34] 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1016 - https://phabricator.wikimedia.org/T349709 (10Eevans) 
[19:20:43] !log sukhe@cumin2002:~$ sudo cumin 'A:dns-rec' "enable-puppet 'wait before enabling'" [19:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:57] (03PS2) 10Eevans: Decommission restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/968340 (https://phabricator.wikimedia.org/T349710) [19:21:59] (03PS2) 10Eevans: Decommission restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349711) [19:22:53] (03CR) 10Eevans: [C: 03+2] Decommission restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/968340 (https://phabricator.wikimedia.org/T349710) (owner: 10Eevans) [19:24:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:25:39] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase1017.eqiad.wmnet [19:26:26] (03CR) 10Ottomata: [C: 03+1] Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [19:26:51] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro that works for me let me know what 5 are available tomorrow and i will start. Thanks! 
[19:26:54] PROBLEM - Bird Internet Routing Daemon on dns1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:27:18] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:25] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns1006.wikimedia.org with OS bookworm [19:27:32] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns1006.wikimedia.org with OS bookworm [19:28:22] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/968744 [19:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:03] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [19:34:20] PROBLEM - Host 2620:0:861:3:208:80:154:77 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:3:208:80:154:77) [19:34:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:11] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [19:36:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
restbase1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [19:36:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase1017.eqiad.wmnet [19:36:50] PROBLEM - Recursive DNS on 208.80.154.77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:37:15] 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1017 - https://phabricator.wikimedia.org/T349710 (10Eevans) [19:37:31] (03CR) 10Eevans: [C: 03+2] Decommission restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349711) (owner: 10Eevans) [19:40:16] RECOVERY - Host 2620:0:861:3:208:80:154:77 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [19:40:18] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase1018.eqiad.wmnet [19:41:23] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/968744 (owner: 10Ebernhardson) [19:41:54] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1006.wikimedia.org with reason: host reimage [19:42:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:42:23] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/968744 (owner: 10Ebernhardson) [19:42:26] PROBLEM - Recursive DNS on 2620:0:861:3:208:80:154:77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:44:36] !log 
brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1006.wikimedia.org with reason: host reimage [19:44:44] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:44:53] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:47:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 44.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:48:28] 10SRE, 10Cassandra, 10Data-Persistence: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:49:06] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Eevans) [19:49:10] 10SRE, 10Cassandra, 10Data-Persistence: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) 05Open→03Resolved macro-deployed [19:50:49] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [19:51:03] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1017 - https://phabricator.wikimedia.org/T349710 (10Eevans) [19:51:17] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1016 - https://phabricator.wikimedia.org/T349709 (10Eevans) [19:52:58] RECOVERY - Recursive DNS on 208.80.154.77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:53:12] RECOVERY - Recursive DNS on 2620:0:861:3:208:80:154:77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:55:50] (03PS1) 10Jbond: puppet_ca_certs: We need to have the full chain for client auth [puppet] - 10https://gerrit.wikimedia.org/r/968748 [19:56:07] 
!log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase1018.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [19:56:31] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/968748 (owner: 10Jbond) [19:56:59] (03PS2) 10Jbond: puppet_ca_certs: We need to have the full chain for client auth [puppet] - 10https://gerrit.wikimedia.org/r/968748 [19:57:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase1018.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [19:57:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase1018.eqiad.wmnet [19:58:13] 10ops-eqiad, 10Cassandra, 10decommission-hardware, 10Patch-For-Review: decommission restbase1018 - https://phabricator.wikimedia.org/T349711 (10Eevans) [19:58:21] 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10wiki_willy) [19:58:44] 10SRE, 10Cassandra, 10Data-Persistence: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:15] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10odimitrijevic) Approved! 
[20:00:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:25] (03PS1) 10Ebernhardson: cirrus updater: Turn on two med-large wikis in the producer [deployment-charts] - 10https://gerrit.wikimedia.org/r/968750 [20:01:45] RECOVERY - Bird Internet Routing Daemon on dns1006 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [20:01:57] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:02:19] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:03:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:50] (03PS5) 10Andrew Bogott: add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) [20:04:49] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1006.wikimedia.org with OS bookworm [20:05:00] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns1006.wikimedia.org with OS bookworm completed: - dns1006 (**PASS**) - Downtimed on Icinga/Al... 
[20:06:14] (03CR) 10Andrew Bogott: add domain param to openstack backend (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [20:06:36] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Turn on two med-large wikis in the producer [deployment-charts] - 10https://gerrit.wikimedia.org/r/968750 (owner: 10Ebernhardson) [20:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:07:19] (03PS1) 10Eevans: site.pp: cleanup decommissioned restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/968751 (https://phabricator.wikimedia.org/T328490) [20:07:25] (03Merged) 10jenkins-bot: cirrus updater: Turn on two med-large wikis in the producer [deployment-charts] - 10https://gerrit.wikimedia.org/r/968750 (owner: 10Ebernhardson) [20:10:10] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:10:25] (03CR) 10CI reject: [V: 04-1] add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [20:11:06] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:15:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:15:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:09] 
PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:47] (03PS1) 10Ebernhardson: cirrus updater: Correct list of wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/968752 [20:18:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:39] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 44.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:20:44] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Correct list of wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/968752 (owner: 10Ebernhardson) [20:21:30] (03Merged) 10jenkins-bot: cirrus updater: Correct list of wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/968752 (owner: 10Ebernhardson) [20:21:45] (03PS1) 10BCornwall: Revert "hiera: remove dns1006 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968324 [20:22:20] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns1006 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968324 (owner: 10BCornwall) [20:22:28] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10RobH) [20:22:38] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - 
https://phabricator.wikimedia.org/T349758 (10RobH) [20:23:55] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:24:07] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:26:39] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:27:32] (03CR) 10Ssingh: [C: 03+1] site.pp: cleanup decommissioned restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/968751 (https://phabricator.wikimedia.org/T328490) (owner: 10Eevans) [20:32:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:37:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:48:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:49:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:54:09] (PHPFPMTooBusy) resolved: Not 
enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:54:21] (03PS1) 10Jforrester: wikifunctions: Bump evaluators to latest, less-noisy logging for prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/968755 [20:54:27] (03PS1) 10Umherirrender: Check key from OutputPage::getCategoryLinks [skins/Timeless] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968325 (https://phabricator.wikimedia.org/T349747) [20:54:47] (03PS1) 10Umherirrender: diff: Fix LinkRenderer method call [core] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968766 (https://phabricator.wikimedia.org/T349726) [20:55:18] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump evaluators to latest, less-noisy logging for prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/968755 (owner: 10Jforrester) [20:56:09] (03Merged) 10jenkins-bot: wikifunctions: Bump evaluators to latest, less-noisy logging for prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/968755 (owner: 10Jforrester) [20:57:23] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:58:13] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:59:12] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231025T2100) [21:00:15] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:00:18] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:01:15] !log jforrester@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/wikifunctions: apply [21:05:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:07:10] (03CR) 10Eevans: [C: 03+2] site.pp: cleanup decommissioned restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/968751 (https://phabricator.wikimedia.org/T328490) (owner: 10Eevans) [21:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:22] (03CR) 10Anzx: Add throttle rule for Edit-a-Thon on 2023-11-03 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: 10Zoranzoki21) [21:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:16] (03PS6) 10Andrew Bogott: add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) [21:43:07] (03CR) 10CI reject: [V: 04-1] add domain param to openstack 
backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [21:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:25] (03PS7) 10Andrew Bogott: add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) [21:49:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Andrew) [21:53:21] (03CR) 10CI reject: [V: 04-1] add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [21:53:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Andrew) These hosts have four drives will be fine with just one SW raid, so "partman/standard.cfg partman/raid10-4dev.cfg" look... [21:54:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Andrew) [21:54:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Andrew) Note that I also changed the distro to Bookworm. We're currently upgrading all our existing hosts to Bookworm. 
[21:57:40] (03PS1) 10Andrew Bogott: Test patch to see if the linter is misbehaving. [software/cumin] - 10https://gerrit.wikimedia.org/r/968762 [22:01:59] jouncebot: nowandnext [22:01:59] No deployments scheduled for the next 7 hour(s) and 58 minute(s) [22:01:59] In 7 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T0600) [22:01:59] In 7 hour(s) and 58 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T0600) [22:02:14] OK, I'm going to quickly sync out a UBN. [22:02:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968766 (https://phabricator.wikimedia.org/T349726) (owner: 10Umherirrender) [22:04:24] (03CR) 10CI reject: [V: 04-1] Test patch to see if the linter is misbehaving. [software/cumin] - 10https://gerrit.wikimedia.org/r/968762 (owner: 10Andrew Bogott) [22:05:29] Or apparently two UBNs. fun. 
[22:07:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:08:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:09:01] (03CR) 10Jforrester: [C: 03+2] Check key from OutputPage::getCategoryLinks [skins/Timeless] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968325 (https://phabricator.wikimedia.org/T349747) (owner: 10Umherirrender) [22:12:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:14:31] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:14:47] Whee, aren't the CI tests fast? [22:15:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:17:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.319 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:18:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:19:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:41] (03Merged) 10jenkins-bot: diff: Fix LinkRenderer method call [core] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968766 (https://phabricator.wikimedia.org/T349726) (owner: 10Umherirrender) [22:19:42] (03Merged) 10jenkins-bot: Check key from OutputPage::getCategoryLinks [skins/Timeless] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968325 (https://phabricator.wikimedia.org/T349747) (owner: 10Umherirrender) [22:20:47] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:968766|diff: Fix LinkRenderer method call (T349726)]] [22:20:52] T349726: "Revision as of" on diff links to incorrect target - https://phabricator.wikimedia.org/T349726 [22:22:10] !log jforrester@deploy2002 jforrester and umherirrender: Backport for [[gerrit:968766|diff: Fix LinkRenderer method call (T349726)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:22:47] !log jforrester@deploy2002 jforrester and umherirrender: Continuing with sync [22:26:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:09] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:968766|diff: Fix LinkRenderer method call (T349726)]] (duration: 07m 21s) [22:28:13] T349726: "Revision as of" on diff links to 
incorrect target - https://phabricator.wikimedia.org/T349726 [22:28:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:30:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:39:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:39:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:40:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:46] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 
(10RobH) [23:02:34] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10RobH) [23:04:12] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install elastic2087-2092 - https://phabricator.wikimedia.org/T349778 (10RobH) [23:04:41] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install elastic2087-2092 - https://phabricator.wikimedia.org/T349778 (10RobH) [23:05:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:11] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2092 - https://phabricator.wikimedia.org/T349778 (10RobH) [23:06:22] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10RobH) [23:08:41] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2093-2110 - https://phabricator.wikimedia.org/T349780 (10RobH) [23:09:09] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2093-2110 - https://phabricator.wikimedia.org/T349780 (10RobH) [23:15:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:40] (03CR) 10Gergő Tisza: [C: 03+1] changeprop: Increase refreshUserImpactJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/968636 
(https://phabricator.wikimedia.org/T344428) (owner: 10Urbanecm) [23:35:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:49:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:50:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:50:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:51:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:56:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy