[00:00:28] RECOVERY - haproxy alive on cloudlb1002 is OK: OK check_alive uptime 354s https://wikitech.wikimedia.org/wiki/HAProxy [00:16:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:18:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:22:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:28:49] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1000295 [00:38:50] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1000295 (owner: 10TrainBranchBot) [00:41:50] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb1002 is OK: OK: UP (pid=1164045) and all threads (17) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [00:42:00] RECOVERY - Bird Internet Routing Daemon on 
cloudlb1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [00:42:08] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:42:36] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:46:40] (03PS2) 10RLazarus: mediawiki: Support one-off jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) [00:46:42] (03PS2) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) [00:46:49] (03PS2) 10RLazarus: deployment_server: Add mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) [00:48:19] (03CR) 10CI reject: [V: 04-1] Add helmfile for running MediaWiki one-off jobs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [01:03:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1000295 (owner: 10TrainBranchBot) [01:05:18] (03CR) 10RLazarus: "> When we're closer to a 100% failure rate it doesn't matter to have pinpoint accuracy as we're totally down anyway; Having 105 bad reques" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [01:06:49] (03CR) 10RLazarus: "Same request as at https://gerrit.wikimedia.org/r/c/973871/comments/8e49812e_ed2f60f1, otherwise LGTM." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [05:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:15:10] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:15:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:16:18] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:16:22] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:25:42] (03CR) 10Vgutierrez: Add module for ncmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [07:01:03] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate stat1005.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:44:45] !log upload golang-github-u-root-u-root_0.11.0 to apt.wm.o (bookworm) [07:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 19m 32s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:54:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 17m 12s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:00:06] Amir1 
and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T0800). [08:00:06] hubaishan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:01:51] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144 (10MoritzMuehlenhoff) [08:03:42] !log taavi@cumin1002 conftool action : set/pooled=inactive; selector: name=cloudweb1003.wikimedia.org [08:10:47] !log update netboot image for Bullseye 11.9 point release T357144 [08:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:52] T357144: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144 [08:11:09] !log set esams NL-IX peering as primary [08:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:18] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudweb1003.wikimedia.org with OS bullseye [08:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:18:49] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:23:42] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [08:23:49] (JobUnavailable) resolved: Reduced availability for job nutcracker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:23:55] !log update netboot image for Bookworm 12.5 point release T357133 [08:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:00] T357133: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133 [08:25:50] (Traffic bill over quota) firing: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [08:26:07] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [08:26:36] !log hashar@deploy2002 Started deploy [integration/docroot@2360fa1]: Updating eslint-config-wikimedia and mediawiki-phan-config [08:26:42] !log hashar@deploy2002 Finished deploy [integration/docroot@2360fa1]: Updating eslint-config-wikimedia and mediawiki-phan-config (duration: 00m 06s) [08:29:46] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:50] (Traffic bill over quota) firing: (2) Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [08:36:20] (03CR) 10Muehlenhoff: [C: 03+2] ulogd: Make class ensurable [puppet] - 
10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [08:38:56] Amir1 when to apply https://gerrit.wikimedia.org/r/1000601 [08:40:53] (03PS7) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [08:45:50] (Traffic bill over quota) firing: (2) Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [08:47:45] (03PS1) 10Majavah: lxc: Rely on default network config [puppet] - 10https://gerrit.wikimedia.org/r/1002357 (https://phabricator.wikimedia.org/T356551) [08:50:50] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [08:51:25] (03CR) 10Hashar: [C: 03+2] Bump javascript from es2018 to es2020 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999902 (owner: 10Hashar) [08:51:59] (03Merged) 10jenkins-bot: Bump javascript from es2018 to es2020 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999902 (owner: 10Hashar) [08:52:26] !log hashar@deploy2002 Started deploy [gerrit/gerrit@db69b2b]: Bump javascript from es2018 to es2020 [08:52:33] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@db69b2b]: Bump javascript from es2018 to es2020 (duration: 00m 07s) [08:53:45] (03PS1) 10Brouberol: superset: set puppet service states to production [puppet] - 10https://gerrit.wikimedia.org/r/1002362 (https://phabricator.wikimedia.org/T356483) [08:54:13] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1003.wikimedia.org with OS bullseye [08:54:54] !log taavi@cumin1002 conftool action : set/pooled=no; selector: name=cloudweb1003.wikimedia.org [08:54:56] (03PS2) 10Brouberol: superset: set 
puppet service states to production [puppet] - 10https://gerrit.wikimedia.org/r/1002362 (https://phabricator.wikimedia.org/T356483) [08:56:55] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) >>! In T300152#9514644, @bking wrote: > @ayounsi Apologies for the trouble, I didn't realize `sretest2005` was in active use. Unfortunatel... [08:58:08] (03CR) 10Jelto: [C: 03+1] "lgtm, I checked the image and the landing page contains a link to the dumps now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000025 (https://phabricator.wikimedia.org/T317436) (owner: 10Dzahn) [08:58:19] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=cloudweb1003.wikimedia.org [09:00:26] (03CR) 10Vgutierrez: [C: 04-1] Add module for ncmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [09:01:33] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T357263 (10Yk32dx) [09:04:14] (03PS1) 10Majavah: hieradata: openstack: restrict eqiad1 memcached traffic to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/1002372 (https://phabricator.wikimedia.org/T355417) [09:05:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [09:07:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1002372 (https://phabricator.wikimedia.org/T355417) (owner: 10Majavah) [09:07:40] (03CR) 10Majavah: [C: 03+2] hieradata: openstack: restrict eqiad1 memcached traffic to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/1002372 (https://phabricator.wikimedia.org/T355417) (owner: 10Majavah) [09:10:24] (03PS2) 10Brouberol: superset: setup dyna mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) [09:11:54] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133 (10MoritzMuehlenhoff) [09:12:07] \exit [09:12:13] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144 (10MoritzMuehlenhoff) [09:12:36] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) (owner: 10Brouberol) [09:13:07] (03CR) 10Brouberol: "It is the port of the NodePort service for the TCP/SSL ingress of our cluster, yes" [puppet] - 10https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) (owner: 10Brouberol) [09:16:20] !log installing java 8 security updates on Buster [09:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:29] (03PS2) 10Brouberol: Superset: setup temporary external domains for the k8s deployments [dns] - 10https://gerrit.wikimedia.org/r/997859 (https://phabricator.wikimedia.org/T356482) [09:21:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P56672 and previous config saved to /var/cache/conftool/dbconfig/20240212-092140-ladsgroup.json [09:21:50] !log restarting archiva to pick up Java security updates [09:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:30] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host puppetserver1002.eqiad.wmnet [09:35:59] (03PS1) 10Filippo Giunchedi: multirootca: depend on cfssl when generating CRLs [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) [09:36:01] (03PS1) 10Filippo Giunchedi: puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) [09:36:03] (03PS1) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [09:36:05] (03PS1) 10Filippo Giunchedi: postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [09:36:07] (03PS1) 10Filippo Giunchedi: postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) [09:36:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1002.eqiad.wmnet [09:36:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P56674 and previous config saved to /var/cache/conftool/dbconfig/20240212-093645-ladsgroup.json [09:46:54] (03CR) 10Filippo Giunchedi: "I have been testing this change in Pontoon to enable puppetserver support, let me know what you think!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [09:49:05] (03CR) 10Filippo Giunchedi: "This is both for symmetry reasons and being able to use non-secret() key material for puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [09:50:30] (03CR) 10Filippo Giunchedi: "The patch enables a functioning and fully configured postgresql instance from the first puppet run, and the proper solution to my hacky at" [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [09:51:02] (03CR) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [09:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P56675 and previous config saved to /var/cache/conftool/dbconfig/20240212-095150-ladsgroup.json [09:54:00] (03PS1) 10Muehlenhoff: netbox: Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 [09:54:08] (03PS2) 10Muehlenhoff: netbox: Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 [09:55:36] (03CR) 10CI reject: [V: 04-1] netbox: Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 (owner: 10Muehlenhoff) [09:56:45] (03PS2) 10Hashar: Gerrit 3.8 no more set redundant real_author [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999928 (https://phabricator.wikimedia.org/T354886) [09:57:44] (03PS3) 10Muehlenhoff: netbox: Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 [10:01:34] (03CR) 10Muehlenhoff: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002392 (owner: 10Muehlenhoff) [10:04:52] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: absent check_systemd_state [puppet] - 10https://gerrit.wikimedia.org/r/998924 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:06:00] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544" [10:06:06] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [10:06:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P56676 and previous config saved to /var/cache/conftool/dbconfig/20240212-100655-ladsgroup.json [10:07:34] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/997859 (https://phabricator.wikimedia.org/T356482) (owner: 10Brouberol) [10:07:43] (03PS4) 10Muehlenhoff: netbox: Set deployment method to avoid creating scap target [puppet] - 10https://gerrit.wikimedia.org/r/1002392 [10:07:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544" [10:08:26] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002362 (https://phabricator.wikimedia.org/T356483) (owner: 10Brouberol) [10:10:34] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:10:46] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:10:58] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:10:58] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:11:04] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:11:10] lots of puppet failures ongoing [10:12:29] lots of ps failures [10:12:53] is there maintenance on the mgmt network? [10:13:37] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T357263 (10hnowlan) 05Open→03Resolved a:03hnowlan Required fields empty, no domain specified. Declining. [10:15:15] (03PS1) 10Arnaudb: mariadb: last test for new clone cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1000299 (https://phabricator.wikimedia.org/T343674) [10:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:16:21] (03Abandoned) 10Arnaudb: mariadb: last test for new clone cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1000299 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:16:49] topranks: ^ you seem to be doing something on the codfw asws [10:17:59] main metric signals seem unaffected [10:18:49] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[10:19:50] codfw power usage reported as 0 at the moment, but it is a metrics deficit [10:19:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: T343674 testing cloning a single instance node to a multi-instance one [10:19:52] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 138881 [10:19:56] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [10:20:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: T343674 testing cloning a single instance node to a multi-instance one [10:20:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2190 T343674', diff saved to https://phabricator.wikimedia.org/P56677 and previous config saved to /var/cache/conftool/dbconfig/20240212-102046-arnaudb.json [10:20:55] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 138881 [10:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:23:05] jynus: it seems we lost mr1-codfw (cc topranks, XioNoX) [10:23:48] I can't ssh to it nor ping it both from outside and inside the network [10:24:20] how did you identify it was the router? [10:24:22] indeed, I don't seem to be able to reach it via the OOB interface either [10:24:47] (03CR) 10Brouberol: [C: 03+2] Superset: setup temporary external domains for the k8s deployments [dns] - 10https://gerrit.wikimedia.org/r/997859 (https://phabricator.wikimedia.org/T356482) (owner: 10Brouberol) [10:24:50] taavi: mr1 is the OOB :D [10:25:10] I guess he meant the direct external interface [10:25:24] yes [10:25:24] or just guessed based on the loss of everything else? 
[10:26:17] volans: nothing that can be done without onsite hands, maybe it will come back on its own before dcops gets there [10:26:27] I know, just FYI [10:26:34] th [10:26:35] x [10:27:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1002392 (owner: 10Muehlenhoff) [10:29:01] jynus: to answer your question, I opened icinga, saw the 4 blocking network outages for all 4 rows in codfw, then looked at the 57 unreachable devices and they are all hosts in the mgmt network [10:29:14] so I tried to connect to mr1-codfw [10:29:14] yes, I got there [10:29:39] it's that last jump I didn't do [10:37:38] (03CR) 10Muehlenhoff: [C: 03+2] "Good catch! I've created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002392 to address this." [puppet] - 10https://gerrit.wikimedia.org/r/997790 (owner: 10Muehlenhoff) [10:37:42] (03PS1) 10Arnaudb: mariadb: decom db1133 [puppet] - 10https://gerrit.wikimedia.org/r/1000300 (https://phabricator.wikimedia.org/T350458) [10:44:48] (03PS1) 10Arnaudb: mariadb: removes db1135 [puppet] - 10https://gerrit.wikimedia.org/r/1000301 (https://phabricator.wikimedia.org/T350458) [10:46:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:51:26] 10SRE, 10Infrastructure-Foundations: Further enhancements for nftables support in profile::firewall - https://phabricator.wikimedia.org/T348498 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1100) [11:01:03] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate stat1005.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:01:08] (03PS1) 10Arnaudb: mariadb: removes 
db1139 [puppet] - 10https://gerrit.wikimedia.org/r/1000302 (https://phabricator.wikimedia.org/T350458) [11:03:14] (03PS1) 10Arnaudb: mariadb: removes db1140 [puppet] - 10https://gerrit.wikimedia.org/r/1000303 (https://phabricator.wikimedia.org/T350458) [11:04:30] 10SRE: Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% - https://phabricator.wikimedia.org/T357198 (10hnowlan) Could `Too many eqiad mediawiki originals uploads` be a red herring? The traffic jumps are all in codfw. I'm not sure what's actionable in this ticket. [11:05:48] (03PS1) 10Arnaudb: mariadb: removes db1144 [puppet] - 10https://gerrit.wikimedia.org/r/1000304 (https://phabricator.wikimedia.org/T350458) [11:06:42] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:08:07] (03PS1) 10Arnaudb: mariadb: removes db1145 [puppet] - 10https://gerrit.wikimedia.org/r/1000305 (https://phabricator.wikimedia.org/T350458) [11:08:23] (03CR) 10Fabfur: [C: 03+1] "gtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) (owner: 10Brouberol) [11:09:04] 10ops-codfw: mr1-eqiad down - https://phabricator.wikimedia.org/T357291 (10ayounsi) [11:09:45] (03CR) 10Brouberol: [C: 03+2] superset: setup dyna mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) (owner: 10Brouberol) [11:10:43] (03PS1) 10Arnaudb: mariadb: removes db1146 [puppet] - 10https://gerrit.wikimedia.org/r/1002406 (https://phabricator.wikimedia.org/T350458) [11:11:57] 10ops-codfw: mr1-codfw down - https://phabricator.wikimedia.org/T357291 (10ayounsi) [11:12:41] (03PS1) 10Arnaudb: mariadb: removes db1149 [puppet] - 10https://gerrit.wikimedia.org/r/1002407 (https://phabricator.wikimedia.org/T350458) [11:13:43] 10ops-codfw: mr1-codfw down - https://phabricator.wikimedia.org/T357291 (10ayounsi) [11:20:52] (03PS1) 10Hnowlan: admin: add jaz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1002397 (https://phabricator.wikimedia.org/T356917) [11:22:30] (03CR) 10CI reject: [V: 04-1] admin: add jaz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1002397 (https://phabricator.wikimedia.org/T356917) (owner: 10Hnowlan) [11:24:55] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10TimedMediaHandler, 10media-backups: Consider increasing $wgTranscodeBackgroundSizeLimit to 5GB - https://phabricator.wikimedia.org/T357184 (10TheDJ) TranscodeBackgroundSizeLimit was compared to the predictive estimated result size. Estimated... 
[11:25:51] (03PS1) 10Clément Goubert: mw-on-k8s: Convert latency to seconds for display [alerts] - 10https://gerrit.wikimedia.org/r/1002398 [11:26:03] (03PS2) 10Hnowlan: admin: add jaz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1002397 (https://phabricator.wikimedia.org/T356917) [11:26:44] (03PS1) 10Clément Goubert: mediawiki: Allow setting deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002399 [11:30:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:31:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (10hnowlan) 05In progress→03Stalled Moving to stalled pending approval from an analytics-privatedata-users owner (@odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric) [11:39:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10hnowlan) 05In progress→03Stalled Moving to stalled pending approval from an analytics-privatedata-users owner (@odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric) [11:42:51] (03Abandoned) 10Clément Goubert: utils: Simple dblist_to_urllist.py script [puppet] - 10https://gerrit.wikimedia.org/r/923591 (owner: 10Clément Goubert) [11:43:16] (03PS1) 10Volans: ssh client config: add support for OOB network [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1002401 [11:43:40] (03CR) 10Volans: "Context: https://wikitech.wikimedia.org/wiki/Out-of-band_network" [debs/wmf-sre-laptop] - 
10https://gerrit.wikimedia.org/r/1002401 (owner: 10Volans) [11:44:29] (03Abandoned) 10Clément Goubert: httpbb: Migrate to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/993710 (https://phabricator.wikimedia.org/T356054) (owner: 10Clément Goubert) [11:45:17] jouncebot: now [11:45:17] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1100) [11:45:25] (03Abandoned) 10Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:45:29] (03Restored) 10Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:45:31] * Lucas_WMDE starts some long-running maintenance scripts [11:45:35] (03Abandoned) 10Clément Goubert: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [11:45:43] (03CR) 10Filippo Giunchedi: [C: 03+1] mw-on-k8s: Convert latency to seconds for display [alerts] - 10https://gerrit.wikimedia.org/r/1002398 (owner: 10Clément Goubert) [11:46:03] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki viwiki --current --all --touched-after=20230613000000 2>&1 | tee ~/T315510-viwiki # in tmux [11:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:21] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Convert latency to seconds for display [alerts] - 10https://gerrit.wikimedia.org/r/1002398 (owner: 10Clément Goubert) [11:47:29] (03Merged) 10jenkins-bot: mw-on-k8s: Convert latency to seconds for display [alerts] - 
10https://gerrit.wikimedia.org/r/1002398 (owner: 10Clément Goubert) [11:48:04] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki frwiki --current --all --touched-after=20230613000000 --start '["7544396"]' 2>&1 | tee ~/T315510-frwiki # in tmux [11:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:34] (03CR) 10Clément Goubert: [C: 03+1] service mesh: Listen on IPv6 too (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 (owner: 10Alexandros Kosiaris) [11:55:10] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [11:56:39] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) @colewhite /srv/opensearch looks related to https://phabricator.wikimedia.org/T314098 possibly? Is it still needed? If not, can you please remove it? [11:59:20] is gerrit reachable for anyone else?
[11:59:42] I can ping it but not load it in the browser [11:59:45] Down here too taavi [11:59:47] indeed [11:59:51] confirm down here [12:01:20] can someone try to restart the service or contact releng [12:01:25] hashar: ^ [12:01:31] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:40] !log taavi@gerrit1003 ~ $ sudo systemctl restart apache2 [12:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:49] there it is the confirmation [12:01:57] back up for me [12:02:00] it is back [12:02:12] going on a meeting, will look at logs later [12:02:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 36236 [12:04:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36236 [12:04:11] (03CR) 10Clément Goubert: [C: 03+1] "I think this CR and subsequent should be disconnected from the previous one bumping mediawiki, it introduces quite a lot of noise in the d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 (owner: 10Alexandros Kosiaris) [12:04:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:05] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [12:06:31] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:13:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1042.eqiad.wmnet with OS bullseye [12:13:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1042.eqiad.wmnet with OS bullseye completed: - restbase1042 (**WARN**) - R... [12:13:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:13:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1041.eqiad.wmnet with OS bullseye [12:13:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1041.eqiad.wmnet with OS bullseye completed: - restbase1041 (**WARN**) - R... 
[12:13:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:13:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1039.eqiad.wmnet with OS bullseye [12:13:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1039.eqiad.wmnet with OS bullseye completed: - restbase1039 (**WARN**) - R... [12:13:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:13:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1038.eqiad.wmnet with OS bullseye [12:13:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1038.eqiad.wmnet with OS bullseye completed: - restbase1038 (**WARN**) - R... [12:13:54] (03PS1) 10Majavah: Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names [dns] - 10https://gerrit.wikimedia.org/r/1002446 (https://phabricator.wikimedia.org/T346426) [12:14:10] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [12:14:13] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:14:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:14:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1037.eqiad.wmnet with OS bullseye [12:14:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1037.eqiad.wmnet with OS bullseye completed: - restbase1037 (**WARN**) - R... [12:14:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:14:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1034.eqiad.wmnet with OS bullseye [12:14:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1034.eqiad.wmnet with OS bullseye completed: - restbase1034 (**WARN**) - R... [12:14:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:14:40] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:14:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1040.eqiad.wmnet with OS bullseye [12:14:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1040.eqiad.wmnet with OS bullseye completed: - restbase1040 (**WARN**) - R... [12:14:51] (03CR) 10Majavah: [C: 03+1] "https://phabricator.wikimedia.org/T346426#9533557" [puppet] - 10https://gerrit.wikimedia.org/r/998780 (https://phabricator.wikimedia.org/T346426) (owner: 10Arturo Borrero Gonzalez) [12:14:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED [12:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:17:18] !log brouberol@cumin1002 START - Cookbook sre.puppet.renew-cert for stat1005.eqiad.wmnet: Renew puppet certificate - brouberol@cumin1002 [12:17:40] (03CR) 10Ayounsi: [C: 03+1] ssh client config: add support for OOB network [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1002401 (owner: 10Volans) [12:19:04] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for stat1005.eqiad.wmnet: Renew puppet certificate - brouberol@cumin1002 [12:19:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1002401 (owner: 10Volans) [12:19:49] (03CR) 10Muehlenhoff: [C: 03+1] "You can just merge, there are some small changes, I'll probably make a new deb in the next weeks." 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1002401 (owner: 10Volans) [12:20:35] (03CR) 10Volans: [V: 03+2 C: 03+2] "Ack, thanks!" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1002401 (owner: 10Volans) [12:20:48] (PuppetCertificateAboutToExpire) resolved: Puppet CA certificate stat1005.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:21:05] brouberol: fixed ^^^ :) [12:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:21:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: drop temporal NAT for legacy DNS resolvers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998780 (https://phabricator.wikimedia.org/T346426) (owner: 10Arturo Borrero Gonzalez) [12:26:15] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) /srv/megacli contains an old release of megacli from 2014 along with some legacy shared libs and script which wgets these files from a server called b...
[12:29:22] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [12:29:38] (03PS1) 10Slyngshede: PuppetPendingCertificateRequest linting fails in production [alerts] - 10https://gerrit.wikimedia.org/r/1002453 [12:29:46] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002409 [12:33:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [12:33:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [12:41:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1002397 (https://phabricator.wikimedia.org/T356917) (owner: 10Hnowlan) [12:43:06] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [12:47:45] (03PS1) 10Jaime Nuche: support Zuul v2 on bullseye contint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1002461 [12:47:54] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [12:48:14] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [12:52:51] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [12:55:49] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2001-dev.codfw.wmnet [12:57:06] (03PS1) 10Brouberol: idp: Register superset and superset-next IDP services [puppet] - 
10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) [13:08:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:09:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:11:08] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Undeploy topic match mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912959 (https://phabricator.wikimedia.org/T335205) (owner: 10Kosta Harlan) [13:12:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:13:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:14:22] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:21:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:27:43] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2001-dev.codfw.wmnet [13:28:04] (03PS1) 10Jelto: bump changelog to 1.9.7 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002468 (https://phabricator.wikimedia.org/T316421) [13:28:06] (03PS1) 10Jelto: bump nodejs and npm version [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) [13:31:28] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:31:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] bump changelog to 1.9.7 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002468 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [13:32:04] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:33:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "2 minor comments (typos). Otherwise LGTM, once the typos are fixed, consider it +1ed." 
[debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [13:34:24] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:34:44] (03PS2) 10Jelto: bump nodejs and npm version [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) [13:34:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:35:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] cache.mcrouter: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000032 (owner: 10Effie Mouzeli) [13:35:17] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [13:35:45] (03CR) 10Jelto: bump nodejs and npm version (032 comments) [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [13:36:13] jouncebot: nowandntext [13:36:25] jouncebot: nowandnext [13:36:25] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [13:36:26] In 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1400) [13:36:36] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to old pagelinks schema in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999084 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [13:37:22] (03Merged) 10jenkins-bot: Stop writing to old pagelinks schema in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999084 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [13:38:13] (03CR) 10Jelto: [C: 03+2] bump changelog to 1.9.7 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002468 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [13:38:33] (03CR) 10Jelto: [C: 
03+2] bump nodejs and npm version [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [13:39:00] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:999084|Stop writing to old pagelinks schema in s4 (T352010)]] [13:39:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:43:05] (03CR) 10Filippo Giunchedi: [C: 03+1] PuppetPendingCertificateRequest linting fails in production [alerts] - 10https://gerrit.wikimedia.org/r/1002453 (owner: 10Slyngshede) [13:44:02] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2004-dev.codfw.wmnet [13:44:15] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet [13:47:52] jouncebot: next [13:47:52] In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1400) [13:48:12] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10Ladsgroup) @WhatamIdoing Welcome to the mess of thumbnailing sizes in mediawiki. I wrote in more details in T211661#8377883. TLDR: That pregen sizes ar... 
[13:49:20] (03PS2) 10Brouberol: idp: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) [13:49:49] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:999084|Stop writing to old pagelinks schema in s4 (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:50:36] (03PS3) 10Anzx: mywiki: create portal and draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) [13:50:52] (03PS5) 10Anzx: uzwiki: remove temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992379 (https://phabricator.wikimedia.org/T353723) [13:55:04] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:57:47] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2005-dev.codfw.wmnet [13:58:23] (03PS205) 10Arnaudb: mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) [13:58:50] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2037 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1400). [14:00:04] hubaishan and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:17] * anzx o/ [14:01:24] I'm here too, waiting for the backport to be done though then proceeding to flip over Grafana with denisse [14:01:33] Here as well. [14:02:13] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:999084|Stop writing to old pagelinks schema in s4 (T352010)]] (duration: 23m 12s) [14:02:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:04:44] no one for the backport ? Lucas_WMDE maybe ? [14:06:46] I was in an overrunning meeting but I have some time now [14:06:48] guess I’m deploying then [14:08:15] hubaishan, anzx: are you around? [14:08:19] thank you! I have no stake in the patch except waiting for the backport to be done :D [14:08:22] yes [14:09:15] o/ [14:09:33] (03PS2) 10Lucas Werkmeister (WMDE): Set $wgMinervaEnableSiteNotice for arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000601 (https://phabricator.wikimedia.org/T356460) (owner: 10Hubaishan) [14:09:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000601 (https://phabricator.wikimedia.org/T356460) (owner: 10Hubaishan) [14:10:20] since https://www.mediawiki.org/wiki/Manual:Interface/Sitenotice#Mobile says $wgMinervaEnableSiteNotice defaults to true, I wonder if we should change the default to true in production too 🤔 [14:10:23] but that’s definitely a bigger question [14:10:32] (and I don’t think I can be bothered to pursue it ^^) [14:10:41] (03Merged) 10jenkins-bot: Set $wgMinervaEnableSiteNotice for arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000601 (https://phabricator.wikimedia.org/T356460) (owner: 10Hubaishan) [14:10:55] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1000601|Set $wgMinervaEnableSiteNotice for arwikisource (T356460)]] [14:10:59] T356460: set $wgMinervaEnableSiteNotice to true in arwikisource - 
https://phabricator.wikimedia.org/T356460 [14:12:21] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and hubaishan: Backport for [[gerrit:1000601|Set $wgMinervaEnableSiteNotice for arwikisource (T356460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:31] hubaishan: please test on mwdebug :) [14:12:59] it's OK [14:13:21] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and hubaishan: Continuing with sync [14:17:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:18:23] (03CR) 10Muehlenhoff: bump nodejs and npm version (032 comments) [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:18:38] anzx: is namespaceDupes fixed now? [14:19:01] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/998858 seems like it [14:19:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:48] but that change isn’t in any train branch yet, as far as I can tell [14:20:00] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1000601|Set $wgMinervaEnableSiteNotice for arwikisource (T356460)]] (duration: 09m 05s) [14:20:06] T356460: set $wgMinervaEnableSiteNotice to true in arwikisource - https://phabricator.wikimedia.org/T356460 [14:20:41] Lucas_WMDE: ok you can skip namespace change , will schedule it for later [14:20:46] suggestion: deploy the uzwiki and mywiki changes; see if namespaceDupes crashes; if yes, backport the core change and see if that fixes it [14:21:18] though that might take more than the rest of the 
hour [14:21:27] (03PS6) 10Lucas Werkmeister (WMDE): uzwiki: remove temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992379 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [14:22:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992379 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [14:22:08] eh, let’s just do uzwiki then [14:22:13] Ok [14:22:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:22:32] there’s a lot of ”Received cirrusSearchElasticaWrite job for an unwritable cluster cloudelastic” in logspam-watch [14:22:38] * Lucas_WMDE looks if that’s on the job runners or unrelated to the alert [14:22:40] oh, scratch that [14:22:42] the alert *resolvedo [14:22:45] * *resolved* [14:22:49] (03Merged) 10jenkins-bot: uzwiki: remove temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992379 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [14:23:04] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992379|uzwiki: remove temporary logo files (T353723)]] [14:23:09] T353723: Requesting temporary logo change for uz.wikipedia.org - https://phabricator.wikimedia.org/T353723 [14:23:18] yeah, those errors sure were on kube-mw-jobrunner [14:23:20] but they went away again [14:23:56] apparently they were also all on commonswiki [14:24:16] (03PS2) 10Raymond Ndibe: [domainproxy]: increase client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) [14:24:24] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:992379|uzwiki: remove temporary logo files (T353723)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:03] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/998986 (owner: 10Giuseppe Lavagetto) [14:25:19] I get a 404 on https://en.wikipedia.org/static/images/project-logos/uzwiki-birthday.png with mwdebug, at least [14:25:32] anzx: does it look okay for you as well? [14:26:11] Lucas_WMDE: yeah all image links give 404 [14:26:30] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Continuing with sync [14:26:32] ok [14:26:56] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [14:26:56] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.31 ms [14:27:00] RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.09 ms [14:27:00] PROBLEM - BFD status on lsw1-a4-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:27:20] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [14:27:32] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.72 ms [14:27:48] !log installing Linux 6.1.76 on Bookworm hosts [14:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:04] PROBLEM - BFD status on lsw1-b7-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:28:49] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:00] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2037 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:29:35] (03CR) 10Raymond Ndibe: "If the question is for me David the answer is no. 
I only know of the harbor related ones" [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [14:32:57] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992379|uzwiki: remove temporary logo files (T353723)]] (duration: 09m 53s) [14:33:04] T353723: Requesting temporary logo change for uz.wikipedia.org - https://phabricator.wikimedia.org/T353723 [14:33:30] alright, then I think I’m done for now [14:33:39] !log UTC afternoon backport+config window done [14:33:41] godog: over to you [14:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:17] Here. [14:34:38] thank you Lucas_WMDE ! [14:35:10] denisse: feel free to start any time [14:36:24] !log starting Upgrade Grafana hosts to Bookworm - T352665 [14:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:37] T352665: Upgrade Grafana hosts to Bookworm - https://phabricator.wikimedia.org/T352665 [14:38:16] (03CR) 10Andrea Denisse: [C: 03+2] grafana: Failover from grafana1002 to grafana2001 [puppet] - 10https://gerrit.wikimedia.org/r/992710 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [14:38:38] (03CR) 10Andrea Denisse: [C: 03+2] grafana: Ensure user traffic goes to grafana2001 [puppet] - 10https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [14:38:49] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:24] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133 (10MoritzMuehlenhoff) [14:41:55] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - 
https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) There were multiple image files used to install RIPE anchors in /srv/firmware, these are not needed any longer and have been removed. [14:43:55] (03PS1) 10Jelto: fix typo, set debhelper-compat in control file [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002566 (https://phabricator.wikimedia.org/T316421) [14:44:29] (03CR) 10Jelto: [C: 03+2] bump nodejs and npm version (032 comments) [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002469 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:45:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002566 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:45:38] (03CR) 10Hnowlan: [C: 03+2] admin: add jaz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1002397 (https://phabricator.wikimedia.org/T356917) (owner: 10Hnowlan) [14:47:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2180.codfw.wmnet onto db2194.codfw.wmnet [14:47:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (10hnowlan) [14:47:49] !log Completed failover to grafana2001 - T352665 [14:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:06] T352665: Upgrade Grafana hosts to Bookworm - https://phabricator.wikimedia.org/T352665 [14:49:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (10hnowlan) 05In progress→03Resolved a:03hnowlan User `jaz` has been added to analytics-privatedata-users, you should be able to use superset with t... 
[14:49:32] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [14:51:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bookworm [14:52:28] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:53:06] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:53:08] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10TheDJ) >>! In T355914#9532098, @WhatamIdoing wrote: > Are we really never pregenerating the thumbnail sizes that get used in practice? They almost alw... [14:53:36] (03PS1) 10Arnaudb: mariadb: disable systematic formatting of /srv [puppet] - 10https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) [14:54:46] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:55:22] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:01] (03CR) 10Effie Mouzeli: mcrouter: add chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:56:12] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [14:56:17] (03CR) 10Effie Mouzeli: mcrouter: add chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:56:30] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single 
for host cloudservices2005-dev.codfw.wmnet [14:58:49] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:23] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:59:54] (03PS1) 10Filippo Giunchedi: hieradata: move grafana-next from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1002569 (https://phabricator.wikimedia.org/T352665) [14:59:59] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:00:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names [dns] - 10https://gerrit.wikimedia.org/r/1002446 (https://phabricator.wikimedia.org/T346426) (owner: 10Majavah) [15:00:38] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1002569 (https://phabricator.wikimedia.org/T352665) (owner: 10Filippo Giunchedi) [15:00:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[dns] - 10https://gerrit.wikimedia.org/r/1002446 (https://phabricator.wikimedia.org/T346426) (owner: 10Majavah) [15:00:57] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] hieradata: move grafana-next from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1002569 (https://phabricator.wikimedia.org/T352665) (owner: 10Filippo Giunchedi) [15:02:01] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:25] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:02:25] 10SRE, 10SRE-swift-storage, 10Commons, 10User-ArielGlenn: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822 (10Bugreporter) [15:03:01] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet [15:06:56] (03CR) 10Arnaudb: [C: 03+1] sessionstore: setup sessionstore200[4-6] (new) [deployment-charts] - 10https://gerrit.wikimedia.org/r/998538 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [15:07:21] !log Reimage Standby Host (grafana1002) - T352665 [15:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:36] T352665: Upgrade Grafana hosts to Bookworm - https://phabricator.wikimedia.org/T352665 [15:08:06] !log denisse@cumin2002 START - Cookbook sre.hosts.reimage for host grafana1002.eqiad.wmnet with OS bookworm [15:08:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56679 and previous config saved to /var/cache/conftool/dbconfig/20240212-150810-arnaudb.json [15:11:10] (03CR) 10MVernon: [C: 03+1] sessionstore: setup sessionstore200[4-6] (new) [deployment-charts] - 10https://gerrit.wikimedia.org/r/998538 
(https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [15:11:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [15:14:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [15:15:34] (03CR) 10Jelto: [C: 03+2] fix typo, set debhelper-compat in control file [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1002566 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:16:52] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on grafana1002.eqiad.wmnet with reason: host reimage [15:19:40] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on grafana1002.eqiad.wmnet with reason: host reimage [15:20:47] 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10joanna_borun) p:05Triage→03Medium [15:23:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56680 and previous config saved to /var/cache/conftool/dbconfig/20240212-152315-arnaudb.json [15:24:13] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410 (10joanna_borun) p:05Triage→03Medium [15:25:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [15:26:21] (03CR) 10Filippo Giunchedi: [C: 03+1] SystemdUnitFailed: remove 'Failed' from alert text [alerts] - 10https://gerrit.wikimedia.org/r/998545 (owner: 10Herron) [15:28:46] (03CR) 10Majavah: [C: 03+2] Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names [dns] - 
10https://gerrit.wikimedia.org/r/1002446 (https://phabricator.wikimedia.org/T346426) (owner: 10Majavah) [15:29:20] 10SRE-tools, 10Infrastructure-Foundations: Upgrade BGPAlerter to 1.33 - https://phabricator.wikimedia.org/T354998 (10joanna_borun) p:05Triage→03Low [15:30:23] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:32:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] mediawiki: Allow setting deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002399 (owner: 10Clément Goubert) [15:34:03] jouncebot: now [15:34:04] No deployments scheduled for the next 0 hour(s) and 55 minute(s) [15:34:22] does anyone mind if I backport some Wikibase(Lexeme) changes? [15:34:32] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host grafana1002.eqiad.wmnet with OS bookworm [15:36:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bookworm [15:36:34] !log Failover Back to grafana1002 - T352665 [15:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:38] T352665: Upgrade Grafana hosts to Bookworm - https://phabricator.wikimedia.org/T352665 [15:36:44] 10SRE-tools, 10Data-Persistence, 10Infrastructure-Foundations, 10Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10joanna_borun) p:05Triage→03Medium [15:37:17] 10SRE, 10Infrastructure-Foundations, 10User-aborrero: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10aborrero) [15:38:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 30%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56681 and previous config saved to /var/cache/conftool/dbconfig/20240212-153820-arnaudb.json [15:39:04] (03CR) 
10Alexandros Kosiaris: "Good point, doing so." [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 (owner: 10Alexandros Kosiaris) [15:39:09] (03PS1) 10Andrea Denisse: Revert "grafana: Failover from grafana1002 to grafana2001" [puppet] - 10https://gerrit.wikimedia.org/r/1002428 [15:40:55] (03PS1) 10Andrea Denisse: Revert "grafana: Ensure user traffic goes to grafana2001" [puppet] - 10https://gerrit.wikimedia.org/r/1002429 [15:41:08] (03CR) 10Andrea Denisse: [C: 03+2] Revert "grafana: Failover from grafana1002 to grafana2001" [puppet] - 10https://gerrit.wikimedia.org/r/1002428 (owner: 10Andrea Denisse) [15:41:12] (03CR) 10Eevans: [C: 03+2] sessionstore: setup sessionstore200[4-6] (new) [deployment-charts] - 10https://gerrit.wikimedia.org/r/998538 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [15:41:35] (03PS2) 10Alexandros Kosiaris: service mesh: Listen on IPv6 too (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 [15:41:37] (03PS4) 10Alexandros Kosiaris: service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) [15:41:39] (03PS4) 10Alexandros Kosiaris: termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) [15:41:41] (03PS1) 10Andrea Denisse: Revert "hieradata: move grafana-next from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1002430 [15:42:22] (03CR) 10Andrea Denisse: [C: 03+2] Revert "grafana: Ensure user traffic goes to grafana2001" [puppet] - 10https://gerrit.wikimedia.org/r/1002429 (owner: 10Andrea Denisse) [15:42:28] (03Merged) 10jenkins-bot: sessionstore: setup sessionstore200[4-6] (new) [deployment-charts] - 10https://gerrit.wikimedia.org/r/998538 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [15:42:55] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of 
db2169.codfw.wmnet onto db2194.codfw.wmnet [15:43:35] (03CR) 10Andrea Denisse: [C: 03+2] Revert "hieradata: move grafana-next from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1002430 (owner: 10Andrea Denisse) [15:44:04] (03PS1) 10Lucas Werkmeister (WMDE): Disable JSON Dump tests to prepare for schema change in Wikibase [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002431 (https://phabricator.wikimedia.org/T305660) [15:44:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli) [15:44:11] (03PS1) 10Lucas Werkmeister (WMDE): Return stdClass/Object from Serializers for empty lists [extensions/Wikibase] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002432 (https://phabricator.wikimedia.org/T305660) [15:44:15] (03PS1) 10Lucas Werkmeister (WMDE): Change expected serialization format of JSON dumps to include arrays [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002433 (https://phabricator.wikimedia.org/T305660) [15:44:28] ^ I’ll backport these unless someone tells me not to :) [15:44:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This will need a helm chart version bump btw. But it should be a noop otherwise." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli) [15:45:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002431 (https://phabricator.wikimedia.org/T305660) (owner: 10Lucas Werkmeister (WMDE)) [15:45:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002432 (https://phabricator.wikimedia.org/T305660) (owner: 10Lucas Werkmeister (WMDE)) [15:45:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002433 (https://phabricator.wikimedia.org/T305660) (owner: 10Lucas Werkmeister (WMDE)) [15:45:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks for the update and fix. 
+1ing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris) [15:46:02] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:46:15] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:46:32] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:47:12] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:47:12] 10SRE, 10Infrastructure-Foundations, 10Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (10joanna_borun) a:03lmata [15:47:38] 10SRE, 10Infrastructure-Foundations, 10Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (10joanna_borun) @lmata is it still valid issue? [15:48:05] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1146 [puppet] - 10https://gerrit.wikimedia.org/r/1002406 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:48:17] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [15:48:29] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [15:48:52] 10SRE, 10observability, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10joanna_borun) [15:49:22] (03CR) 10Ladsgroup: [C: 03+1] "It's a backup source. LGTM but maybe Jaime has something to take care of before this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1000303 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:49:39] 10SRE, 10observability, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10joanna_borun) @lmata is it still valid? [15:49:43] (03CR) 10Herron: [C: 03+2] SystemdUnitFailed: remove 'Failed' from alert text [alerts] - 10https://gerrit.wikimedia.org/r/998545 (owner: 10Herron) [15:50:08] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Alerting, and 2 others: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10joanna_borun) 05Open→03Invalid [15:50:12] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1144 [puppet] - 10https://gerrit.wikimedia.org/r/1000304 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:50:45] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [15:50:49] (03Merged) 10jenkins-bot: SystemdUnitFailed: remove 'Failed' from alert text [alerts] - 10https://gerrit.wikimedia.org/r/998545 (owner: 10Herron) [15:51:01] 10SRE, 10ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357318 (10ops-monitoring-bot) [15:51:09] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [15:51:20] (03PS51) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [15:51:21] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Alerting, and 2 others: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10joanna_borun) Works with current setup if there are any outstanding issues please reopen or create... 
[15:51:25] (03PS1) 10AOkoth: vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) [15:51:30] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1145 [puppet] - 10https://gerrit.wikimedia.org/r/1000305 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:51:52] (03PS2) 10AOkoth: vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) [15:52:05] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1135 [puppet] - 10https://gerrit.wikimedia.org/r/1000301 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:53:12] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1149 [puppet] - 10https://gerrit.wikimedia.org/r/1002407 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:53:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56682 and previous config saved to /var/cache/conftool/dbconfig/20240212-155325-arnaudb.json [15:54:14] 10SRE-tools, 10Infrastructure-Foundations, 10netbox: Netbox support for svc allocation - https://phabricator.wikimedia.org/T263429 (10joanna_borun) p:05High→03Medium [15:54:46] 10SRE, 10cloud-services-team: ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531 (10joanna_borun) [15:56:33] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Puppet-Core, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145 (10joanna_borun) 05Open→03Invalid [15:56:43] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Puppet-Core, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - 
https://phabricator.wikimedia.org/T161145 (10joanna_borun) 05Invalid→03Declined [15:57:31] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10joanna_borun) 05Open→03Resolved [15:57:59] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10colewhite) >>! In T357306#9533490, @MoritzMuehlenhoff wrote: > @colewhite /srv/opensearch looks related to https://phabricator.wikimedia.org/T314098 possibly? Is is still... [15:59:16] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1146 [puppet] - 10https://gerrit.wikimedia.org/r/1002406 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:59:24] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1139 [puppet] - 10https://gerrit.wikimedia.org/r/1000302 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:59:33] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1140 [puppet] - 10https://gerrit.wikimedia.org/r/1000303 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:59:48] 10SRE, 10ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357318 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm duplicate ticket see T357015 [16:00:18] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1144 [puppet] - 10https://gerrit.wikimedia.org/r/1000304 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:01:07] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1145 [puppet] - 10https://gerrit.wikimedia.org/r/1000305 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:01:16] (03Abandoned) 10Arnaudb: mariadb: removes db1144 [puppet] - 10https://gerrit.wikimedia.org/r/1000304 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:02:41] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1135 [puppet] - 10https://gerrit.wikimedia.org/r/1000301 
(https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:02:56] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) >>! In T357306#9534321, @colewhite wrote: >>>! In T357306#9533490, @MoritzMuehlenhoff wrote: >> @colewhite /srv/opensearch looks related to https://pha... [16:03:01] (03Abandoned) 10Arnaudb: mariadb: removes db1149 [puppet] - 10https://gerrit.wikimedia.org/r/1002407 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:06:08] (03PS1) 10Arnaudb: mariadb: removes db1144 db1149 [puppet] - 10https://gerrit.wikimedia.org/r/1002411 (https://phabricator.wikimedia.org/T350458) [16:06:24] 10SRE: Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% - https://phabricator.wikimedia.org/T357198 (10Eevans) >>! In T357198#9533155, @hnowlan wrote: > Could `Too many eqiad mediawiki originals uploads` be a red herring? The traffic jumps are all in codfw. Honestly, that alert is what... 
[16:06:59] (03CR) 10Ladsgroup: [C: 03+1] mariadb: removes db1144 db1149 [puppet] - 10https://gerrit.wikimedia.org/r/1002411 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:07:32] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1144 db1149 [puppet] - 10https://gerrit.wikimedia.org/r/1002411 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:10:12] (03Merged) 10jenkins-bot: Disable JSON Dump tests to prepare for schema change in Wikibase [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002431 (https://phabricator.wikimedia.org/T305660) (owner: 10Lucas Werkmeister (WMDE)) [16:11:49] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:11:54] (03Merged) 10jenkins-bot: Return stdClass/Object from Serializers for empty lists [extensions/Wikibase] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002432 (https://phabricator.wikimedia.org/T305660) (owner: 10Lucas Werkmeister (WMDE)) [16:11:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:12:21] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes db1139 [puppet] - 10https://gerrit.wikimedia.org/r/1000302 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:12:39] (03Merged) 10jenkins-bot: Change expected serialization format of JSON dumps to include arrays [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002433 (https://phabricator.wikimedia.org/T305660) (owner: 10Lucas Werkmeister (WMDE)) [16:12:59] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1002431|Disable JSON Dump tests to prepare 
for schema change in Wikibase (T305660)]], [[gerrit:1002432|Return stdClass/Object from Serializers for empty lists (T305660)]], [[gerrit:1002433|Change expected serialization format of JSON dumps to include arrays (T305660)]] [16:13:15] T305660: [LEX] Empty senses/forms lists presentation in dump - https://phabricator.wikimedia.org/T305660 [16:13:41] (03CR) 10Ladsgroup: mariadb: disable systematic formatting of /srv (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [16:14:21] (03CR) 10Ladsgroup: [C: 03+1] mariadb: decom db1133 [puppet] - 10https://gerrit.wikimedia.org/r/1000300 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:14:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1002431|Disable JSON Dump tests to prepare for schema change in Wikibase (T305660)]], [[gerrit:1002432|Return stdClass/Object from Serializers for empty lists (T305660)]], [[gerrit:1002433|Change expected serialization format of JSON dumps to include arrays (T305660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:14:33] testing [16:14:37] (03CR) 10Arnaudb: [C: 03+2] mariadb: decom db1133 [puppet] - 10https://gerrit.wikimedia.org/r/1000300 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [16:15:21] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:16:11] seems to work fine as far as I can tell [16:16:13] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [16:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:17:24] (03CR) 10JHathaway: "Depends on the size of the file, but for each 1/2GB it adds about a half second to run the equivalent of `sha256sum`. So nothing crazy, bu" [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [16:17:27] (03CR) 10JHathaway: [C: 03+2] rsyslog: have rsyslog create its own files [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [16:18:18] (03PS2) 10Arnaudb: mariadb: disable systematic wiping of /srv on db2194 [puppet] - 10https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) [16:18:37] (03CR) 10Arnaudb: mariadb: disable systematic wiping of /srv on db2194 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [16:18:57] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10Jhancock.wm) Disk has been replaced. It appears in the idrac. 
[16:20:38] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10ABran-WMF) [16:20:40] 10SRE, 10ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357318 (10ABran-WMF) [16:21:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:22:42] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1002431|Disable JSON Dump tests to prepare for schema change in Wikibase (T305660)]], [[gerrit:1002432|Return stdClass/Object from Serializers for empty lists (T305660)]], [[gerrit:1002433|Change expected serialization format of JSON dumps to include arrays (T305660)]] (duration: 09m 42s) [16:22:55] T305660: [LEX] Empty senses/forms lists presentation in dump - https://phabricator.wikimedia.org/T305660 [16:23:03] * Lucas_WMDE done deploying [16:26:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:27:54] (03PS1) 10Ssingh: package_builder: add hook for building HAProxy 2.6 component [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) [16:29:46] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:30:04] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1630). 
nyaa~ [16:30:32] (03CR) 10Ssingh: "The component hook makes the assumption that we will be changing the Build-Depends from libssl-dev to libssl1.1-dev and that it (1.1-dev) " [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [16:32:06] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@228b93d]: (no justification provided) [16:32:44] (03PS206) 10Arnaudb: mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) [16:33:43] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002583 (https://phabricator.wikimedia.org/T128546) [16:34:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Removing instances as per T350458', diff saved to https://phabricator.wikimedia.org/P56683 and previous config saved to /var/cache/conftool/dbconfig/20240212-163407-arnaudb.json [16:34:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56684 and previous config saved to /var/cache/conftool/dbconfig/20240212-163413-arnaudb.json [16:34:22] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [16:34:46] (03PS1) 10Cwhite: profile: enforce opensearch plugin repository on apt hosts [puppet] - 10https://gerrit.wikimedia.org/r/1002412 (https://phabricator.wikimedia.org/T357306) [16:35:41] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002583 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:43] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:37:01] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1002583 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:37:19] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:37:29] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:43:06] (03CR) 10Muehlenhoff: package_builder: add hook for building HAProxy 2.6 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [16:44:17] (03PS1) 10Ebernhardson: Connection: Correct read-only detection [extensions/CirrusSearch] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002434 (https://phabricator.wikimedia.org/T354793) [16:44:21] (03PS2) 10Ssingh: package_builder: add hook for building HAProxy 2.6 component [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) [16:44:37] (03CR) 10Ssingh: package_builder: add hook for building HAProxy 2.6 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [16:46:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1002412 (https://phabricator.wikimedia.org/T357306) (owner: 10Cwhite) [16:46:18] (03Abandoned) 10Ebernhardson: cirrus: Re-enable cloudelastic writes for non-testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999962 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [16:46:49] (03PS1) 10Ssingh: wikimedia.org: lower TTLs for dyna.wm.org and upload.wm.org to 300s [dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) [16:47:04] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Clarify status of various 
directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [16:47:11] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1002582| Bumping portals to master (T128546)]] (duration: 07m 07s) [16:47:26] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:48:22] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@228b93d]: (no justification provided) (duration: 16m 16s) [16:48:25] (03CR) 10Dzahn: [C: 03+2] miscweb: bump bugzilla to version 2024-02-09-201707 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000025 (https://phabricator.wikimedia.org/T317436) (owner: 10Dzahn) [16:49:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56685 and previous config saved to /var/cache/conftool/dbconfig/20240212-164918-arnaudb.json [16:49:28] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) [16:49:35] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 2 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) 05Resolved→03Open [16:49:42] (03Merged) 10jenkins-bot: miscweb: bump bugzilla to version 2024-02-09-201707 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000025 (https://phabricator.wikimedia.org/T317436) (owner: 10Dzahn) [16:49:54] (03CR) 10Ssingh: "The plan is to merge this on Wednesday, assuming no other objections in the meantime." 
[dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) (owner: 10Ssingh) [16:54:12] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1002582| Bumping portals to master (T128546)]] (duration: 07m 00s) [16:54:23] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:56:29] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:57:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 4.508% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:57:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] "It's a copy/paste patch, +1ed already, I 'll merge in the interest to trim a bit my outgoing queue." [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 (owner: 10Alexandros Kosiaris) [16:57:49] (03CR) 10Dzahn: [C: 03+2] "@Daniel Souza - I just deployed this change because I was deploying a change to another site within the miscweb service and it showed up i" [deployment-charts] - 10https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [16:58:20] (03Merged) 10jenkins-bot: service mesh: Listen on IPv6 too (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 (owner: 10Alexandros Kosiaris) [16:59:36] 10SRE, 10Infrastructure-Foundations, 10Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (10lmata) 05Open→03Resolved >>! In T240906#9534209, @joanna_borun wrote: > @lmata is it still valid issue? it shouldn't be, watchmouse has be... 
[17:00:20] Erm what's with the jobs [17:01:27] (03CR) 10Dzahn: "Could you add something to the commit message or the ticket about this? I don't really see context why we are doing that and the relation " [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [17:02:05] (03CR) 10Cwhite: [C: 03+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1002412/1359/" [puppet] - 10https://gerrit.wikimedia.org/r/1002412 (https://phabricator.wikimedia.org/T357306) (owner: 10Cwhite) [17:02:19] (03CR) 10Dzahn: [C: 03+1] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [17:02:44] hnowlan: nemo-yiannis: I think something happened (possibly a merge?) that is causing a huge uptick of parsoidCachePrewarm jobs [17:03:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [17:03:21] Well huge... x2 or so [17:03:24] hmm, looking [17:03:52] claime: I indeed deployed a change on restbase to send traffic directly to MW parsoid instead of storage [17:03:57] *restbase storage [17:04:09] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1002580 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [17:04:12] ok, so we need more jobrunner replicas I think [17:04:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: testing db2194 done', diff saved to https://phabricator.wikimedia.org/P56686 and previous config saved to /var/cache/conftool/dbconfig/20240212-170423-arnaudb.json [17:04:24] or we need to roll back [17:04:35] this has doubled saturation on the jobrunner cluster [17:04:51] i can rollback [17:05:22] we're processing a lot more recordLint as well [17:05:26] nemo-yiannis: is the restbase change supposed to generate more parsoidCachePrewarm jobs? 
[17:05:33] This is what i am trying to think [17:05:46] when do we trigger parsoidCachePrewarm ? [17:06:00] Also the uptick is around ~20mins after the change [17:06:06] I was hoping you'd know :) Daniel created it I think [17:06:48] the spike starts around 5 minutes after the deploy finished [17:07:09] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:07:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 9.426% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:07:31] we send a parsoidCachePrewarm after each cache update [17:07:35] (03CR) 10Dzahn: [C: 03+2] "well.. I tried to apply it on staging cluster but after running the helmfile command it just hangs now..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [17:07:36] ok [17:07:41] makes sense [17:07:58] the traffic we send triggers more cache misses ? [17:08:04] 10SRE, 10Wikimedia-Mailing-lists: Email List request - https://phabricator.wikimedia.org/T357326 (10Ladsgroup) 05Open→03Declined We don't create mailing lists for such cases. See https://meta.wikimedia.org/wiki/Mailing_lists [17:08:07] (03CR) 10Dzahn: [C: 03+2] "FAILED RELEASES:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [17:08:12] cc duesen [17:09:26] should I revert ? claime hnowlan ? 
[17:10:15] not immediately, I do see it trending downwards, but not fast enough that I think we'll be in an agreeable state [17:10:42] at first i also noticed a bump in latency on envoy-telemetry but it went down again [17:10:48] We could scale up, but this increase is more or less equivalent to the size of every job we run on the job runners bar the two biggest ones [17:11:16] i think we should prioritize disabling change-prop pregeneration that also contributes to traffic [17:11:26] What's strange to me is that it's not that many job, but they must be really hungry [17:11:43] currently we have both live traffic hitting MW + changeprop pregeneration so we dont corrupt cassandra state in case we need to rollback [17:12:22] It's like 300 jobs per second, which is really not that much but they seem to tie up a lot of workers [17:13:46] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:13:48] nemo-yiannis: hnowlan: it's not obvious to me how pregen could cause async render jobs. when pregen hits MW for fetching HTML, it should render immeidately, not trigger jobs. [17:14:04] these jobs are reasonably expensive but it seems like they have effectively doubled in the cost since 16:54 https://grafana.wikimedia.org/d/I_uNybHIz/hnowlan-jobqueue-jobrunner-sketches?orgId=1&from=1707754414522&to=1707758014522&viewPanel=2 [17:14:24] How well does the increase line up with the deployment? Could it be a coincidence? 
[17:14:45] not perfectly but it's very close [17:15:32] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:15:58] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:16:11] FWIW we also had a bump at the same diagram yesterday [17:16:24] yeah I am seeing occasional significant spikes of similar impact [17:16:38] https://grafana.wikimedia.org/goto/Bgv5CJhIz?orgId=1 [17:17:18] let's let it sit anyway and see what happens [17:17:20] Staring at the code, possible triggers for parsoidCachePrewarm are: page edit, page view with stale parser cache, action=purge, and action=undelete. [17:17:50] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:18:08] Maybe the effect is secopndary? Pregen is causing load on the same servers that are handling the jobs? Is it the same servers? [17:18:49] unless it's done via another job, no [17:18:50] I see a tropling of p75 on mw-api-int at the same time, which would be caused by hitting the core mw parser [17:18:51] Is parsoidCachePrewarm routed to the parsoid render cluster? Or is it handled by general job runners? [17:18:56] tripling* [17:19:10] parsoidCachePrewarm is handled by mw-jobrunner on k8s [17:19:26] like all jobs now except a couple [17:19:38] mw-api-int? what's that? [17:19:51] duesen: dedicated internal api deployment of mw-on-k8s [17:20:23] would internal requests to rest.php go there? [17:20:26] yes [17:20:40] win 55 [17:20:57] ok. yes, more hits there are expected. also, more PC activity is expected. [17:20:57] I'm really confused about the jobs, though [17:21:36] (03PS3) 10AOkoth: vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) [17:21:54] nemo-yiannis: when you said sending traffic to mw parsoid, is that the parsoid cluster, or is it hitting mw-api-int? 
[17:22:12] i think parsoid cluster [17:22:48] (03CR) 10CI reject: [V: 04-1] vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [17:23:04] duesen: Parsoid cluster and MW have different parsercaches right ? [17:23:45] nemo-yiannis: no, parsoid and the old parser have separate caches. the clusters share caches. [17:23:51] ok [17:24:16] (03PS4) 10AOkoth: vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) [17:24:21] So, zooming on on the job insertion graph, it seems like the current spike isn't that unusual: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqiad+prometheus%2Fk8s&var-job=parsoidCachePrewarm&from=1705166593859&to=1707758593859 [17:25:02] (03PS5) 10Dzahn: vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [17:25:10] There's also an increase in RecordLint jobs which is much more unusual [17:25:36] (03CR) 10Dzahn: [C: 03+1] vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [17:25:44] nemo-yiannis: (*) PC has several layers, the *bottom* layer (mariadb) is shared across clusters. [17:25:51] recordlint has since dropped off a bit [17:26:12] (03CR) 10CI reject: [V: 04-1] vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [17:26:43] hnowlan: still way over baseline right? 
[17:27:09] (03PS6) 10Dzahn: vrts: increase envoy timeout for vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002577 (https://phabricator.wikimedia.org/T355980) (owner: 10AOkoth) [17:28:17] claime: yeah there was a big spike and a backlog built up, but that's been cleared so I'd expect it to calm down (while having no good explanation for the spike in the first place) [17:29:54] hnowlan: the insertion rate is still around 1k/s but yeah, run duration has stabilized [17:30:35] I know recordLint is triggered by parsoid, which would make sense in terms of the two happening at the same time I guess [17:30:44] (03CR) 10Dzahn: [V: 03+1 C: 03+2] ci: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/998848 (owner: 10Muehlenhoff) [17:30:46] ah yeah, makes sense [17:31:05] I suspect this could be because there were a bunch of pages that were stored in RESTBase (but never edited much, so were not in ParserCache), so when you turned off storage, all those now hit ParserCache. [17:31:26] Those correspond to both increased request rate (wt2html) and associated lint jobs (multiple lints per page). [17:31:35] but, stabilizes once those pages enter PC. [17:31:39] that is my theory. [17:31:56] subbu: sounds reasonable - would that explain the prewarm jobs also? [17:32:01] (gotta go offline now to head to the airport ... but just some drie by comments). [17:32:05] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "deployed on contint1002 (not active host) first, puppet disabled on contint2002" [puppet] - 10https://gerrit.wikimedia.org/r/998848 (owner: 10Muehlenhoff) [17:32:33] duesen may have a better understanding of prewarm jobs .... i need to page in all the details. 
[17:32:45] recordlint spikes during the deploy rather than after [17:32:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "the only change is the order of IPs, like in:" [puppet] - 10https://gerrit.wikimedia.org/r/998848 (owner: 10Muehlenhoff) [17:33:01] so I guess that could be unrelated to the prewarm, but also more directly correlated with the change [17:33:26] yes .. makes sense. [17:33:38] off now. check in later. [17:35:22] I'm tempted to babysit this for another 30m or so and make a call on rollback then if you're still about nemo-yiannis/duesen [17:35:33] i will be around [17:35:37] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad2001.codfw.wmnet [17:35:38] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:35:40] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10WhatamIdoing) Right. It's a mess, the numbers need to be changed, and nobody will want to be responsible, e.g., for "taking away" the 320px and substi... [17:37:28] (03PS1) 10Hnowlan: mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002590 [17:37:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [17:38:41] (03CR) 10Clément Goubert: [C: 03+1] mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002590 (owner: 10Hnowlan) [17:38:53] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002590 (owner: 10Hnowlan) [17:39:22] hnowlan: which diagram are you checking? 
[17:39:49] nemo-yiannis: the primary source of worry is https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-release=main&var-container_name=All&var-site=&from=now-3h&to=now&viewPanel=84&refresh=30s [17:39:53] ok [17:39:59] (03Merged) 10jenkins-bot: mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002590 (owner: 10Hnowlan) [17:40:40] * duesen still has no theory, linting shouldn't cause prewarm jobs [17:40:53] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-3h&to=now&viewPanel=21 recordlint and parsoidcacheprewarm levels here are most likely the source of the impact [17:41:31] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [17:41:38] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [17:42:00] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [17:42:09] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [17:43:21] duesen: no hard evidence that they're linked, but notable timing. Regardless the recordlint doubling+ is also causing an impact at a bad time [17:43:27] it'd be nice if subbu's theory is right [17:44:04] we've just added capacity and the saturation isn't worrisome any more but it would still be nice to see a drop-off [17:44:51] So currently the only wiki that is still backed by cassandra on restbase is enwiki [17:45:09] Is there something we need to make sure before switching enwiki too ? [17:45:10] the insertion rate of recordLint isn't backing off, that's what worries me [17:45:20] Is there a way to get a breakdown of the cause-action of the jobs? 
[17:45:37] That would allow us to narrow down their source [17:46:24] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [17:47:25] (SystemdUnitFailed) firing: envoyproxy.service on mw2388:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:42] duesen: not afaik [17:50:32] not sure if this is causing any problems but there's a big increase in slow parsing on jobs btw https://logstash.wikimedia.org/goto/be31e8f4fb1a29ed632f3a030a62e272 [17:51:09] but that would also explain part of the problem, existing parse jobs are slowing down in addition to there being more of them [17:52:25] (SystemdUnitFailed) resolved: envoyproxy.service on mw2388:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:47] It's lying ^ [17:52:55] The ones I just clicked on are all enwiki. [17:53:03] And we didn't disable caching for enwiki. 
[17:53:20] I *think* we had something related to linting in our team meeting [17:53:30] let me check in with the team [17:53:42] again, unrelated diffs when running cookbooks, it's somewhat frustrating since I don't want to merge other changes and also not abort my own [17:55:17] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [17:55:17] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:55:34] About 88% of the slow-parsoid errors are coming from enwiki [17:55:59] I'm not saying the change caused those errors [17:56:08] If we could see where all the prewarm jobs are coming from, that would be nice... [17:56:23] But there is a slowdown [17:56:25] (SystemdUnitFailed) firing: envoyproxy.service on mw2388:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:57:06] hnowlan: yea... it looks like there is more than one thing going on. Or a ripple effect. [17:57:16] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:57:37] jhathaway: I think I32a06131094c6d01db74b7644a7e591722932674 is causing envoy to fail on the above mw node [17:57:43] at the least we currently think that the increase in volume and timing of recordLint *is* related [17:57:43] but not on others [17:57:46] right? [17:58:37] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:58:37] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [17:58:40] jhathaway: ah maybe just a coincidence [17:58:41] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [17:58:52] claime happy to take a look [17:59:01] jhathaway: could you? 
I was about to head off [17:59:06] I'll depool it [17:59:07] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [17:59:11] sounds good [17:59:50] I'll go back to digging around code. Ping me on slack if you need me. IRCCloud notifications don't really work unfortunately. [17:59:58] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10hashar) To give a bit of historical perspective, using larger thumbnail was a frequent request a decade ago. From when I was in the team managing those... [18:00:04] huh it's depooled already [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1800) [18:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T1800). [18:01:16] claime: weird, wasn't me [18:01:25] (SystemdUnitFailed) resolved: envoyproxy.service on mw2388:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:01:32] duesen: could you confirm my theory there at least please? [18:01:54] jhathaway: me neither, and nothing in sal. it was part of the nodes that I depooled last week for teh migration but it should have been repooled with the rest of them [18:02:07] Would it make sense to rollback to see if recordlintjobs go down ? [18:02:21] Just to ensure if its unrelated or not ? [18:02:22] hnowlan: which theory, sorry? 
[18:02:38] duesen: "< hnowlan> at the least we currently think that the increase in volume and timing of recordLint *is* related" [18:03:01] ah - i know nothing about recordlint... [18:03:13] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:03:14] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:03:48] I can see how linting would cause load, but I don't see how it would cause more prewarm jobs to be enqueued. [18:04:01] I don't think the two are necessarily linked. [18:04:17] claime: logs complain of a yaml error, envoyproxy-hot-restarter[16211]: yaml-cpp: error at line 1560, column 44: end of map flow not found [18:04:46] nemo-yiannis: I am tempted to leave it sit. with the capacity bump we're now at reasonably safe levels and if subbu is right and we eventually see the jobs tail off I'd be happy [18:05:07] i just don't understand why linter jobs are increased [18:05:21] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:05:25] (SystemdUnitFailed) firing: envoyproxy.service on mw2388:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:31] I'm still trying to find an explanation for the large number of prewarm jobs. Coming up blank so far. [18:05:52] jhathaway: yeah, but I can't understand why, I ran puppet on another node and restarted envoy and it works fine. The config is supposed to be completely puppetized. 
[18:06:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:06:25] For the time being unless we can come up with a good explanation for the prewarms we can consider it to be one of the spikey patterns we have seen in the past [18:06:34] hmm odd, okay I'll dig a bit in and see if I can figure out why, if not I will leave it depooled claime [18:06:35] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:06:35] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:36] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [18:06:36] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:06:39] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:06:39] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host etherpad2001.codfw.wmnet [18:06:59] jhathaway: yeah, sorry to pull you into this, it may just be a coincidence with the merge of your syslog patch [18:07:05] in fact it probably is [18:07:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:07:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:07:21] Ok, shot in the blue: For some reason, Kiwix is failing to process the parsoid response after we turned of storage. Since Kiwix has automatic fallback, it now hits index.php. When it hits stale pages, that causes a prewarm job. Note that this is a REALLY wild guess. 
But it would explain the situation. [18:07:30] no problem at all [18:08:05] what's a kiwix? :] [18:08:16] oh I see [18:08:56] hnowlan: offline wikipedia. they spider all pages. [18:08:57] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:04] (03PS1) 10Dzahn: site: add etherpad2001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1002591 (https://phabricator.wikimedia.org/T357159) [18:09:13] One of the remaining major users pa the restbase parsoid endpoints [18:10:03] hnowlan: the User-Agent would be "mwoffliner" [18:10:07] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:12] https://github.com/openzim/mwoffliner [18:10:13] jhathaway: leave it, I think I know what happened [18:10:22] duesen: nice, thank you, was just about to ask [18:10:24] claime: ok [18:10:36] (03CR) 10Dzahn: [C: 03+2] site: add etherpad2001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1002591 (https://phabricator.wikimedia.org/T357159) (owner: 10Dzahn) [18:11:21] !log spicerack.netbox.NetboxError: Server etherpad2001 does not have any primary IP with a DNS name set. 
[18:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:12] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad2001.codfw.wmnet [18:12:13] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:14:14] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:15:05] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:15:05] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:15:05] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [18:15:09] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:15:12] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:15:46] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw2388.codfw.wmnet with reason: Envoy config changed for ipoid [18:16:02] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw2388.codfw.wmnet with reason: Envoy config changed for ipoid [18:17:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:17:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:51] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:18:22] sigh, error rate is unrelated to the ongoing issues [18:18:28] !log makevm cookbook in a cycle of adding and then removing DNS records [18:18:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:42] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:18:42] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:18:42] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [18:18:45] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:18:51] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host etherpad2001.codfw.wmnet [18:20:41] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad2001.codfw.wmnet [18:20:42] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:22:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:22:39] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2001.codfw.wmnet - 
dzahn@cumin1002" [18:23:33] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:23:33] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:33] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [18:23:37] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:23:40] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:25:08] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:25:12] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [18:25:15] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:25:21] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host etherpad2001.codfw.wmnet [18:28:10] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync - dzahn@cumin1002" [18:29:00] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync - dzahn@cumin1002" [18:29:40] !log makevm cookbook creates and then removes DNS records, sync-netbox-hiera cookbook fails with raise NetboxError(f"Server {self._server.name} does not have any primary IP with a DNS name set.") [18:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:24] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad2001.codfw.wmnet [18:33:26] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:34:46] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:34:46] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet 
on all recursors [18:34:49] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:34:53] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:35:21] !log attempting decom cookbook on "unverified" host etherpad2001, followed by makevm cookbook to create it again to get out of the cycle of adding and removing DNS records - fails with "is already in the cluster" even after decom finished T357159 [18:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:25] T357159: Site: 2 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 [18:36:50] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:37:44] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2001.codfw.wmnet - dzahn@cumin1002" [18:37:44] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:44] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2001.codfw.wmnet on all recursors [18:37:47] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2001.codfw.wmnet on all recursors [18:37:53] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host etherpad2001.codfw.wmnet [18:42:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 13:00:00 on db1133.eqiad.wmnet with reason: hush [18:42:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 13:00:00 on db1133.eqiad.wmnet with reason: hush [18:48:37] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad2002.codfw.wmnet [18:48:39] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:51:57] PROBLEM - 
mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:30] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [18:53:05] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [18:53:23] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:23] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2002.codfw.wmnet on all recursors [18:53:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2002.codfw.wmnet on all recursors [18:53:29] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:53:46] (03PS1) 10Eevans: sessionstore: remove acls & seeds sessionstore200[1-3] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002594 (https://phabricator.wikimedia.org/T356828) [18:55:02] !log attempt to create a completely new VM with a new name ALSO FAILS and removes DNS entries [18:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:59] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [18:56:49] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Remove records for VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [18:56:49] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:49] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2002.codfw.wmnet on all recursors [18:56:53] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2002.codfw.wmnet on all recursors [18:56:58] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host etherpad2002.codfw.wmnet [18:57:12] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync - dzahn@cumin1002" [18:58:05] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync - dzahn@cumin1002" [19:02:41] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad2002.codfw.wmnet [19:02:43] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [19:04:50] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [19:05:39] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [19:05:39] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:05:39] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad2002.codfw.wmnet on all recursors [19:05:43] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad2002.codfw.wmnet on all recursors [19:06:23] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [19:07:06] (03PS1) 10Dzahn: site: update 
etherpad VM 2001 to 2002 [puppet] - 10https://gerrit.wikimedia.org/r/1002596 (https://phabricator.wikimedia.org/T357159) [19:07:12] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM etherpad2002.codfw.wmnet - dzahn@cumin1002" [19:07:41] (03CR) 10Dzahn: [V: 03+2 C: 03+2] site: update etherpad VM 2001 to 2002 [puppet] - 10https://gerrit.wikimedia.org/r/1002596 (https://phabricator.wikimedia.org/T357159) (owner: 10Dzahn) [19:08:41] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host etherpad2002.codfw.wmnet with OS bookworm [19:08:50] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Site: 2 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host etherpad2002.codfw.wmnet with... [19:12:48] (03PS1) 10Jdlrobson: Make thanks button show again [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002437 (https://phabricator.wikimedia.org/T357202) [19:20:56] (03PS1) 10Htriedman: 12 feb 2024 redaction list update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 [19:22:23] (03CR) 10Htriedman: "tagging gmodena and tchin for quick review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 (owner: 10Htriedman) [19:22:49] (03CR) 10Dzahn: "this repo has many services in it, not just eventstreams. would be nice to start the commit message with the service name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 (owner: 10Htriedman) [19:23:38] (03PS2) 10Htriedman: Eventstreams: 12 feb 2024 redaction list update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 [19:24:00] (03CR) 10Htriedman: "updated the commit message! 
sorry about that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 (owner: 10Htriedman) [19:25:33] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on etherpad2002.codfw.wmnet with reason: host reimage [19:28:30] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on etherpad2002.codfw.wmnet with reason: host reimage [19:36:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:37:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:39:11] (03PS24) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [19:40:25] (03CR) 10CI reject: [V: 04-1] Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [19:42:11] (03PS1) 10Jdlrobson: Use @wikimedia/mediawiki.skins.clientpreferences@1.1.1 [extensions/MobileFrontend] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002607 (https://phabricator.wikimedia.org/T357212) [19:44:13] (03CR) 10JHathaway: "looks like this is now duplicated by @jbond's very similar function, https://gerrit.wikimedia.org/r/c/operations/puppet/+/931236" [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [19:47:17] (03PS25) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [19:48:39] (03PS1) 10Jdlrobson: Diffs: Localize number in timeago [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002438 (https://phabricator.wikimedia.org/T357079) [19:48:41] (03CR) 10CI 
reject: [V: 04-1] Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [19:52:40] (03CR) 10DLynch: [C: 03+1] MobileFrontend: Set fallback editor to 'visual' on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999813 (owner: 10Esanders) [19:54:34] (03CR) 10JHathaway: multirootca: depend on cfssl when generating CRLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [19:55:49] (03PS26) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [19:56:08] nemo-yiannis, hnowlan: page HTML latency is not looking good at all... https://grafana-rw.wikimedia.org/d/t_x3DEu4k/parsoid-health?forceLogin&from=now-3h&orgId=1&refresh=15m&to=now&viewPanel=73 [19:57:26] (03CR) 10CI reject: [V: 04-1] Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [19:58:18] (03CR) 10BCornwall: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [20:00:43] nemo-yiannis, hnowlan: the job enqueue rate seems mostly back to normal, however.
[20:02:14] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [20:04:06] (03PS1) 10Clare Ming: Update comment for testing scap fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002611 (https://phabricator.wikimedia.org/T350628) [20:04:13] (03CR) 10BCornwall: Add module for ncmonitor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [20:05:24] (03PS1) 10Eevans: cassandra: install cqlsh-{id} symlink [puppet] - 10https://gerrit.wikimedia.org/r/1002612 [20:07:40] (03CR) 10JHathaway: [C: 03+1] "it appears that pg_ctlcluster does this for you, unless you specify --skip-systemctl-redirect, but regardless this change seems sensible, " [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [20:10:21] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1002612 (owner: 10Eevans) [20:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:23:49] (03CR) 10Eevans: [C: 03+2] sessionstore: remove acls & seeds sessionstore200[1-3] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002594 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [20:25:22] (03Merged) 10jenkins-bot: sessionstore: remove acls & seeds 
sessionstore200[1-3] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002594 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [20:26:20] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [20:26:35] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [20:26:53] (03CR) 10JHathaway: puppetdb: allow both secret() and source for site key material (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [20:27:29] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [20:27:42] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [20:28:59] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [20:29:46] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:12] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [20:31:04] (03PS1) 10Ebernhardson: Bump version of extra plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1002619 (https://phabricator.wikimedia.org/T356651) [20:42:50] (03PS2) 10DLynch: MobileFrontend: Set fallback editor to 'visual' on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999813 (owner: 10Esanders) [20:55:30] (03CR) 10Ssingh: Add module for ncmonitor (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [20:56:20] (03CR) 10Ssingh: Add module for ncmonitor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [21:00:04] RoanKattouw, Urbanecm, 
cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T2100). [21:00:05] ebernhardson, Jdlrobson, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:30] o/ [21:01:01] Mine's a complete no-op with nothing to test that just paves the way for a future patch, so feel free to merge it whenever in this process. [21:02:10] o/ [21:02:14] i can deploy [21:02:17] o/ [21:02:22] (03PS3) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to 2024-01-18-182456 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596) [21:02:24] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-18-182456 to 2024-02-12-155846 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002624 (https://phabricator.wikimedia.org/T296937) [21:02:38] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-01-18-182630 to 2024-02-12-160222 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002625 (https://phabricator.wikimedia.org/T287978) [21:02:51] ebernhardson: are you around? [21:03:23] Jdlrobson: can your 3 go out together? 
[21:04:23] Kemayo: roger that - i'll let you know when it's live [21:05:16] Jdlrobson: i think i'll do your 1st patch separately and do your last 2 together [21:06:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002607 (https://phabricator.wikimedia.org/T357212) (owner: 10Jdlrobson) [21:07:59] cjming: yes all three could go out together [21:09:18] oops [21:13:46] (03CR) 10BCornwall: Add module for ncmonitor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [21:14:40] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002628 [21:16:30] Q for seasoned deployers out there -- if there are a number of patches for wmf-17, and if/since CI takes a while (sometimes up to 17-20 mins), if you manually +2 wmf17 patches, do you have to wait for each one to merge before rebasing/merging the next one? i'm assuming yes [21:18:07] so the only time saver to manually +2ing backports is that they can be scap backported together [21:20:23] cjming: No, you can merge them all in one go. You can also do that with scap itself – `scap backport 12345 12346 12347` will merge all of them and push them together. [21:21:13] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm [21:22:45] thanks James_F -- gtk! and if something goes wrong, doing a revert on just one isn't problematic if they were originally scap backported together? [21:23:24] No, though if you abort during scap (e.g. choose not to roll forward from debug) you'll have to remember which ones you want to revert. [21:23:58] got it - thanks! 
[21:24:29] jouncebot: nowandnext [21:24:29] For the next 0 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T2100) [21:24:29] In 0 hour(s) and 35 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T2200) [21:25:07] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1002607|Use @wikimedia/mediawiki.skins.clientpreferences@1.1.1 (T357212)]] [21:25:12] T357212: Regression: Can't toggle advanced mode on Special:MobileOptions - https://phabricator.wikimedia.org/T357212 [21:25:27] Jdlrobson: if you want to test your 1st patch, it's ready on test servers [21:25:37] oh whoops - just a sec [21:25:51] almost ready [21:26:27] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1002607|Use @wikimedia/mediawiki.skins.clientpreferences@1.1.1 (T357212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:37] now it's ready [21:27:17] Jdlrobson: ^^ [21:29:07] cjming: on it [21:31:22] cjming: LGTM please sync [21:31:28] will do [21:31:30] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [21:36:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:38:06] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1002607|Use @wikimedia/mediawiki.skins.clientpreferences@1.1.1 (T357212)]] (duration: 12m 58s) [21:38:21] T357212: Regression: Can't toggle advanced mode on Special:MobileOptions - https://phabricator.wikimedia.org/T357212 [21:39:05] Jdlrobson: 1st patch is live - 2nd + 3rd are going out now [21:39:06] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:42:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:42:43] cjming: thanks for update [21:44:47] np - sorry they didn't go out altogether - lesson learned [21:57:57] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1002437|Make thanks button show again (T357202)]], [[gerrit:1002438|Diffs: Localize number in timeago (T357079)]] [21:58:04] T357202: Thanks button missing from mobile diff - https://phabricator.wikimedia.org/T357202 [21:58:04] T357079: Wrong digit on bnwiki mobile diff - https://phabricator.wikimedia.org/T357079 [21:58:18] cjming: yey they merged [21:59:16] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1002437|Make thanks 
button show again (T357202)]], [[gerrit:1002438|Diffs: Localize number in timeago (T357079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:59:27] Jdlrobson: 2nd + 3rd patches ready for testing [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240212T2200). [22:00:35] cjming: both look great [22:00:37] please sync them both [22:00:44] great - syncing nw [22:00:45] now [22:00:49] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [22:04:57] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:05:08] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:06:02] eberhardson: happy to do your patch if you are around still - i'll just do David's config patch real quick after Jon's stuff finishes [22:07:14] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1002437|Make thanks button show again (T357202)]], [[gerrit:1002438|Diffs: Localize number in timeago (T357079)]] (duration: 09m 17s) [22:07:31] T357202: Thanks button missing from mobile diff - https://phabricator.wikimedia.org/T357202 [22:07:32] T357079: Wrong digit on bnwiki mobile diff - https://phabricator.wikimedia.org/T357079 [22:08:56] !log cjming@deploy2002 Started scap: Backport for [[gerrit:999813|MobileFrontend: Set fallback editor to 'visual' on labs]] [22:10:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:10:19] !log cjming@deploy2002 esanders 
and cjming: Backport for [[gerrit:999813|MobileFrontend: Set fallback editor to 'visual' on labs]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:10:23] !log cjming@deploy2002 esanders and cjming: Continuing with sync [22:11:01] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:11:06] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:11:21] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:12:25] ebernhardson: last call - happy to do yours if you're around (sorry my earlier ping spelled your handle wrong) [22:12:27] thanks cjming for your help today! [22:12:42] Jdlrobson: yw! all should be live [22:15:25] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:15:40] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:16:27] * ebernhardson totally spaced on it. will do it now [22:16:50] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:999813|MobileFrontend: Set fallback editor to 'visual' on labs]] (duration: 07m 53s) [22:16:55] Kemayo: your patch is live [22:17:06] ebernhardson: np! so you'll self-deploy? [22:17:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:17:21] cjming: i can, it'll take a few minutes to work through jenkins [22:17:44] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:17:50] ebernhardson: sounds good - thanks [22:17:58] cjming: you're all complete? 
[22:18:02] all done [22:18:51] all yours [22:18:56] excellent, thanks! [22:20:02] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:20:49] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncmonitor1001.eqiad.wmnet with OS bookworm [22:20:49] !log brett@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncmonitor1001.eqiad.wmnet [22:22:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:22:53] cjming: Thanks for getting mine out! [22:25:50] Kemayo: np! 
[22:26:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [22:26:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [22:29:43] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore2001.codfw.wmnet with reason: Decommissioning — T356828 [22:29:48] T356828: Decommission EOL hosts: sessionstore200[1-3] - https://phabricator.wikimedia.org/T356828 [22:29:58] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore2001.codfw.wmnet with reason: Decommissioning — T356828 [22:30:04] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore2002.codfw.wmnet with reason: Decommissioning — T356828 [22:30:19] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore2002.codfw.wmnet with reason: Decommissioning — T356828 [22:30:24] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore2003.codfw.wmnet with reason: Decommissioning — T356828 [22:30:38] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore2003.codfw.wmnet with reason: Decommissioning — T356828 [22:38:56] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:39:24] !log deployed patch for T357101 [22:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:04] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:40:37] !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] [22:40:42] T354793: SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs - https://phabricator.wikimedia.org/T354793 [22:40:43] T356526: High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 [22:41:58] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:42:33] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [22:49:12] !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] (duration: 08m 35s) [22:49:18] T354793: SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs - https://phabricator.wikimedia.org/T354793 [22:49:18] T356526: High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 [22:55:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:55:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:36] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host etherpad2002.codfw.wmnet with OS bookworm [23:03:36] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host etherpad2002.codfw.wmnet [23:15:33] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host 
ncmonitor1001.eqiad.wmnet with OS bookworm [23:16:49] !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list2.txt [23:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:53] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [23:25:06] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Yann" . # T357208 [23:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:14] T357208: Server-side upload request for Yann - https://phabricator.wikimedia.org/T357208 [23:44:56] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:44:56] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:45:26] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:46:04] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:46:04] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:46:36] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:51:16] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncmonitor1001.eqiad.wmnet with OS bookworm [23:51:20] SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for
ncmonitor - https://phabricator.wikimedia.org/T356710 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm executed with errors: - ncm... [23:56:05] (PS1) BCornwall: ncmonitor: Add parman config [puppet] - https://gerrit.wikimedia.org/r/1002674 [23:56:44] (PS2) BCornwall: ncmonitor: Add partman config [puppet] - https://gerrit.wikimedia.org/r/1002674 (https://phabricator.wikimedia.org/T356710) [23:57:30] (CR) Dzahn: [C: +1] ncmonitor: Add partman config [puppet] - https://gerrit.wikimedia.org/r/1002674 (https://phabricator.wikimedia.org/T356710) (owner: BCornwall)