[00:00:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P52283 and previous config saved to /var/cache/conftool/dbconfig/20230907-000004-arnaudb.json [00:15:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P52284 and previous config saved to /var/cache/conftool/dbconfig/20230907-001510-arnaudb.json [00:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T343198)', diff saved to https://phabricator.wikimedia.org/P52285 and previous config saved to /var/cache/conftool/dbconfig/20230907-003017-arnaudb.json [00:30:19] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [00:30:22] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [00:30:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [00:30:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T343198)', diff saved to https://phabricator.wikimedia.org/P52286 and previous config saved to /var/cache/conftool/dbconfig/20230907-003038-arnaudb.json [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/955010 [00:38:43] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/955010 (owner: 10TrainBranchBot) [00:43:23] 10SRE, 10Platform Team Initiatives (PHP7 (TEC4)), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - 
https://phabricator.wikimedia.org/T176370 (10Krinkle) [00:43:26] (03PS1) 10Eevans: cassandra: remove cassandra/twcs deployment [puppet] - 10https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) [00:48:01] (03PS1) 10Tim Starling: Add the Phonos init module as a dependency of the main module [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955061 (https://phabricator.wikimedia.org/T345414) [00:55:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/955010 (owner: 10TrainBranchBot) [00:57:26] (03CR) 10Tim Starling: [C: 03+2] Add the Phonos init module as a dependency of the main module [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955061 (https://phabricator.wikimedia.org/T345414) (owner: 10Tim Starling) [00:58:48] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:59:02] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:59:17] (03Merged) 10jenkins-bot: Add the Phonos init module as a dependency of the main module [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955061 (https://phabricator.wikimedia.org/T345414) (owner: 10Tim Starling) [01:10:29] !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.25/extensions/Phonos/extension.json: fix breakage of Phonos on parser-cached pages T345414 (duration: 06m 59s) [01:10:33] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [02:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:30] (03PS1) 10Andrew Bogott: Horizon: update docker version [puppet] - 10https://gerrit.wikimedia.org/r/955415 [02:36:12] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update docker version [puppet] - 10https://gerrit.wikimedia.org/r/955415 (owner: 10Andrew Bogott) [02:36:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T343198)', diff saved to https://phabricator.wikimedia.org/P52287 and previous config saved to /var/cache/conftool/dbconfig/20230907-023727-arnaudb.json [02:37:30] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [02:52:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P52288 and previous config saved to /var/cache/conftool/dbconfig/20230907-025233-arnaudb.json [03:05:38] (03PS3) 10Ryan Kemper: wdqs.data-transfer: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/949146 [03:07:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P52289 and previous config saved to /var/cache/conftool/dbconfig/20230907-030739-arnaudb.json [03:15:45] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs.data-transfer: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/949146 (owner: 10Ryan Kemper) [03:22:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T343198)', diff saved to https://phabricator.wikimedia.org/P52290 and previous config saved to /var/cache/conftool/dbconfig/20230907-032245-arnaudb.json [03:22:48] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [03:22:49] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [03:23:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [03:23:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T343198)', diff saved to https://phabricator.wikimedia.org/P52291 and previous config saved to 
/var/cache/conftool/dbconfig/20230907-032306-arnaudb.json [03:24:31] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [03:26:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:39:53] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [03:40:03] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [05:22:24] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:23:36] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:24:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [05:26:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [05:43:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T343198)', diff saved to https://phabricator.wikimedia.org/P52292 and previous config saved to /var/cache/conftool/dbconfig/20230907-054320-arnaudb.json [05:43:24] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [05:58:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P52293 and previous config saved to /var/cache/conftool/dbconfig/20230907-055826-arnaudb.json [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T0600) [06:00:06] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T0600). 
[06:13:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P52294 and previous config saved to /var/cache/conftool/dbconfig/20230907-061332-arnaudb.json [06:28:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T343198)', diff saved to https://phabricator.wikimedia.org/P52295 and previous config saved to /var/cache/conftool/dbconfig/20230907-062838-arnaudb.json [06:28:41] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [06:28:42] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [06:28:46] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on wdqs[1003-1004].eqiad.wmnet with reason: reboot [06:28:49] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs[1003-1004].eqiad.wmnet with reason: reboot [06:28:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [06:29:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52296 and previous config saved to /var/cache/conftool/dbconfig/20230907-062900-arnaudb.json [06:52:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [06:59:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [07:00:05] Amir1, apergos, and jnuche: Dear deployers, time to do the UTC morning backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T0700). [07:00:20] morning! 
let me see what's happening today [07:01:00] (03PS1) 10Muehlenhoff: nftables::service: Make the port optional [puppet] - 10https://gerrit.wikimedia.org/r/955419 [07:01:09] no patches scheduled, hrm. [07:02:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) Please open a new task for that. There is already a [[ https://github.com/wikimedia/operations-cookbooks/blob/ma... [07:02:26] I do see someone signed up for a training, so I'm in the google meet. If you are that person, you should have received an invite; if not, please ping me here [07:04:24] looking for mo_abualruz who does not seem to be on line right now [07:04:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet [07:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet [07:12:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet [07:16:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet [07:24:28] I'm no longer in the hangout, and assuming that something came up for our trainee so they couldn't make it. [07:24:34] see everyone next week! [07:32:12] (03PS1) 10JMeybohm: CI: Pull mariadb_sections into general fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/955422 (https://phabricator.wikimedia.org/T340843) [07:32:16] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T345798 (10phaultfinder) [07:35:37] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) >>! In T331699#9143177, @jhathaway wrote: >>>! In T331699#9136475, @MoritzMuehlenhoff wrote: >> One other option would be to simply start with a fresh, paral... 
[07:37:59] (03PS2) 10ArielGlenn: move dumps-related workers and nfs shares from core platform to data engineering [puppet] - 10https://gerrit.wikimedia.org/r/955338 [07:38:47] (03CR) 10ArielGlenn: [C: 03+2] move dumps-related workers and nfs shares from core platform to data engineering [puppet] - 10https://gerrit.wikimedia.org/r/955338 (owner: 10ArielGlenn) [07:40:41] !log installing file/libmagic security updates [07:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10ayounsi) Thanks, we had a quick chat on IRC about that and indeed that's the current conclusion. The extra details your provided (and fix suggestions... [07:57:41] !log installing grub2 updates from Bullseye point release [07:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:51] (03CR) 10Ayounsi: [C: 03+1] Set system console user timeout for Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/955342 (https://phabricator.wikimedia.org/T345710) (owner: 10Cathal Mooney) [08:00:05] hashar and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T0800). 
[08:00:32] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [08:01:22] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::ops: add ensure filter for envoy [puppet] - 10https://gerrit.wikimedia.org/r/955363 (owner: 10Majavah) [08:05:09] 10SRE, 10Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (10fgiunchedi) Followup from IRC: it isn't clear whether ping offload should be fully rolled out everywhere (some PoPs are missing) or retired entirely, cc @cmooney @ayounsi [08:10:32] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [08:13:59] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955579 (https://phabricator.wikimedia.org/T343727) [08:14:01] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955579 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [08:14:42] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955579 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [08:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:17:37] apergos: sorry I had to skip the backport training earlier today, I had to drive a friend [08:17:53] (last minute 
change.. :( ) [08:19:49] php-fpm are restarting [08:19:54] I didn't see you on the schedule either way, hashar, so I would not have waited for you! That's a good reason to open an actual task, so I know you're coming... anyways the other person didn't show either, maybe next week [08:20:14] if something's not on the workboard I tend to forget... [08:21:25] (03CR) 10JMeybohm: [C: 03+2] CI: Pull mariadb_sections into general fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/955422 (https://phabricator.wikimedia.org/T340843) (owner: 10JMeybohm) [08:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:21:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10elukey) @cmooney @ayounsi thanks a lot! On the host side, I'd try two things (not sure if they could help or not): 1) Do a simple reboot. The hosts... 
[08:22:12] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.25 refs T343727 [08:22:15] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [08:22:35] 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Clement_Goubert) [08:22:44] (03PS2) 10Filippo Giunchedi: prometheus: sort output for class/resource targets [puppet] - 10https://gerrit.wikimedia.org/r/924948 [08:22:59] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mc2040.codfw.wmnet with reason: T345802 - hw troubleshooting [08:23:01] T345802: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 [08:23:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mc2040.codfw.wmnet with reason: T345802 - hw troubleshooting [08:23:55] (03Abandoned) 10Filippo Giunchedi: Add Debian packaging for 21.3.0 [software/librenms] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/674563 (https://phabricator.wikimedia.org/T278309) (owner: 10Filippo Giunchedi) [08:24:56] (03Abandoned) 10Filippo Giunchedi: [WIP] mirror udp2log data into the logging pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494254 (https://phabricator.wikimedia.org/T126989) (owner: 10Filippo Giunchedi) [08:25:59] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:prometheus::ops: add ensure filter for envoy [puppet] - 10https://gerrit.wikimedia.org/r/955363 (owner: 10Majavah) [08:26:14] I am rolling back, FlaggedRevs has a failure of some sort [08:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:27:05] 10SRE, 10ops-codfw, 
10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10cmooney) p:05Triage→03Low [08:27:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:28:00] (03PS3) 10Filippo Giunchedi: Override Cumin batch sleep+size from command line [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/719470 [08:29:10] (03Merged) 10jenkins-bot: CI: Pull mariadb_sections into general fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/955422 (https://phabricator.wikimedia.org/T340843) (owner: 10JMeybohm) [08:30:31] hashar: ping me when you're done pretty please? [08:30:37] Still got some reboots to do [08:30:55] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/924948 (owner: 10Filippo Giunchedi) [08:30:57] (03PS2) 10JMeybohm: cxserver: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955333 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:31:03] (03PS2) 10JMeybohm: cxserver: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/955334 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:31:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10Gehel) Approved from my side. 
[08:33:28] (03CR) 10Elukey: [WIP] Add Helm chart for the recommendation-api-ng (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:34:46] (03PS3) 10Filippo Giunchedi: prometheus: sort output for class targets [puppet] - 10https://gerrit.wikimedia.org/r/924948 [08:34:49] (03CR) 10Filippo Giunchedi: prometheus: sort output for class targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/924948 (owner: 10Filippo Giunchedi) [08:35:48] (03CR) 10Volans: "I'll leave the actual review to the project owner." [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/719470 (owner: 10Filippo Giunchedi) [08:37:24] (03CR) 10Elukey: [WIP] Add Helm chart for the recommendation-api-ng (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:38:49] !log installing librsvg security updates [08:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) Hi # TL;DR cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bum... [08:41:41] (03CR) 10Elukey: [WIP] Add Helm chart for the recommendation-api-ng (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:42:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel ver... 
[08:43:27] (03CR) 10Jbond: [C: 03+1] nftables::service: Make the port optional [puppet] - 10https://gerrit.wikimedia.org/r/955419 (owner: 10Muehlenhoff) [08:44:20] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [08:45:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924948 (owner: 10Filippo Giunchedi) [08:45:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) There's a few more actionables here: 1. Re-evaluate our SLO target for conf hosts etcd service. Despite having exhausted the error budget... [08:46:01] (03PS1) 10Majavah: hieradata: set memcached.address for codfw1dev roles [puppet] - 10https://gerrit.wikimedia.org/r/955581 [08:46:40] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.41.0-wmf.24 - T343727 [08:46:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10cmooney) [08:46:43] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [08:46:53] (03Merged) 10jenkins-bot: sre.discovery.datacenter: Add services to EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/954904 (owner: 10Clément Goubert) [08:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:47:34] (KubernetesAPILatency) firing: (4) High 
Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:47:38] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: sort output for class targets [puppet] - 10https://gerrit.wikimedia.org/r/924948 (owner: 10Filippo Giunchedi) [08:48:22] (03PS3) 10Ilias Sarantopoulos: api-gateway: change liftwing hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) [08:48:31] (03CR) 10Cathal Mooney: [C: 03+2] Set system console user timeout for Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/955342 (https://phabricator.wikimedia.org/T345710) (owner: 10Cathal Mooney) [08:48:54] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: set memcached.address for codfw1dev roles [puppet] - 10https://gerrit.wikimedia.org/r/955581 (owner: 10Majavah) [08:49:06] (03CR) 10Majavah: [C: 03+2] hieradata: set memcached.address for codfw1dev roles [puppet] - 10https://gerrit.wikimedia.org/r/955581 (owner: 10Majavah) [08:49:08] (03Merged) 10jenkins-bot: Set system console user timeout for Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/955342 (https://phabricator.wikimedia.org/T345710) (owner: 10Cathal Mooney) [08:50:25] the FlaggedRevs issue that caused me to roll back is T345804 [08:50:26] T345804: TypeError: Argument 1 passed to RevisionReviewForm::setTag() must be of the type int, null given, called in /srv/mediawiki/php-1.41.0-wmf.25/extensions/FlaggedRevs/frontend/specialpages/actions/RevisionReview.php on line 287 - https://phabricator.wikimedia.org/T345804 [08:50:36] (03PS2) 10AikoChou: changeprop: allow retries for liftwing streams with 500 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/954969 [08:50:38] (03PS1) 10AikoChou: ml-services: add annotations for inference_services [deployment-charts] - 
10https://gerrit.wikimedia.org/r/955582 (https://phabricator.wikimedia.org/T344058) [08:51:34] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [08:52:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52297 and previous config saved to /var/cache/conftool/dbconfig/20230907-085159-arnaudb.json [08:52:03] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:52:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:53:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10fgiunchedi) >>! In T345738#9148513, @akosiaris wrote: > Hi > > # TL;DR > > cadvisor is to blame. Adding @fgiunchedi for his information and a thumb... 
[08:54:06] (03PS1) 10Alexandros Kosiaris: configcluster: Disable cadvisor in codfw [puppet] - 10https://gerrit.wikimedia.org/r/955583 (https://phabricator.wikimedia.org/T345738) [08:54:32] (03CR) 10CI reject: [V: 04-1] configcluster: Disable cadvisor in codfw [puppet] - 10https://gerrit.wikimedia.org/r/955583 (https://phabricator.wikimedia.org/T345738) (owner: 10Alexandros Kosiaris) [08:56:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [08:57:31] (03CR) 10Muehlenhoff: [C: 03+2] nftables::service: Make the port optional [puppet] - 10https://gerrit.wikimedia.org/r/955419 (owner: 10Muehlenhoff) [08:58:01] (03PS1) 10Elukey: charts: add the python-webapp chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 [08:58:41] (03PS2) 10Elukey: charts: add the python-webapp chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 [08:58:47] (03PS1) 10Jbond: systemd::timer::job: update spec to use super() [puppet] - 10https://gerrit.wikimedia.org/r/955585 (https://phabricator.wikimedia.org/T345719) [08:58:49] (03PS1) 10Jbond: systemd::timer: add support for ConditionalPathExists [puppet] - 10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) [08:58:51] (03PS1) 10Jbond: puppetserver: only clean reports dir if dir exists [puppet] - 10https://gerrit.wikimedia.org/r/955587 (https://phabricator.wikimedia.org/T345719) [08:59:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10JMeybohm) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> cad... 
[08:59:34] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1133.eqiad.wmnet with OS bullseye [08:59:38] (03CR) 10Elukey: "I have the feeling that we could just do https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955584/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:59:46] (03CR) 10CI reject: [V: 04-1] systemd::timer::job: update spec to use super() [puppet] - 10https://gerrit.wikimedia.org/r/955585 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [08:59:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> ca... [09:00:00] (03CR) 10CI reject: [V: 04-1] systemd::timer: add support for ConditionalPathExists [puppet] - 10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [09:02:35] (03CR) 10JMeybohm: [C: 03+1] cxserver: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955333 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:02:41] (03CR) 10JMeybohm: [C: 03+1] cxserver: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/955334 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:03:02] (03PS2) 10Jbond: systemd::timer: add support for ConditionalPathExists [puppet] - 10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) [09:03:04] (03PS2) 10Jbond: puppetserver: only clean reports dir if dir exists [puppet] - 10https://gerrit.wikimedia.org/r/955587 (https://phabricator.wikimedia.org/T345719) [09:03:49] (03CR) 10CI reject: [V: 04-1] systemd::timer: add support for ConditionalPathExists [puppet] - 
10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [09:05:29] (03PS1) 10Jbond: puppetserver::wmcs: correct parameter name [puppet] - 10https://gerrit.wikimedia.org/r/955588 (https://phabricator.wikimedia.org/T345702) [09:07:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P52298 and previous config saved to /var/cache/conftool/dbconfig/20230907-090706-arnaudb.json [09:09:25] (03CR) 10Jbond: [C: 03+2] puppetserver::wmcs: correct parameter name [puppet] - 10https://gerrit.wikimedia.org/r/955588 (https://phabricator.wikimedia.org/T345702) (owner: 10Jbond) [09:10:54] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (10cmooney) p:05Triage→03Low [09:12:26] PROBLEM - Host mw1373 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:28] PROBLEM - Host mw1384 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:52] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1133.eqiad.wmnet with reason: host reimage [09:13:31] ^On it [09:14:47] !log foreachwikiindblist private extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php | tee oathauth-multiple-private.log # T242031 [09:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:50] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [09:15:17] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1133.eqiad.wmnet with reason: host reimage [09:15:22] RECOVERY - Host mw1373 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [09:15:46] (03CR) 10EoghanGaffney: [C: 03+1] vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [09:15:52] 
RECOVERY - Host mw1384 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [09:17:40] (03PS5) 10Muehlenhoff: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 [09:19:16] (03PS1) 10Majavah: Set OATHAuth multiple devices READ_NEW for all fishbows, privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955670 (https://phabricator.wikimedia.org/T242031) [09:19:18] (03PS1) 10Majavah: Set OATHAuth multiple devices WRITE_BOTH for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955671 (https://phabricator.wikimedia.org/T242031) [09:22:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [09:22:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P52299 and previous config saved to /var/cache/conftool/dbconfig/20230907-092212-arnaudb.json [09:22:16] !log installing grub2 updates from Bullseye point release [09:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:31] (03PS2) 10Majavah: Set OATHAuth multiple devices READ_NEW for all fishbows, privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955670 (https://phabricator.wikimedia.org/T242031) [09:22:33] (03PS2) 10Majavah: Set OATHAuth multiple devices WRITE_BOTH for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955671 (https://phabricator.wikimedia.org/T242031) [09:24:20] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1134.eqiad.wmnet with OS bullseye [09:28:37] jouncebot: nowandnext [09:28:37] For the next 0 hour(s) and 31 minute(s): MediaWiki train - Utc Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T0800) [09:28:37] In 0 hour(s) and 31 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1000) [09:28:38] In 0 hour(s) and 31 
minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1000) [09:30:30] (03PS1) 10Ladsgroup: RevisionReviewForm: allow setting `null` tag [extensions/FlaggedRevs] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955062 (https://phabricator.wikimedia.org/T345804) [09:30:44] (03CR) 10AikoChou: [C: 03+1] "agree, we could use the same chart for python web applications we have" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [09:32:19] (03CR) 10Kevin Bazira: [C: 03+1] "Good idea to re-use the helm chart." [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [09:34:03] PROBLEM - DPKG on ms-be2064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:34:03] PROBLEM - DPKG on ms-be2060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:34:09] (03CR) 10Kevin Bazira: [WIP] Add Helm chart for the recommendation-api-ng (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [09:34:31] (03CR) 10CI reject: [V: 04-1] RevisionReviewForm: allow setting `null` tag [extensions/FlaggedRevs] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955062 (https://phabricator.wikimedia.org/T345804) (owner: 10Ladsgroup) [09:35:21] (03CR) 10Ladsgroup: [C: 03+2] RevisionReviewForm: allow setting `null` tag [extensions/FlaggedRevs] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955062 (https://phabricator.wikimedia.org/T345804) (owner: 10Ladsgroup) [09:35:29] (03PS2) 10Hashar: RevisionReviewForm: allow setting `null` tag [extensions/FlaggedRevs] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955062 (https://phabricator.wikimedia.org/T345804) (owner: 10Ladsgroup) [09:35:50] (03CR) 10Hashar: [C: 03+2] "Sorry I re cherry picked it by mistake :/" 
[extensions/FlaggedRevs] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955062 (https://phabricator.wikimedia.org/T345804) (owner: 10Ladsgroup) [09:36:10] Amir1: I guess I am not awake yet, I re cherry picked it bah :/ [09:36:38] it shouldn't allow you to re-cherry-pick it to the same branch [09:36:53] ah, it made a new PS [09:36:55] meh [09:37:09] PROBLEM - DPKG on ms-be2046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:37:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T343198)', diff saved to https://phabricator.wikimedia.org/P52300 and previous config saved to /var/cache/conftool/dbconfig/20230907-093718-arnaudb.json [09:37:19] PROBLEM - DPKG on ms-be2061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:37:19] PROBLEM - DPKG on ms-be2057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:37:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:37:22] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:37:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:37:43] (03PS11) 10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [09:38:05] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1134.eqiad.wmnet with reason: host reimage [09:38:11] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1133.eqiad.wmnet with OS bullseye [09:39:02] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host 
puppetserver1002.eqiad.wmnet with OS bookworm [09:39:04] (03Merged) 10jenkins-bot: RevisionReviewForm: allow setting `null` tag [extensions/FlaggedRevs] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955062 (https://phabricator.wikimedia.org/T345804) (owner: 10Ladsgroup) [09:39:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Aklapper) [09:39:57] (03PS3) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [09:40:02] (03CR) 10CI reject: [V: 04-1] P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [09:40:42] hashar: merged, are you deploying or should I? [09:41:08] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1134.eqiad.wmnet with reason: host reimage [09:41:10] PROBLEM - DPKG on ms-be1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:41:12] PROBLEM - DPKG on ms-be2059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:41:32] Amir1: I will handle it :) [09:42:06] awesome.
Let me know once you're done, I have some deploys to do [09:42:48] I made a comment on the patch to master, I don't know what will happen when the tag is set to null :D [09:43:09] looks like one of the callers should cast null to int(0) then; I don't know anything about FlaggedRevs or its data structure [09:43:18] (03CR) 10Ilias Sarantopoulos: [C: 03+1] charts: add the python-webapp chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [09:44:06] (03PS1) 10Hashar: Revert "all wikis to 1.41.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955673 (https://phabricator.wikimedia.org/T343727) [09:44:16] PROBLEM - DPKG on ms-be2065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:44:31] (03CR) 10Hashar: [C: 03+2] "I already deployed it but forgot to push back to Gerrit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955673 (https://phabricator.wikimedia.org/T343727) (owner: 10Hashar) [09:44:54] PROBLEM - DPKG on ms-be2047 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:45:15] (03Merged) 10jenkins-bot: Revert "all wikis to 1.41.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955673 (https://phabricator.wikimedia.org/T343727) (owner: 10Hashar) [09:45:23] (03PS12) 10FNegri: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [09:46:14] !log hashar@deploy1002 Started scap: Backport for [[gerrit:955062|RevisionReviewForm: allow setting `null` tag (T345804)]] [09:46:23] T345804: TypeError: Argument 1 passed to RevisionReviewForm::setTag() must be of the type int, null given, called in /srv/mediawiki/php-1.41.0-wmf.25/extensions/FlaggedRevs/frontend/specialpages/actions/RevisionReview.php on line 287 - https://phabricator.wikimedia.org/T345804 [09:46:25] Amir1: I will promote the
group2 wikis immediately after [09:46:34] sounds good [09:47:34] PROBLEM - DPKG on ms-be2062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:47:57] !log hashar@deploy1002 ladsgroup and hashar: Backport for [[gerrit:955062|RevisionReviewForm: allow setting `null` tag (T345804)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [09:48:15] !log hashar@deploy1002 ladsgroup and hashar: Continuing with sync [09:48:49] (03PS3) 10Elukey: charts: add the python-webapp chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 [09:49:08] (03CR) 10Elukey: charts: add the python-webapp chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [09:50:11] (03PS11) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [09:50:45] 10SRE, 10Machine-Learning-Team, 10MinT, 10serviceops, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) 05Open→03Resolved a:03Pginer-WMF Since MinT [was launched](https://diff.wikimedia.org/2023/06/13/mint-support... 
[09:51:14] PROBLEM - DPKG on ms-be2063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:52:14] (03PS12) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [09:53:38] PROBLEM - DPKG on ms-be2066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:08] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:955062|RevisionReviewForm: allow setting `null` tag (T345804)]] (duration: 07m 54s) [09:54:13] T345804: TypeError: Argument 1 passed to RevisionReviewForm::setTag() must be of the type int, null given, called in /srv/mediawiki/php-1.41.0-wmf.25/extensions/FlaggedRevs/frontend/specialpages/actions/RevisionReview.php on line 287 - https://phabricator.wikimedia.org/T345804 [09:54:18] PROBLEM - DPKG on ms-be2068 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:36] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955674 (https://phabricator.wikimedia.org/T343727) [09:54:38] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955674 (https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [09:55:14] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver1002.eqiad.wmnet with reason: host reimage [09:55:37] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955674 
(https://phabricator.wikimedia.org/T343727) (owner: 10TrainBranchBot) [09:55:52] PROBLEM - DPKG on ms-be1068 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:57:22] PROBLEM - DPKG on ms-be2045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:58:13] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver1002.eqiad.wmnet with reason: host reimage [09:59:26] versions.toolforge.org seems to be displaying incorrect information [09:59:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:59:35] PROBLEM - DPKG on ms-be2044 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:59:43] PROBLEM - DPKG on ms-be1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:00:05] damn rsync error :( [10:00:06] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1000). [10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1000) [10:00:12] (03PS13) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [10:00:40] scap pull failed on a few hosts [10:00:52] hashar: reboots in progress [10:01:13] aren't they unpooled when rebooted, which would remove the hosts from the scap dsh groups?
[10:01:21] PROBLEM - DPKG on ms-be1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:01:59] PROBLEM - DPKG on ms-be2048 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:02:07] I got failures from parse1001, mw1460 mw1459 mw1453 mw1389 [10:02:25] though they all come from the mw1420 rsync proxy so I guess that is the reason [10:02:29] hashar: they're pooled=no, not pooled=invalid [10:02:41] yeah [10:02:49] So they stay in the dsh group [10:03:15] I'll run a manual scap pull on these hosts once 1420 is back up [10:03:29] and regardless of the pool status I guess scap still attempts to use mw1420 as a proxy (I don't know what is the source for the proxy list) [10:03:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:03:56] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1134.eqiad.wmnet with OS bullseye [10:05:15] hashar: manual scap pull done [10:05:16] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.25 refs T343727 [10:05:19] T343727: 1.41.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T343727 [10:06:07] PROBLEM - DPKG on ms-be2069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:45] 10:02:34 6 apaches had sync errors [10:07:19] so I guess I missed one: mw1429 :) [10:07:40] I am scap pulling it [10:07:59] (03CR) 10Clément Goubert: [C: 03+1] changeprop: Rule for refreshUserImpactJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/955319 (https://phabricator.wikimedia.org/T344428) (owner: 10Urbanecm) [10:08:34]
(KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:08:45] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver1002.eqiad.wmnet with OS bookworm [10:09:55] (03CR) 10Urbanecm: [C: 03+2] changeprop: Rule for refreshUserImpactJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/955319 (https://phabricator.wikimedia.org/T344428) (owner: 10Urbanecm) [10:10:40] (03Merged) 10jenkins-bot: changeprop: Rule for refreshUserImpactJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/955319 (https://phabricator.wikimedia.org/T344428) (owner: 10Urbanecm) [10:10:59] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm [10:15:29] (03PS1) 10Jbond: cluster::management: drop hardcoded sshkeys [puppet] - 10https://gerrit.wikimedia.org/r/955677 [10:16:07] (03CR) 10Jbond: [C: 03+2] cluster::management: drop hardcoded sshkeys [puppet] - 10https://gerrit.wikimedia.org/r/955677 (owner: 10Jbond) [10:18:10] Amir1: mediawiki looks better, you can do your deploys :] [10:18:19] awesome thanks [10:18:21] PROBLEM - DPKG on ms-be2049 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:18:24] I finish something and get to it [10:19:10] there are some failure to load etcd config from a k8s container though mw-api-ext.eqiad.main-844b4bfb75-dlmqz [10:19:38] I'm still rebooting so some hosts may fail [10:19:51] hashar: Huh, that's strange [10:20:35] filed it as https://phabricator.wikimedia.org/T345812 [10:21:01] and the errors all come from host.name=mw-api-ext.eqiad.main-844b4bfb75-dlmqz [10:21:04] If it's only one pod, I'll kill it [10:21:57] !log urbanecm@deploy1002 helmfile [eqiad] START 
helmfile.d/services/changeprop-jobqueue: apply [10:22:27] maybe that was transient [10:23:01] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [10:23:49] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [10:24:33] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [10:24:41] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:24:54] hashar: It hasn't logged it in ~30 minutes, is it a load time error or a run time error? [10:26:07] PROBLEM - DPKG on ms-be2058 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:26:09] I can delete it as a precaution, replicaset will recreate it [10:29:06] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage [10:30:28] Considering it's got a request and a reqId, I'm inclined to say runtime, and since it hasn't logged it in a while, probably transient [10:30:46] akosiaris ^ opinion [10:32:16] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage [10:33:24] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:53] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [10:33:56] :( [10:34:36] Amir1: Well the cookbook failed, if you want to deploy something now's the time :D [10:35:14] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:35:58] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw[1441-1442,1451].eqiad.wmnet [10:35:58] !log urbanecm@deploy1002 helmfile 
[staging] START helmfile.d/services/changeprop-jobqueue: apply [10:35:59] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw[1441-1442,1451].eqiad.wmnet [10:36:22] okay, cool [10:40:22] (03PS2) 10Ladsgroup: Pin pagelinks normalization stage to old in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955331 (https://phabricator.wikimedia.org/T345732) [10:40:26] (03CR) 10Ladsgroup: [C: 03+2] Pin pagelinks normalization stage to old in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955331 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:40:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955331 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:42:33] (03Merged) 10jenkins-bot: Pin pagelinks normalization stage to old in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955331 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:42:51] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:955331|Pin pagelinks normalization stage to old in production (T345732)]] [10:42:54] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:44:19] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:955331|Pin pagelinks normalization stage to old in production (T345732)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [10:44:36] RECOVERY - DPKG on ms-be2065 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:45:40] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [10:46:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:11] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:47:13] claime: the staging release refuses to work for some reason (but is logged as DONE in SAL, confusingly) :-/. in my terminal, it says `Error: UPGRADE FAILED: release staging failed, and has been rolled back due to atomic being set: timed out waiting for the condition`. [10:47:21] any idea what might be going on? [10:47:46] (the eqiad/codfw releases went out w/o issues) [10:47:56] Hmm give me a second to check something [10:48:49] urbanecm: we're out of IPs on staging apparently so it can't do a rollover start [10:49:37] urbanecm: 22m Warning FailedCreatePodSandBox pod/changeprop-staging-5f8fb78b9c-vbh9r Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1c1b14205562eb5ba58859d681dacd93f25d9912d7c91932f0f39a541dd72f3a" network for pod "changeprop-staging-5f8fb78b9c-vbh9r": networkPlugin cni failed to set up pod [10:49:39] "changeprop-staging-5f8fb78b9c-vbh9r_changeprop-jobqueue" network: failed to request IPv4 addresses: Assigned 0 out of 1 requested IPv4 addresses; No more free affine blocks and strict affinity enabled [10:50:14] oh, interesting. is it ok to leave the patch deployed to eqiad/codfw only? or should we revert until we have more IPs? [10:51:02] I think it's ok, I'll try to do a hard sync since it's staging [10:51:56] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:955331|Pin pagelinks normalization stage to old in production (T345732)]] (duration: 09m 05s) [10:51:59] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:52:14] ok, ty. 
[10:52:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:22] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [10:55:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [10:56:35] !log mwmaint1002: `/usr/local/bin/mw-cli-wrapper /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue` (T344428, testing with r955319 deployed) [10:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:38] T344428: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 [10:57:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43161/console" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [11:03:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [11:03:38] urbanecm: ok it's going to fail again because a sync plays by the rules and won't just destroy a pod to deploy a new one if constrained by ip resources [11:04:30] !log cgoubert@deploy1002 helmfile [staging] DONE 
helmfile.d/services/changeprop-jobqueue: sync [11:05:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [11:08:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) [11:09:17] makes sense. sounds like destroying a pod might help, but...dunno if that's a good idea :D [11:09:45] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:10:00] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:10:14] (03PS6) 10Jbond: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [11:10:17] (03PS1) 10Jbond: nftables::service: improve error reporting [puppet] - 10https://gerrit.wikimedia.org/r/955687 [11:10:55] (03CR) 10Muehlenhoff: [C: 03+1] "Good idea :-)" [puppet] - 10https://gerrit.wikimedia.org/r/955687 (owner: 10Jbond) [11:11:22] (03CR) 10Jbond: [C: 03+2] nftables::service: improve error reporting [puppet] - 10https://gerrit.wikimedia.org/r/955687 (owner: 10Jbond) [11:11:40] urbanecm: So what I ended up doing is cordoning the kube node where it wanted to schedule it, and which was already full ipam-wise, redeploying, and uncordoning [11:11:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [11:12:18] ty! [11:12:18] Amir1: You good with the backports, I can start rebooting again? [11:12:31] claime: yeah feel free to! 
[11:12:37] a'ight [11:13:07] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:16:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [11:17:00] (03PS7) 10Jbond: Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [11:17:02] (03PS1) 10Jbond: nftables::service: keep port as undef if its undef [puppet] - 10https://gerrit.wikimedia.org/r/955694 [11:18:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43163/console" [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [11:18:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) p:05Triage→03Medium Given this isn't urgent and we have multiple ways of dealing with this, I 've re-enabled pup... 
[11:18:54] (03CR) 10Jbond: [C: 03+2] nftables::service: keep port as undef if its undef [puppet] - 10https://gerrit.wikimedia.org/r/955694 (owner: 10Jbond) [11:22:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [11:22:53] (03PS2) 10Jbond: systemd::timer::job: update spec to use super() [puppet] - 10https://gerrit.wikimedia.org/r/955585 (https://phabricator.wikimedia.org/T345719) [11:22:55] (03PS3) 10Jbond: systemd::timer: add support for ConditionalPathExists [puppet] - 10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) [11:22:57] (03PS3) 10Jbond: puppetserver: only clean reports dir if dir exists [puppet] - 10https://gerrit.wikimedia.org/r/955587 (https://phabricator.wikimedia.org/T345719) [11:23:07] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet [11:24:42] (03PS1) 10Majavah: P:access_new_install: move files under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955697 [11:24:44] (03PS1) 10Majavah: P:icinga: move files under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955698 [11:24:46] (03PS1) 10Majavah: P:installserver::proxy: move templates under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955699 [11:24:48] (03PS1) 10Majavah: P:mediawiki::deployment: move templates under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955700 [11:24:50] (03PS1) 10Majavah: P:memcached::memkeys: move templates under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955701 [11:24:52] (03PS1) 10Majavah: P:releases: move templates under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955702 [11:25:25] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Infrastructure-Foundations: GRUB fails to determine the disks to install to on swift backends - https://phabricator.wikimedia.org/T345816 (10MoritzMuehlenhoff) [11:28:30] RECOVERY - DPKG on ms-be2045 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:29:31] 
!log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet [11:29:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [11:30:24] RECOVERY - DPKG on ms-be2044 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:31:46] RECOVERY - DPKG on ms-be1069 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:32:30] RECOVERY - DPKG on ms-be2048 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:34:34] RECOVERY - DPKG on ms-be2060 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:37:46] RECOVERY - DPKG on ms-be2046 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:37:56] RECOVERY - DPKG on ms-be2057 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:37:56] RECOVERY - DPKG on ms-be2061 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:38:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] "nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955422 (https://phabricator.wikimedia.org/T340843) (owner: 10JMeybohm) [11:39:29] (03PS1) 10Slyngshede: LDAPBACKEND: Add validator for checking CommonName [software/bitu] - 10https://gerrit.wikimedia.org/r/955713 [11:42:00] RECOVERY - DPKG on ms-be1070 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:42:02] RECOVERY - DPKG on ms-be2059 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:44:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, module the minor typo in the commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:45:56] RECOVERY - DPKG on ms-be2047 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:47:00] (03PS2) 10Slyngshede: LDAPBACKEND: Add validator for checking CommonName [software/bitu] - 10https://gerrit.wikimedia.org/r/955713 [11:48:42] RECOVERY - DPKG on ms-be2062 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:49:14] RECOVERY - DPKG on ms-be2049 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:50:32] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Infrastructure-Foundations: GRUB fails to determine the disks to install to on swift backends - https://phabricator.wikimedia.org/T345816 (10MoritzMuehlenhoff) One additional data point; this only affects the disk schema setup by ms-be.cfg, none of the syst... [11:52:34] RECOVERY - DPKG on ms-be2063 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:52:57] (03PS1) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) [11:55:06] RECOVERY - DPKG on ms-be2066 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:55:46] RECOVERY - DPKG on ms-be2068 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:56:21] (03PS3) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [11:57:14] RECOVERY - DPKG on ms-be2058 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:57:20] RECOVERY - DPKG on ms-be1068 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:58:43] (03CR) 10CI reject: [V: 04-1] sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 
10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [11:59:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10jijiki) Dear @Jhancock.wm or @Papaul, server can be shutdown and checked at your convenience, this part of the stack has failovers in place. Thank you! [11:59:23] (03CR) 10JMeybohm: [C: 04-1] "There have been a bunch of module upgrade since then (maybe scaffold as well). Please update all dependencies and double check if there ha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [11:59:50] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1200) [12:00:18] (03PS1) 10Jbond: stie.pp: move server to puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/955063 (https://phabricator.wikimedia.org/T340739) [12:01:56] (03CR) 10Filippo Giunchedi: [C: 03+1] P:icinga: move files under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955698 (owner: 10Majavah) [12:02:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:02:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:02:59] (03CR) 10Jbond: [C: 03+2] stie.pp: move server to puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/955063 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [12:03:18] (03PS3) 10JMeybohm: kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) [12:04:23] !log Starting eqiad jobrunner reboots [12:04:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:48] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [12:04:53] 10SRE, 10Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (10fgiunchedi) Related: {T345743} [12:04:56] RECOVERY - DPKG on ms-be2064 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:06:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks for this change. Premise appears correct to me. I have a couple of minor comments here and there though." [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [12:07:00] RECOVERY - DPKG on ms-be2069 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:08:06] godog: fyi, recursive phab comment is recursive, I'm guessing you were meaning {T345809} [12:08:07] T345809: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 [12:09:40] p858snake|L: lol! indeed, thank you I've fixed it [12:10:05] there should totally be phab automation for awards when someone does recursive comments [12:10:13] that's the second time for me [12:12:52] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: route Growth team alerts [puppet] - 10https://gerrit.wikimedia.org/r/953347 (https://phabricator.wikimedia.org/T345202) (owner: 10Urbanecm) [12:13:38] (03PS1) 10Muehlenhoff: Clean up some cruft from past experiments [puppet] - 10https://gerrit.wikimedia.org/r/955718 [12:21:14] (03CR) 10Alexandros Kosiaris: PHPFPMTooBusy: Point to public available runbook (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/954947 (owner: 10Alexandros Kosiaris) [12:21:24] (03CR) 10Filippo Giunchedi: [C: 03+2] cxserver: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955333 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [12:21:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 while I rethink this." 
[alerts] - 10https://gerrit.wikimedia.org/r/954947 (owner: 10Alexandros Kosiaris) [12:21:31] (03CR) 10Filippo Giunchedi: [C: 03+2] cxserver: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/955334 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [12:23:01] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:23:40] (03CR) 10LSobanski: [C: 03+1] gitlab: Add unlock command to gitlab-backup script [puppet] - 10https://gerrit.wikimedia.org/r/954916 (owner: 10EoghanGaffney) [12:23:40] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:27:45] (03PS1) 10FNegri: [openstack] Replace OS version in new manifests [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) [12:27:50] (03CR) 10Muehlenhoff: PHPFPMTooBusy: Point to public available runbook (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/954947 (owner: 10Alexandros Kosiaris) [12:28:10] (03CR) 10CI reject: [V: 04-1] [openstack] Replace OS version in new manifests [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [12:28:52] (03PS1) 10Jbond: puppetserver::git: ensure we create the user directory [puppet] - 10https://gerrit.wikimedia.org/r/955721 (https://phabricator.wikimedia.org/T345830) [12:31:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [12:32:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/955587 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [12:33:18] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: remove_old_puppet_reports.service,sync-puppet-ca.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:53] (03CR) 10Jbond: [C: 03+2] systemd::timer::job: update spec to use super() [puppet] - 10https://gerrit.wikimedia.org/r/955585 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [12:33:56] (03CR) 10Jbond: [C: 03+2] systemd::timer: add support for ConditionalPathExists [puppet] - 10https://gerrit.wikimedia.org/r/955586 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [12:33:58] (03CR) 10Jbond: [C: 03+2] puppetserver: only clean reports dir if dir exists [puppet] - 10https://gerrit.wikimedia.org/r/955587 (https://phabricator.wikimedia.org/T345719) (owner: 10Jbond) [12:34:24] (03CR) 10Jbond: [C: 03+2] puppetserver::git: ensure we create the user directory [puppet] - 10https://gerrit.wikimedia.org/r/955721 (https://phabricator.wikimedia.org/T345830) (owner: 10Jbond) [12:34:59] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [12:35:06] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [12:36:44] (03PS2) 10FNegri: [openstack] Replace OS version in new manifests [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) [12:37:33] (03CR) 10Elukey: charts: add the python-webapp chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [12:42:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two typos inline" [software/bitu] - 
10https://gerrit.wikimedia.org/r/955713 (owner: 10Slyngshede) [12:43:59] (03CR) 10JMeybohm: [C: 04-1] charts: add the python-webapp chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [12:45:45] jouncebot: nowandnext [12:45:45] For the next 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1200) [12:45:45] In 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1300) [12:46:13] jobrunners are being rebooted, expect some scap failures [12:46:27] if there are some, just send them my way and I'll scap pull manually as needed [12:47:23] ok, thanks [12:47:38] i have some patches of my own too, but I don't think I have enough time before the backport window so I'll do those after [12:48:00] ack [12:48:38] FYI I'm about halfway through and started 45 minutes ago [12:50:47] (03CR) 10FNegri: [openstack] Replace OS version in new manifests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [12:50:57] (03PS1) 10Jbond: puppetserver: drop the puppetserver hiera key and use puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955722 (https://phabricator.wikimedia.org/T345067) [12:52:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/955718 (owner: 10Muehlenhoff) [12:52:50] (03CR) 10Muehlenhoff: [C: 03+2] Clean up some cruft from past experiments [puppet] - 10https://gerrit.wikimedia.org/r/955718 (owner: 10Muehlenhoff) [12:59:28] (03CR) 10Effie Mouzeli: [C: 03+1] "I suggest we add a" [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1300). [13:00:05] kemayo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:26] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add temporary buster-based PHP7.4 icu67 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954700 (https://phabricator.wikimedia.org/T329491) (owner: 10Alexandros Kosiaris) [13:00:39] o/ [13:00:47] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Merging per IRC ok." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954700 (https://phabricator.wikimedia.org/T329491) (owner: 10Alexandros Kosiaris) [13:00:51] RECOVERY - DPKG on ms-be1071 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:00:57] Kemayo: does the ordering of your patches matter? 
[13:01:05] (03PS4) 10Elukey: charts: add the python-webapp chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 [13:01:06] Nope [13:01:07] (03PS1) 10Elukey: python-webapp: update mesh module to 1.4.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/955725 [13:01:32] (03CR) 10Majavah: [C: 03+2] Edit check: Turn on when ecenable=1 is set [extensions/VisualEditor] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955059 (https://phabricator.wikimedia.org/T345297) (owner: 10DLynch) [13:01:35] (03CR) 10Majavah: [C: 03+2] Enable edit check on en/fr beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955368 (https://phabricator.wikimedia.org/T345658) (owner: 10Esanders) [13:02:01] (03CR) 10Effie Mouzeli: [C: 03+1] P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:02:07] It's that great set of patches where they should both result in absolutely no visible change on any production wiki. 
:D [13:02:14] :P [13:02:52] (03CR) 10Elukey: charts: add the python-webapp chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955584 (owner: 10Elukey) [13:02:59] (03CR) 10Jbond: [C: 03+2] puppetserver: drop the puppetserver hiera key and use puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955722 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [13:03:04] (03CR) 10Majavah: [C: 04-1] P:mediawiki::maintenance: CampaignEvents periodic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:04:01] (03PS2) 10Majavah: Enable edit check on en/fr beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955368 (https://phabricator.wikimedia.org/T345658) (owner: 10Esanders) [13:04:08] (03CR) 10Majavah: [C: 03+2] Enable edit check on en/fr beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955368 (https://phabricator.wikimedia.org/T345658) (owner: 10Esanders) [13:07:16] (03CR) 10Elukey: "This is great work! I left a note in the template, lemme know!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955582 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [13:07:19] (03Merged) 10jenkins-bot: Enable edit check on en/fr beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955368 (https://phabricator.wikimedia.org/T345658) (owner: 10Esanders) [13:08:03] the config patch should make its way to beta automatically within the next half an hour or so [13:08:26] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts atlas2001.wikimedia.org [13:08:51] taavi: Okay, I can check on it there in a bit. 
[13:09:43] (03PS1) 10Jbond: puppetdb: drop secondary site on old puppetdb's [puppet] - 10https://gerrit.wikimedia.org/r/955726 [13:10:33] (03CR) 10Jbond: [C: 03+2] puppetdb: drop secondary site on old puppetdb's [puppet] - 10https://gerrit.wikimedia.org/r/955726 (owner: 10Jbond) [13:12:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10Milimetric) Approved! I think maybe you also need analytics-admins as per [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analy... [13:12:20] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:13:34] (03PS3) 10Clément Goubert: P:mediawiki::periodic_job: Add splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) [13:14:44] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1001" [13:16:44] (03CR) 10Jbond: Add cookbook to configure router's BGP sessions to k8s hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [13:16:50] (03Merged) 10jenkins-bot: Edit check: Turn on when ecenable=1 is set [extensions/VisualEditor] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955059 (https://phabricator.wikimedia.org/T345297) (owner: 10DLynch) [13:16:51] (03CR) 10Effie Mouzeli: "Reading the conversation on the phabricator task, I understand that this fact will be mostly used for informational purposes, as well as a" [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [13:16:52] there we go [13:17:00] (03PS1) 10Elukey: services: change Lift Wing's proxy setting in the API Gateway [deployment-charts] - 
10https://gerrit.wikimedia.org/r/955728 [13:17:09] The manifold joys of waiting for CI. [13:17:13] !log taavi@deploy1002 Started scap: Backport for [[gerrit:955059|Edit check: Turn on when ecenable=1 is set (T345297)]] [13:17:16] T345297: Create a URL parameter to enable Edit Check - https://phabricator.wikimedia.org/T345297 [13:17:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1001" [13:17:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:17:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts atlas2001.wikimedia.org [13:18:27] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:18:40] !log taavi@deploy1002 taavi and kemayo: Backport for [[gerrit:955059|Edit check: Turn on when ecenable=1 is set (T345297)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:18:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:19:02] synced to test servers. is there anything to test? [13:19:03] (03PS2) 10Elukey: services: change Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955728 [13:19:11] (03PS3) 10FNegri: [openstack] Replace OS version in new manifests [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) [13:19:25] taavi: Yes -- just give me a second to do so. 
[13:19:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) >>! In T345633#9149475, @Milimetric wrote: > Approved! I think maybe you also need analytics-admins as per [[ https://wikitech.wiki... [13:19:56] (03PS3) 10Elukey: services: change Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955728 [13:20:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:12] taavi: Looks good. 👍🏻 [13:20:27] (03PS4) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) [13:20:28] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pki1002 - jclark@cumin1001" [13:20:35] !log taavi@deploy1002 taavi and kemayo: Continuing with sync [13:20:37] and syncing [13:20:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) [13:21:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pki1002 - jclark@cumin1001" [13:21:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:17] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pki1002 [13:21:52] (03CR) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:22:23] (03CR) 10Ilias 
Sarantopoulos: [C: 03+1] "Looking good, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955728 (owner: 10Elukey) [13:22:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pki1002 [13:22:25] (03PS5) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) [13:23:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pki1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:23:27] (03CR) 10FNegri: [openstack] Replace OS version in new manifests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [13:23:41] taavi: The beta config change has made its way there, and seems to also have had the desired effect. [13:24:06] great! [13:24:52] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43166/console" [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:26:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) >>! In T345633#9149475, @Milimetric wrote: > Approved! I think maybe you also need analytics-admins as per [[ https://wikitech.wiki... [13:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:43] 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10akosiaris) ICU67 images, built and pushed. 
[13:26:59] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955059|Edit check: Turn on when ecenable=1 is set (T345297)]] (duration: 09m 46s) [13:27:02] T345297: Create a URL parameter to enable Edit Check - https://phabricator.wikimedia.org/T345297 [13:27:09] the backport is now live too [13:27:20] (03CR) 10Btullis: [C: 03+1] "All approvals have now been collected and the patch looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) (owner: 10Vgutierrez) [13:27:39] (03PS3) 10Majavah: Set OATHAuth multiple devices READ_NEW for all fishbows, privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955670 (https://phabricator.wikimedia.org/T242031) [13:27:41] (03PS3) 10Majavah: Set OATHAuth multiple devices WRITE_BOTH for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955671 (https://phabricator.wikimedia.org/T242031) [13:27:53] (03CR) 10Majavah: [C: 03+2] Set OATHAuth multiple devices READ_NEW for all fishbows, privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955670 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:27:58] (03CR) 10Majavah: [C: 03+2] Set OATHAuth multiple devices WRITE_BOTH for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955671 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:28:17] (03PS1) 10Ayounsi: Ganeti: add sandbox vlan support [software/spicerack] - 10https://gerrit.wikimedia.org/r/955729 (https://phabricator.wikimedia.org/T307021) [13:28:37] (03Merged) 10jenkins-bot: Set OATHAuth multiple devices READ_NEW for all fishbows, privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955670 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:28:43] (03Merged) 10jenkins-bot: Set OATHAuth multiple devices WRITE_BOTH for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955671 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:28:53] RECOVERY - 
Host mc2040 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [13:29:05] taavi: thanks! [13:29:12] !log taavi@deploy1002 Started scap: Backport for [[gerrit:955670|Set OATHAuth multiple devices READ_NEW for all fishbows, privates (T242031)]], [[gerrit:955671|Set OATHAuth multiple devices WRITE_BOTH for wikitech (T242031)]] [13:29:16] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [13:29:53] Would it be possible to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/953998 and the other patch in the stack during this window? [13:30:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) I reseated DIMM_A2. I also found this ticket from January T326834. In that one B2 was having errors and I moved it to A2. Now that A2 is having iss... [13:30:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) a:05Papaul→03Jhancock.wm [13:30:35] (03PS1) 10Ayounsi: makevm: handle sandbox vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) [13:30:41] !log taavi@deploy1002 taavi: Backport for [[gerrit:955670|Set OATHAuth multiple devices READ_NEW for all fishbows, privates (T242031)]], [[gerrit:955671|Set OATHAuth multiple devices WRITE_BOTH for wikitech (T242031)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:30:49] (03PS1) 10Jbond: puppetmaster::servers: remove puppetserveres from puppetmaster::servers hash [puppet] - 10https://gerrit.wikimedia.org/r/955731 [13:31:23] kostajh: which branch was ReportIncident first branched on? 
[13:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:50] !log taavi@deploy1002 taavi: Continuing with sync [13:32:08] (03CR) 10CI reject: [V: 04-1] Ganeti: add sandbox vlan support [software/spicerack] - 10https://gerrit.wikimedia.org/r/955729 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:33:03] (03PS1) 10Kosta Harlan: [beta] ReportIncident: Enable on kowiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275) [13:33:05] wmf.25 [13:33:05] (03CR) 10CI reject: [V: 04-1] puppetmaster::servers: remove puppetserveres from puppetmaster::servers hash [puppet] - 10https://gerrit.wikimedia.org/r/955731 (owner: 10Jbond) [13:33:17] taavi: wmf.25. so if it gets rolled back, I guess we'd have issues [13:33:25] it could wait until Monday, there's no hurry. 
[13:33:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:34:18] (03PS2) 10Ayounsi: Ganeti: add sandbox vlan support [software/spicerack] - 10https://gerrit.wikimedia.org/r/955729 (https://phabricator.wikimedia.org/T307021) [13:34:22] yep, I would prefer to wait until Monday [13:35:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [13:35:24] (03PS1) 10Filippo Giunchedi: cxserver: enable mesh tracing in staging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/955734 (https://phabricator.wikimedia.org/T320563) [13:36:02] (03CR) 10Ayounsi: "Requires I536c83fb1961829780fcfec6dcaebfc2ad45106f" [cookbooks] - 10https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:36:33] (03PS1) 10Kosta Harlan: [beta] Enable ReportIncident for configured beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275) [13:36:37] taavi: sounds good [13:38:04] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955670|Set OATHAuth multiple devices READ_NEW for all fishbows, privates (T242031)]], [[gerrit:955671|Set OATHAuth multiple devices WRITE_BOTH for wikitech (T242031)]] (duration: 08m 52s) [13:38:07] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [13:38:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) [13:38:55] !log taavi@mwmaint1002 ~ $ mwscript extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php --wiki=labswiki | tee oathauth-multiple-labswiki.log # T242031 [13:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:33] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host 
an-worker1135.eqiad.wmnet with OS bullseye [13:40:47] !log trunk sandbox vlan to ganeti nodes in esams BY27 - T307021 [13:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:31] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/955729 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:43:02] (03CR) 10Volans: [C: 03+1] "LGTM, question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:43:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [13:44:40] (03CR) 10Kamila Součková: [C: 03+1] services: change Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955728 (owner: 10Elukey) [13:47:17] (03CR) 10Ayounsi: makevm: handle sandbox vlan (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:47:28] (03CR) 10Ayounsi: [C: 03+2] Ganeti: add sandbox vlan support [software/spicerack] - 10https://gerrit.wikimedia.org/r/955729 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:47:52] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:49:51] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) SR: 175477369 [13:50:14] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbstore100{8..9} - jclark@cumin1001" [13:50:57] (03PS2) 10Jbond: puppetmaster::servers: remove puppetserveres from puppetmaster::servers hash [puppet] - 10https://gerrit.wikimedia.org/r/955731 (https://phabricator.wikimedia.org/T330490) [13:50:58] 10ops-codfw: ManagementSSHDown - 
https://phabricator.wikimedia.org/T345798 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm symptom of T344110. idrac restarted. [13:50:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbstore100{8..9} - jclark@cumin1001" [13:50:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:31] (03Merged) 10jenkins-bot: Ganeti: add sandbox vlan support [software/spicerack] - 10https://gerrit.wikimedia.org/r/955729 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [13:51:35] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbstore1008 [13:51:49] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbstore1009 [13:51:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbstore1008 [13:51:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbstore1009 [13:52:12] (03PS1) 10Bking: rdf-streaming-updater: change swift/s3 username [deployment-charts] - 10https://gerrit.wikimedia.org/r/955738 (https://phabricator.wikimedia.org/T345765) [13:52:49] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbstore1008.mgmt.eqiad.wmnet with reboot policy FORCED [13:52:50] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbstore1009.mgmt.eqiad.wmnet with reboot policy FORCED [13:53:24] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: change swift/s3 username [deployment-charts] - 10https://gerrit.wikimedia.org/r/955738 (https://phabricator.wikimedia.org/T345765) (owner: 10Bking) [13:53:39] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1135.eqiad.wmnet with reason: host reimage [13:53:39] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: change swift/s3 username 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955738 (https://phabricator.wikimedia.org/T345765) (owner: 10Bking) [13:54:25] (03Merged) 10jenkins-bot: rdf-streaming-updater: change swift/s3 username [deployment-charts] - 10https://gerrit.wikimedia.org/r/955738 (https://phabricator.wikimedia.org/T345765) (owner: 10Bking) [13:56:05] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1136.eqiad.wmnet with OS bullseye [13:56:13] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1135.eqiad.wmnet with reason: host reimage [13:57:47] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:58:12] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:58:17] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:58:29] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:58:42] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:58:58] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:59:35] (03CR) 10Effie Mouzeli: [C: 03+1] P:mediawiki::periodic_job: Add splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [14:02:04] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:03:18] (03CR) 10Elukey: [C: 03+2] services: change Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955728 (owner: 10Elukey) [14:03:26] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:02] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: host reimage [14:10:03] jouncebot: next [14:10:03] In 1 hour(s) and 49 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1600) [14:10:44] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet [14:10:45] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [14:10:51] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [14:10:55] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:51] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [14:12:13] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [14:13:08] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: host reimage [14:14:53] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [14:15:09] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [14:16:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [14:16:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet [14:19:03] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1135.eqiad.wmnet with OS bullseye [14:20:55] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:21:02] (03CR) 10Vgutierrez: "all good now 😊" [puppet] - 10https://gerrit.wikimedia.org/r/955294 (https://phabricator.wikimedia.org/T345633) (owner: 10Vgutierrez) [14:21:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:11] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:23:32] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:23:36] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:23:57] (03PS1) 10JMeybohm: Update to kubernetes client-go 0.23.14 [software/heptiolabs/eventrouter] (v0.4-wmf) - 10https://gerrit.wikimedia.org/r/955748 (https://phabricator.wikimedia.org/T329826) [14:24:20] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:24:39] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:26:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbstore1008.mgmt.eqiad.wmnet with reboot policy FORCED [14:26:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
dbstore1009.mgmt.eqiad.wmnet with reboot policy FORCED [14:27:22] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:27:35] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [14:28:02] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [14:29:25] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Fabfur) [14:29:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jclark-ctr) [14:30:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [14:30:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [14:30:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [14:30:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [14:30:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [14:30:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['pki1002.eqiad.wmnet'] [14:31:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jclark-ctr) [14:31:26] (03PS1) 10JMeybohm: eventrouter: Bump k8s client-go to v0.23.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/955750 (https://phabricator.wikimedia.org/T329826) [14:37:46] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
an-worker1136.eqiad.wmnet with OS bullseye [14:37:55] (03PS2) 10JMeybohm: eventrouter: Bump k8s client-go to v0.23.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/955750 (https://phabricator.wikimedia.org/T329826) [14:38:23] (03PS1) 10JMeybohm: admin_ng: Deploy eventrouter 0.4.0-1 to wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/955753 (https://phabricator.wikimedia.org/T329826) [14:38:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) 05Stalled→03In progress [14:40:11] (03PS1) 10Ilias Sarantopoulos: services: changes in Lift Wing's proxy setting in the API Gateway for damaging model [deployment-charts] - 10https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) [14:40:33] (03CR) 10Elukey: [C: 03+1] "Everything looks sound, there is no way I am able to review this so I trust Janis' process :)" [software/heptiolabs/eventrouter] (v0.4-wmf) - 10https://gerrit.wikimedia.org/r/955748 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:41:43] (03CR) 10Elukey: [C: 03+1] "Modulo local testing working with docker-pkg :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/955750 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:41:58] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [14:42:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet [14:42:36] (03CR) 10Elukey: [C: 03+1] "Optional nit - maybe a reference of the task number in the yaml files could help if people wonder why staging is different." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955753 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:43:03] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update to kubernetes client-go 0.23.14 [software/heptiolabs/eventrouter] (v0.4-wmf) - 10https://gerrit.wikimedia.org/r/955748 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:43:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (10bking) Is there a way to send the puppet run output to the cookbook logs on cumin? I assume that if `install-console` can login, there's... [14:43:38] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] eventrouter: Bump k8s client-go to v0.23.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/955750 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:46:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet [14:46:29] (03PS1) 10JMeybohm: eventrouter: Fix typo in template [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/955758 [14:46:42] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] eventrouter: Fix typo in template [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/955758 (owner: 10JMeybohm) [14:46:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) [14:47:12] (03PS3) 10Elukey: Add SLO definition for the ORES Legacy service [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) [14:47:15] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) a:03Fabfur [14:47:40] (03CR) 10Vgutierrez: [C: 03+2] admin: Add brouberol user [puppet] - 10https://gerrit.wikimedia.org/r/955294 
(https://phabricator.wikimedia.org/T345633) (owner: 10Vgutierrez) [14:48:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [14:53:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [14:54:03] (03CR) 10Nikki: Add Akan language (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [14:54:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [14:56:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10Vgutierrez) 05In progress→03Resolved `vgutierrez@mwmaint1002:~$ sudo -i ldapsearch -x cn=ops |grep bro member: uid=brouberol,ou=people,dc... [14:59:27] 10SRE-tools, 10Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (10Volans) Sure why not we can add an attempted alert, but of course would be based on some hostname matching, not super reliable. Fee... [15:00:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10brouberol) Thank you! QQ: when I try to ssh onto a server, it's asking for an ssh password: ` $ ssh -v stat1005.eqiad.wmnet ... debug1: Exe... 
[15:00:30] (03CR) 10Elukey: [C: 03+1] services: more changes in Lift Wing's proxy setting in the API Gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/955754 (https://phabricator.wikimedia.org/T345850) (owner: 10Ilias Sarantopoulos) [15:02:29] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10Volans) That's indeed something we might want to look at going forward. The only blocker I see right now is that most of the "prod... [15:03:35] (03PS4) 10Elukey: Add SLO definition for the ORES Legacy service [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) [15:04:23] (03PS3) 10Jbond: puppetmaster::servers: remove puppetserveres from puppetmaster::servers hash [puppet] - 10https://gerrit.wikimedia.org/r/955731 (https://phabricator.wikimedia.org/T330490) [15:07:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) Hi @WDoranWMF, could you please confirm the approval for this ticket? 
[15:09:49] (03CR) 10JMeybohm: [C: 03+1] cxserver: enable mesh tracing in staging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/955734 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [15:09:55] (03PS5) 10Elukey: Add SLO definition for the ORES Legacy service [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) [15:09:57] (03PS1) 10Elukey: slo_definitions: fix indentation and add missing descr for Lift Wing [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955769 [15:10:16] (03CR) 10Filippo Giunchedi: [C: 03+2] cxserver: enable mesh tracing in staging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/955734 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [15:11:06] (03PS4) 10Jbond: puppetmaster::servers: remove puppetservers from puppetmaster::servers hash [puppet] - 10https://gerrit.wikimedia.org/r/955731 (https://phabricator.wikimedia.org/T330490) [15:11:47] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:11:50] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:12:24] (03CR) 10Elukey: "grr preview: https://grafana.wikimedia.org/dashboard/snapshot/xgkkRZkTyP5WdpSYxc2LfF850IDizkJV" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [15:12:28] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Deploy eventrouter 0.4.0-1 to wikikube staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/955753 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:12:51] (03CR) 10Elukey: "Checked with `grr preview` on grafana1002 and all checks out!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955769 (owner: 10Elukey) [15:13:21] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:13:31] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:15:56] (03CR) 10Muehlenhoff: [C: 03+2] Switch sretest1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/955292 (owner: 10Muehlenhoff) [15:16:19] (03Merged) 10jenkins-bot: admin_ng: Deploy eventrouter 0.4.0-1 to wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/955753 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:17:18] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) @Fabfur one of the disks is being odd. Is it safe for me to shut down the server at reseat internal components right now? [15:17:40] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:18:40] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:18:57] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Fabfur) Hi @Jhancock.wm, think the best way is to ask someone from the service operations first [15:19:33] (03PS1) 10Alexandros Kosiaris: icu67: Setup shellbox-icu67 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955772 (https://phabricator.wikimedia.org/T329491) [15:19:46] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:20:08] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:23:18] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10akosiaris) Yes, it is safe, we haven't put yet those in production. 
[15:23:24] (03PS4) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [15:23:57] (03CR) 10CI reject: [V: 04-1] vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:24:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) 05Open→03Stalled p:05Triage→03Medium [15:28:57] (03PS5) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [15:28:58] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:32:30] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lists1004 [15:33:47] (03CR) 10Herron: [C: 03+1] "good catch LGTM!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955769 (owner: 10Elukey) [15:33:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lists1004 [15:36:27] (03PS1) 10Muehlenhoff: Adapt transition code for ferm -> nftables [puppet] - 10https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) [15:36:52] (03CR) 10CI reject: [V: 04-1] Adapt transition code for ferm -> nftables [puppet] - 10https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:36:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:39:28] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) True, `sre.hosts.reimage` is not likely to work anytime soon. The only `sre.*` cookbooks that I think we can easily run fr... 
[15:39:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10brouberol) I'm going to assume that the answer to that was "yes", as I was able to ssh 40 minutes later. Thanks again! [15:40:35] (03CR) 10Herron: [C: 03+1] "LGTM from cursory check!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [15:41:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [15:41:16] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet [15:45:43] (03PS1) 10Andrea Denisse: superset: Move superset logs to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) [15:47:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet [15:47:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [15:47:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:49:10] (03PS2) 10Muehlenhoff: Adapt transition code for ferm -> nftables [puppet] - 10https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) [15:49:40] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lists1004.eqiad.wmnet'] [15:52:03] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10Volans) Regarding keeping only the wmcs cookbooks in the conifg that's ok for me if it's ok for the WMCS team. At least for now, s... 
[15:53:17] (03PS2) 10Daimona Eaytoy: beta: Remove unneeded campaignevents-beta-tester user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940400 (https://phabricator.wikimedia.org/T342452) [15:53:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add cookbook to configure router's BGP sessions to k8s hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [15:55:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10WDoranWMF) Approved [15:56:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:56:56] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) Ok, I will create a patch! [15:58:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lists1004.eqiad.wmnet'] [16:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:11] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10colewhite) Ran into this today trying to `pip install wikimedia-spicerack` (Python 3.11). Worked around it with `pip install "pyyaml<5"... 
[16:00:43] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10Papaul) We will be using to test the new codfw spine/leaf new design contint2001 and thumbor2004. contint2001 will be rename to sretest... [16:01:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10Jclark-ctr) [16:06:59] (03PS1) 10FNegri: Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)" [puppet] - 10https://gerrit.wikimedia.org/r/955786 [16:07:09] (03PS1) 10Muehlenhoff: Use a single ensure for managing the nftables state [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) [16:07:25] (03PS2) 10Muehlenhoff: Use a single ensure for managing the nftables state [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) [16:08:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:09:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) 05Stalled→03In progress [16:09:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/955774 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:10:25] (03CR) 10Andrew Bogott: [C: 03+1] Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)" [puppet] - 10https://gerrit.wikimedia.org/r/955786 (owner: 10FNegri) [16:10:48] (03CR) 10FNegri: [C: 03+2] Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)" [puppet] - 10https://gerrit.wikimedia.org/r/955786 (owner: 10FNegri) [16:10:59] 10ops-codfw: Decommission furud - https://phabricator.wikimedia.org/T345867 (10Papaul) a:03MoritzMuehlenhoff [16:12:01] 10ops-codfw: Decommission furud - 
https://phabricator.wikimedia.org/T345867 (10Papaul) Hey @MoritzMuehlenhoff hey here is the decom task. once done. you can just assign it back to me. Thanks [16:13:34] (03PS6) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [16:16:08] (03PS3) 10Muehlenhoff: Use a single ensure for managing the nftables state [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) [16:16:12] (03CR) 10Andrew Bogott: [C: 03+1] "It's nice to take a break from making all these :)" [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [16:17:59] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/955776/43173/" [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [16:19:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:19:52] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) @colewhite from your comment on on the [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/955717 | elastic search restart... [16:19:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) [16:22:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) Hi @BTullis / @odimitrijevic, I think your approval is needed to as you are listed in the `data.yaml` for that specific group... 
[16:22:24] (03CR) 10Muehlenhoff: "(The PCC error can be ignored)" [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:24:21] (03CR) 10Muehlenhoff: "This was part of the original patch to apply ensure to the firewall, but then in thje more limited reapply after the revert we missed to r" [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:25:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and ops for brouberol - https://phabricator.wikimedia.org/T345633 (10BTullis) >>! In T345633#9150265, @brouberol wrote: > I'm going to assume that the answer to that was "yes", as I was able to ssh 40 minutes later. Thanks again! Y... [16:27:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10BTullis) Approved :+1: [16:27:49] + [16:28:01] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) a:03Jhancock.wm [16:29:51] (03PS1) 10FNegri: [openstack] Add Zed manifests for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/955806 (https://phabricator.wikimedia.org/T345810) [16:30:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) [16:31:19] (03CR) 10Andrew Bogott: [C: 03+1] [openstack] Add Zed manifests for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/955806 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [16:32:06] (03PS2) 10FNegri: [openstack] Add Zed manifests for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/955806 (https://phabricator.wikimedia.org/T345810) [16:33:02] (03CR) 10FNegri: [C: 03+2] [openstack] Add Zed manifests for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/955806 (https://phabricator.wikimedia.org/T345810) (owner: 
10FNegri) [16:33:32] (03CR) 10FNegri: [C: 03+2] [openstack] Replace OS version in new manifests [puppet] - 10https://gerrit.wikimedia.org/r/955720 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [16:35:04] (03PS1) 10Fabfur: admin: added phuedx user to analytics-admin [puppet] - 10https://gerrit.wikimedia.org/r/955807 (https://phabricator.wikimedia.org/T345696) [16:35:08] (03PS1) 10Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955808 (https://phabricator.wikimedia.org/T345791) [16:35:33] (03CR) 10CI reject: [V: 04-1] webperf: Move navtiming logs to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955808 (https://phabricator.wikimedia.org/T345791) (owner: 10Andrea Denisse) [16:36:14] (03CR) 10Btullis: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/955807 (https://phabricator.wikimedia.org/T345696) (owner: 10Fabfur) [16:38:04] (03CR) 10Fabfur: [C: 03+2] admin: added phuedx user to analytics-admin [puppet] - 10https://gerrit.wikimedia.org/r/955807 (https://phabricator.wikimedia.org/T345696) (owner: 10Fabfur) [16:41:48] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10cmooney) Thanks @Papaul ! [16:45:03] !log running moveToExternal on all wikis [16:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:55] (03CR) 10Btullis: [C: 03+2] Bump mediawiki_history_snapshot to 2023-08 [puppet] - 10https://gerrit.wikimedia.org/r/955389 (owner: 10Clare Ming) [16:48:56] (03PS1) 10Andrew Bogott: Revert "Revert "Horizon sudo panel: use ldaps for the ldap uri"" [puppet] - 10https://gerrit.wikimedia.org/r/955811 [16:51:09] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10colewhite) >>! 
In T345337#9150415, @jbond wrote: > @colewhite from your comment on on the [[ https://gerrit.wikimedia.org/r/c/operations/... [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T1700) [17:13:25] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "Horizon sudo panel: use ldaps for the ldap uri"" [puppet] - 10https://gerrit.wikimedia.org/r/955811 (owner: 10Andrew Bogott) [17:19:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) I think you should have access now, please let me know if it's not the case and I'll investigate further! [17:19:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for phuedx - https://phabricator.wikimedia.org/T345696 (10Fabfur) 05In progress→03Stalled [17:33:24] "why early content of bgwiktionary have been exclusively stored in text table and not ES" and other questions I will never get answer of [17:33:58] (03PS1) 10Milimetric: Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) [17:40:50] (03PS1) 10Kosta Harlan: ReportIncident: Set default help page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382) [17:42:37] (03CR) 10Urbanecm: [C: 04-1] Enable PageNotice on enwiktionary beta (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [17:43:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [17:43:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [17:43:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T343198)', diff saved to 
https://phabricator.wikimedia.org/P52303 and previous config saved to /var/cache/conftool/dbconfig/20230907-174351-arnaudb.json [17:43:54] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [17:46:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T343198)', diff saved to https://phabricator.wikimedia.org/P52304 and previous config saved to /var/cache/conftool/dbconfig/20230907-174613-arnaudb.json [17:49:54] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [17:51:39] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:52:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [18:01:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P52305 and previous config saved to /var/cache/conftool/dbconfig/20230907-180120-arnaudb.json [18:05:12] (03PS1) 10Jdlrobson: Fix settings button not working on reference previews [extensions/Popups] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955791 (https://phabricator.wikimedia.org/T345829) [18:11:13] (03PS5) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 [18:12:08] 10SRE, 10SRE-Access-Requests: Requesting access to deployment rights for acooper - https://phabricator.wikimedia.org/T345877 (10sbassett) [18:12:15] (03PS1) 10Jdlrobson: Preserve Gadget prefs when they can't be enabled [extensions/Gadgets] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955792 (https://phabricator.wikimedia.org/T341421) [18:12:35] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - 
https://phabricator.wikimedia.org/T345877 (10sbassett) [18:13:13] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43175/console" [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [18:16:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P52306 and previous config saved to /var/cache/conftool/dbconfig/20230907-181626-arnaudb.json [18:17:02] (03CR) 10Ebernhardson: [V: 03+1] "PCC looks reasonable to me, this should be ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [18:22:15] (03PS1) 10Eevans: thanos: remove wdqs:flink user [puppet] - 10https://gerrit.wikimedia.org/r/955825 (https://phabricator.wikimedia.org/T345765) [18:27:31] (03CR) 10Gehel: "This change is ready for review." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [18:29:25] (03PS1) 10Eevans: Remove mock password for wdqs_flink [labs/private] - 10https://gerrit.wikimedia.org/r/955831 (https://phabricator.wikimedia.org/T345765) [18:29:49] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/955825 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [18:31:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T343198)', diff saved to https://phabricator.wikimedia.org/P52307 and previous config saved to /var/cache/conftool/dbconfig/20230907-183132-arnaudb.json [18:31:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [18:31:37] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:31:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2123.codfw.wmnet with reason: Maintenance [18:31:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T343198)', diff saved to https://phabricator.wikimedia.org/P52308 and previous config saved to /var/cache/conftool/dbconfig/20230907-183153-arnaudb.json [18:36:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:37:02] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:37:05] (03PS1) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/955832 (https://phabricator.wikimedia.org/T342361) [18:38:04] ^Something to be worried about? [18:38:10] Only seeing maintenance on dbs [18:42:16] (03CR) 10Gehel: [C: 03+2] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/955832 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [18:42:34] (03CR) 10Bking: [C: 03+1] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/955832 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:42] (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on 
wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:10] noise on wdqs1010 is me, sorry, will disable alerting [18:49:58] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: T342361 [18:50:01] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 [18:50:23] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: T342361 [18:51:28] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:28] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:53:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) kubernets1037 - B 8. U 17. port 14 CableID 1796 kubernets1038 - B 8. U 21. port 19 CableID 1801 kubernets1039 - B 8. U 22. port 15 CableID 1797 kubernets1040 - B... 
[19:00:22] (03PS1) 10Gehel: Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/955793 [19:05:37] (03CR) 10Bking: [C: 03+1] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/955793 (owner: 10Gehel) [19:05:58] (03CR) 10Gehel: [C: 03+2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/955793 (owner: 10Gehel) [19:06:29] (03PS1) 10Cwhite: aptrepo: amend pin to allow grafana 9.4.x [puppet] - 10https://gerrit.wikimedia.org/r/955014 (https://phabricator.wikimedia.org/T345362) [19:11:02] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:22] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:13:35] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS bullseye [19:21:27] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [19:24:06] (03CR) 10Bking: [C: 03+1] Remove mock password for wdqs_flink [labs/private] - 10https://gerrit.wikimedia.org/r/955831 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:24:09] (03CR) 10Bking: [C: 03+1] thanos: remove wdqs:flink user [puppet] - 10https://gerrit.wikimedia.org/r/955825 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:30:08] (03CR) 10Eevans: [C: 03+2] thanos: remove wdqs:flink user [puppet] - 10https://gerrit.wikimedia.org/r/955825 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:33:01] !log eevans@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling 
restart_daemons on A:thanos-fe [19:37:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [19:37:47] (03CR) 10Eevans: [C: 03+2] Remove mock password for wdqs_flink [labs/private] - 10https://gerrit.wikimedia.org/r/955831 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:37:52] (03CR) 10Eevans: [V: 03+2 C: 03+2] Remove mock password for wdqs_flink [labs/private] - 10https://gerrit.wikimedia.org/r/955831 (https://phabricator.wikimedia.org/T345765) (owner: 10Eevans) [19:38:50] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [19:39:35] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:39:41] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:54] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [19:43:40] (03CR) 10Herron: [C: 03+1] aptrepo: amend pin to allow grafana 9.4.x [puppet] - 10https://gerrit.wikimedia.org/r/955014 (https://phabricator.wikimedia.org/T345362) (owner: 10Cwhite) [19:55:42] (03CR) 10Bking: [C: 03+1] "OK, the zk hosts are up now. It looks like the info from this global.pp is rendered into /etc/helmfile-defaults/${env}.yaml . Excited to " [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [19:56:10] (03CR) 10Bking: [C: 03+2] Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [20:00:06] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230907T2000). [20:00:06] danisztls and Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:43] PROBLEM - Host mw2444 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:52] (I'm here as well, sorry, forgot to add my username before my patch :P) [20:02:21] o/ [20:02:23] o/ I can deploy [20:03:07] (03PS2) 10DDesouza: Undeploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955388 (https://phabricator.wikimedia.org/T345158) [20:03:14] (03PS4) 10DDesouza: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) [20:04:42] o/ [20:04:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955388 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza) [20:05:16] (03CR) 10Herron: "had a quick look" [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [20:05:27] Jdlrobson: around for your backports? 
[20:05:35] (03Merged) 10jenkins-bot: Undeploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955388 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza) [20:05:53] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:955388|Undeploy Campaigns Event Discovery survey (T345158)]] [20:06:01] T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158 [20:07:22] !log thcipriani@deploy1002 thcipriani and dani: Backport for [[gerrit:955388|Undeploy Campaigns Event Discovery survey (T345158)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:07:33] I will have to patch 954724 due to previous patch merge [20:08:17] danisztls: your undeploy campaign events patch should be live on testwikis, lemme know if it looks good to you [20:09:12] and, re:954724, yeah, merge conflicts not uncommon in IS.php :) [20:09:31] (03PS5) 10DDesouza: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) [20:10:19] thcipriani: it does, thanks [20:10:36] danisztls: ok, going live [20:10:42] (03PS1) 10Herron: profile::prometheus::statsd_exporter: add support for empty mappings [puppet] - 10https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) [20:10:56] thcipriani: yeah is just that one patch undeploys and the other deploys something else, so remove line and readd same line :) [20:11:12] (03PS7) 10Herron: profile::mediawiki::common: include prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) [20:11:18] !log thcipriani@deploy1002 thcipriani and dani: Continuing with sync [20:11:30] (03PS8) 10Herron: profile::mediawiki::common: include 
prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) [20:12:45] (03CR) 10Herron: "also rebasing on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/955838/1 should help to address lookup failures for profile::" [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [20:14:23] (03CR) 10Herron: "splitting this off from the mw statsd_exporter patch since it should be useful elsewhere too" [puppet] - 10https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [20:16:41] hrm, looks like mw2444 is having some trouble [20:17:03] or, at least, deployment is having trouble connecting to it to deploy new code [20:17:07] ssh connection timeout [20:17:56] don't see anything in the sal, is ^ a known? [20:20:33] hey sorry im late [20:20:39] thcipriani: ^ [20:20:52] did i miss the window? [20:21:29] Jdlrobson: nope you're in time [20:21:51] phew [20:22:07] (although got a host timing out so syncing is slower than normal :( ) [20:23:04] (03CR) 10Thcipriani: [C: 03+2] Preserve Gadget prefs when they can't be enabled [extensions/Gadgets] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955792 (https://phabricator.wikimedia.org/T341421) (owner: 10Jdlrobson) [20:23:30] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS bullseye [20:23:51] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:955388|Undeploy Campaigns Event Discovery survey (T345158)]] (duration: 17m 58s) [20:23:54] T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158 [20:24:03] Jdlrobson: this is the backport for the 2nd patch, right? 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/955791 [20:24:10] thcipriani: correct [20:24:19] (03CR) 10Thcipriani: [C: 03+2] Fix settings button not working on reference previews [extensions/Popups] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955791 (https://phabricator.wikimedia.org/T345829) (owner: 10Jdlrobson) [20:24:23] cool, thanks :) [20:24:53] danisztls: your first change is live, how's your second config change conflict fixup going? [20:25:15] (03CR) 10Thcipriani: [C: 03+2] beta: Remove unneeded campaignevents-beta-tester user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940400 (https://phabricator.wikimedia.org/T342452) (owner: 10Daimona Eaytoy) [20:25:56] (03Merged) 10jenkins-bot: beta: Remove unneeded campaignevents-beta-tester user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940400 (https://phabricator.wikimedia.org/T342452) (owner: 10Daimona Eaytoy) [20:26:02] thcipriani: already rebased [20:26:34] Daimona: merged yours, since it's a beta-only it should be live in beta in 10 mins. It'll go to the production machines with the next sync, but probably no reason to hang around to test. [20:26:58] Yup, I'll just test in beta when it gets there [20:27:01] Thank you! [20:27:25] perfect, thank you [20:27:39] thcipriani: If the extra wait time is bothersome, I suggest `ps uxwww` and greeping for the ssh process and killing it. 
[20:27:44] *grepping [20:28:09] danisztls: perfect, I'll get that one out while waiting on zuul for the others [20:28:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [20:29:03] dancy: good call [20:29:31] (03Merged) 10jenkins-bot: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [20:30:06] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:954724|Pre-deploy Reader Demographics 2 pilot survey (T344393)]] [20:30:11] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [20:31:34] !log thcipriani@deploy1002 dani and thcipriani: Backport for [[gerrit:954724|Pre-deploy Reader Demographics 2 pilot survey (T344393)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:32:11] thcipriani: this one will be impossible to test [20:32:39] danisztls: k. I'll check for explosions and continue with sync then :) [20:32:52] thcipriani: thanks! [20:33:34] !log thcipriani@deploy1002 dani and thcipriani: Continuing with sync [20:33:40] nothing explody :) [20:35:59] Well, my patch doesn't work :| But it doesn't break anything, it's just a no-op. 
I'll work on it again [20:36:56] (03Merged) 10jenkins-bot: Preserve Gadget prefs when they can't be enabled [extensions/Gadgets] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955792 (https://phabricator.wikimedia.org/T341421) (owner: 10Jdlrobson) [20:36:59] (03Merged) 10jenkins-bot: Fix settings button not working on reference previews [extensions/Popups] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/955791 (https://phabricator.wikimedia.org/T345829) (owner: 10Jdlrobson) [20:37:06] Oh no wait, it actually works [20:37:15] Is Special:ListGroupRights cached or what?! [20:37:28] everything in mw is cached [20:37:42] (very useful response, I know) [20:41:06] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:954724|Pre-deploy Reader Demographics 2 pilot survey (T344393)]] (duration: 10m 59s) [20:41:10] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [20:41:32] ^ danisztls your 2nd patch should be live [20:41:59] mw2444 seems maybe back? We'll see with this next one. [20:42:01] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [20:42:30] Jdlrobson: ready? Fine for both of these to go at the same time? [20:43:18] Daimona: IS/CS is generally too early to be overriding extension-provided groups, considering that the extension hasn't actually been loaded yet. not sure what's the state-of-the-art for doing that, at some point that was a $wgExtensionFunction [20:43:26] permission loading stuff is hacky and terrible [20:43:47] thcipriani: are they on debug1001? 
[20:43:50] they can go out together no problem [20:44:04] Jdlrobson: not yet, I'll get them started now [20:44:07] great [20:44:08] Amir1: yeah, not surprising :) [20:45:22] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:955792|Preserve Gadget prefs when they can't be enabled (T341421)]], [[gerrit:955791|Fix settings button not working on reference previews (T345829)]] [20:45:27] T341421: Changing mobile gadget preferences disables gadgets that don't support Minerva - https://phabricator.wikimedia.org/T341421 [20:45:27] T345829: Regression: Setting icon in ReferencePreview popups does nothing - https://phabricator.wikimedia.org/T345829 [20:45:29] taavi: this particular group is defined in IS (I think) for meta, it's just that beta happens to inherit it. I still had a 50/50 chance of getting the fix wrong, of course :D [20:46:11] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:46:20] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:46:48] !log thcipriani@deploy1002 jdlrobson and thcipriani: Backport for [[gerrit:955792|Preserve Gadget prefs when they can't be enabled (T341421)]], [[gerrit:955791|Fix settings button not working on reference previews (T345829)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:46:57] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:47:06] Jdlrobson: now both are on mwdebug servers, check please [20:48:22] looking [20:49:08] !log taavi@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2444.codfw.wmnet [20:49:55] thcipriani: both LGTM please sync away [20:50:02] * thcipriani does [20:50:08] !log thcipriani@deploy1002 jdlrobson and thcipriani: Continuing with sync [20:50:28] (03PS1) 10Andrew Bogott: Galera: allow installing debian-hosted packages for Bookworm or later 
[puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) [20:51:29] (03CR) 10Majavah: [C: 04-1] Galera: allow installing debian-hosted packages for Bookworm or later (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/955841 (https://phabricator.wikimedia.org/T302482) (owner: 10Andrew Bogott) [20:52:46] 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite) [20:56:34] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:955792|Preserve Gadget prefs when they can't be enabled (T341421)]], [[gerrit:955791|Fix settings button not working on reference previews (T345829)]] (duration: 11m 12s) [20:56:40] T341421: Changing mobile gadget preferences disables gadgets that don't support Minerva - https://phabricator.wikimedia.org/T341421 [20:56:40] T345829: Regression: Setting icon in ReferencePreview popups does nothing - https://phabricator.wikimedia.org/T345829 [20:56:44] Jdlrobson: ^ should be live now [20:56:52] 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite) [20:58:12] thcipriani: thanks! 
[20:58:42] yw, thanks for the patch checks :) [20:59:46] ACKNOWLEDGEMENT - SSH on mw2444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds cole_white T345884 https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:59:46] ACKNOWLEDGEMENT - Host mw2444 is DOWN: PING CRITICAL - Packet loss = 100% cole_white T345884 [21:01:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T343198)', diff saved to https://phabricator.wikimedia.org/P52309 and previous config saved to /var/cache/conftool/dbconfig/20230907-210122-arnaudb.json [21:01:26] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:02:33] 10ops-codfw, 10serviceops-radar: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite) [21:16:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P52310 and previous config saved to /var/cache/conftool/dbconfig/20230907-211628-arnaudb.json [21:31:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P52311 and previous config saved to /var/cache/conftool/dbconfig/20230907-213134-arnaudb.json [21:42:52] (03CR) 10Bking: "Adding ServiceOps teammates since we're making changes to base." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [21:46:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T343198)', diff saved to https://phabricator.wikimedia.org/P52312 and previous config saved to /var/cache/conftool/dbconfig/20230907-214640-arnaudb.json [21:46:44] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [21:46:46] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:46:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [21:46:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:47:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:47:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T343198)', diff saved to https://phabricator.wikimedia.org/P52313 and previous config saved to /var/cache/conftool/dbconfig/20230907-214717-arnaudb.json [21:51:17] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 884 MB (0% inode=93%): /tmp 884 MB (0% inode=93%): /var/tmp 884 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [21:55:11] Amir1: your moveToExternal run is filling up mwmaint1002 disk completely [21:55:37] what is it putting on disk? [21:56:02] 32G enwikivoyage.undo.sql [21:56:08] (03PS1) 10Cwhite: Add StatsLib settings for Test env [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) [21:57:02] 'undo' sounds important.. 
but / is completely full now [21:57:02] (03PS2) 10Cwhite: Add StatsLib settings for Test env [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) [21:57:12] He's inbound (pinged him OOB) [21:58:09] sigh [21:58:11] let me see [21:58:16] hi, thanks [21:58:45] for what it's worth, that's basically back up of the content moved from core tables to eS [21:58:47] *ES [21:58:58] in case things go wrong, I need it somewhere [22:05:36] now it's at 96% [22:11:23] (03CR) 10Jon Harald Søby: Add Akan language (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [22:24:07] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:29:16] !log installing scap v4.59.0 [22:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:12] !log jhuneidi@deploy1002 Installing scap version "4.59.0" for 595 hosts [22:39:11] now the usage is at 63% [22:44:26] !log jhuneidi@deploy1002 Installing scap version "4.59.0" for 594 hosts [22:45:26] !log jhuneidi@deploy1002 Installation of scap version "4.59.0" completed for 594 hosts [22:52:31] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [23:01:51] I'm re-running the script again, now removed 65GB from mwmaint, I hope it doesn't get filled that fast [23:05:35] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:15:39] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:22:51] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:25:12] (03PS2) 10Srishakatux: Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) [23:27:09] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:30:09] (03PS3) 10Srishakatux: Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) [23:32:21] (03CR) 10Srishakatux: Add Akan language (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [23:43:21] (03CR) 10Jon Harald Søby: [C: 03+1] Add Akan language (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [23:47:27] (03CR) 10RLazarus: [C: 03+1] icu67: Setup shellbox-icu67 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955772 (https://phabricator.wikimedia.org/T329491) (owner: 10Alexandros Kosiaris)