[00:00:50] <mutante>	 unlocked: *quadruple revert achievement*
[00:01:32] <mutante>	 nothing to hand-over, no alerts. going afk
[00:06:23] <wikibugs>	 (03PS4) 10Ladsgroup: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800)
[00:07:15] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+1] "Tested as many ways as possible in mwdebug, I'm about to go to sleep, otherwise I would have deployed it now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup)
[00:23:27] <wikibugs>	 (03CR) 10Tim Starling: "I have no idea why you did this." [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon)
[00:30:31] <TimStarling>	 I mean, if you're going to revert my changes, it seems like it would be courteous to at least add me as a reviewer or add a comment to the change you're reverting
[00:35:20] <wikibugs>	 (03PS1) 10Legoktm: Remove possibly significant whitespace from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764
[00:39:26] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905550
[00:39:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905550 (owner: 10TrainBranchBot)
[00:40:53] <wikibugs>	 (03PS2) 10Legoktm: Remove possibly significant whitespace from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038)
[00:51:58] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) There was no isolation or resolution of root causes, so we can expect the issue to recur peri...
[00:58:04] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905550 (owner: 10TrainBranchBot)
[01:03:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10ssingh) Apologies for the wrong commits attached to this; those were for T333456.  @Milimetric @JAllemandou, sorry for the ping but per the above comment, this nee...
[01:08:12] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] "LGTM, we'll see if it does anything useful :)" [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans)
[01:15:30] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:38] <icinga-wm_>	 PROBLEM - PHP opcache health on mw2430 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:27:36] <wikibugs>	 (03PS1) 10Ssingh: admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482)
[01:34:03] <wikibugs>	 (03PS2) 10Ssingh: admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482)
[01:41:01] <wikibugs>	 (03PS1) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681)
[01:41:39] <wikibugs>	 (03PS2) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681)
[01:50:00] <wikibugs>	 (03PS3) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681)
[01:51:42] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack Designate: role back codfw1dev change to default policies [puppet] - 10https://gerrit.wikimedia.org/r/905770 (https://phabricator.wikimedia.org/T330759)
[01:55:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Designate: role back codfw1dev change to default policies [puppet] - 10https://gerrit.wikimedia.org/r/905770 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[01:56:18] <wikibugs>	 (03PS4) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:44] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh)
[02:57:23] <wikibugs>	 (03PS6) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix)
[03:05:39] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) I would love to see the HTTP error response body. FileOperation logs show 502 errors, but the...
[04:12:16] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "LGTM. Deploy any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) (owner: 10Legoktm)
[04:17:22] <TimStarling>	 !log restarted swift-proxy on ms-fe* T328872
[04:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:17:27] <stashbot>	 T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872
[05:08:17] <wikibugs>	 (03PS3) 10KartikMistry: Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952
[05:22:12] <icinga-wm_>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:53:42] <icinga-wm_>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:54:30] <icinga-wm_>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T0600)
[06:04:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh)
[06:06:31] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10MoritzMuehlenhoff) >>! In T331899#8731595, @taavi wrote: >> To be able to access deplyed Wiki instances and ensure that wikibase (namely wikibas...
[06:17:50] <icinga-wm_>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:40] <icinga-wm_>	 RECOVERY - PHP opcache health on mw2430 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:26:14] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (10ayounsi)
[06:39:07] <wikibugs>	 (03PS1) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905893 (https://phabricator.wikimedia.org/T333961)
[06:39:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905893 (https://phabricator.wikimedia.org/T333961) (owner: 10Marostegui)
[06:41:16] <icinga-wm_>	 RECOVERY - MariaDB Replica SQL: es4 on es1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:41:17] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) Restarting the proxy servers temporarily fixed it again. The restart caused a doubling of the...
[06:41:22] <icinga-wm_>	 RECOVERY - MariaDB read only es4 on es1022 is OK: Version 10.6.12-MariaDB-log, Uptime 42s, read_only: True, event_scheduler: True, 58.07 QPS, connection latency: 0.004521s, query latency: 0.000601s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[06:41:26] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: es4 on es1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:41:40] <icinga-wm_>	 RECOVERY - mysqld processes on es1022 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:50:22] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: es4 on es1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:56:39] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952 (owner: 10KartikMistry)
[06:57:28] <wikibugs>	 (03Merged) 10jenkins-bot: Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952 (owner: 10KartikMistry)
[06:58:34] <kart_>	 I accidentally merged the change instead of using scap backport a few minutes back :/
[06:59:57] <taavi>	 kart_: scap backport can operate with manually merged changes just fine
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:35] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Persistence, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui)
[07:00:55] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui)
[07:01:35] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/905895 (https://phabricator.wikimedia.org/T333377)
[07:01:57] <kart_>	 taavi: good to know :)
[07:01:57] <kart_>	 It's time!
[07:02:18] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede)
[07:02:24] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+1] Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede)
[07:02:26] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede)
[07:03:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/905895 (https://phabricator.wikimedia.org/T333377) (owner: 10Marostegui)
[07:03:45] <marostegui>	 !log Failover m3-master T333377
[07:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:50] <stashbot>	 T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[07:03:58] <wikibugs>	 (03CR) 10Slyngshede: "Looks good, that will allow me to re-enable the log shipping to logstash." [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond)
[07:04:02] <logmsgbot>	 !log kartik@deploy2002 Started scap: Backport for [[gerrit:904952|Remove akwiki from CX config]]
[07:04:09] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond)
[07:04:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33
[07:04:33] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33
[07:05:21] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:904952|Remove akwiki from CX config]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:08:22] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui)
[07:08:34] <wikibugs>	 (03PS2) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676)
[07:08:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede)
[07:10:13] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) @jcrespo could you double check the backup-related hosts? Thanks!
[07:10:45] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/905927 (https://phabricator.wikimedia.org/T333377)
[07:11:19] <marostegui>	 !log Failover m5-master T333377
[07:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:23] <stashbot>	 T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[07:11:25] <logmsgbot>	 !log kartik@deploy2002 Finished scap: Backport for [[gerrit:904952|Remove akwiki from CX config]] (duration: 07m 22s)
[07:11:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:11:43] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/905927 (https://phabricator.wikimedia.org/T333377) (owner: 10Marostegui)
[07:12:36] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui)
[07:13:14] <wikibugs>	 (03PS3) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676)
[07:13:22] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) m3-master and m5-master have been failed over.
[07:15:33] <kart_>	 I saw errors in scap backport.
[07:15:57] <kart_>	 `07:08:06 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2300.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw2289.codfw.wmnet', 'mw1486.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1398.eqiad.wmnet', 'mw2259.codfw.wmnet', 'mw1366.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1404.eqiad.wmnet'] (ran as mwdeploy@deploy1002.eqiad.wmnet) returned [1]: Aborting: Scap is disabled on this 
[07:15:58] <kart_>	 host. If you really need to run Scap here, you can override by passing "-Dblock_execution:False" to the call`
[07:16:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:16:57] <wikibugs>	 (03PS3) 10Elukey: Upgrade kafka-main to use PKI TLS certificates for brokers [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T319372)
[07:17:35] <kart_>	 scap error I got: https://pastebin.com/YWxNsMRJ @Amir1 @urbanecm @taavi 
[07:18:27] <wikibugs>	 (03PS1) 10Marostegui: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905928 (https://phabricator.wikimedia.org/T331381)
[07:18:54] <elukey>	 kart_: o/ did you deploy from deploy1002?
[07:18:56] <urbanecm>	 kart_: afaik we're still on codfw? Are you on the correct host?
[07:19:01] <elukey>	 if so please use 2002
[07:19:48] <urbanecm>	 ^
[07:19:49] <elukey>	 kart_: "Aborting: Scap is disabled on this host."
[07:20:27] <marostegui>	 !log Stop mariadb on db1101 T331381
[07:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:32] <stashbot>	 T331381: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381
[07:22:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905928 (https://phabricator.wikimedia.org/T331381) (owner: 10Marostegui)
[07:24:10] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1006.eqiad.wmnet with OS bullseye
[07:24:15] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye
[07:26:02] <kart_>	 elukey: no. Used 2002.
[07:27:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond)
[07:27:28] <kart_>	 elukey: `kartik@deploy2002:~$ scap backport 904952`
[07:29:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:29:39] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db1104 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/905930 (https://phabricator.wikimedia.org/T329481)
[07:29:44] <kart_>	 I use ssh to deployment.codfw.wmnet - which automatically points to current dc. Is that changed? :/
[07:30:06] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1104 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/905930 (https://phabricator.wikimedia.org/T329481) (owner: 10Marostegui)
[07:30:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye
[07:30:24] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye
[07:30:43] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10akosiaris) Yes, we 'll have to depool codfw.
[07:31:01] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) >>! In T334049#8757732, @Marostegui wrote: > @ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ?  That's my under...
[07:31:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1104 from dbctl T329481', diff saved to https://phabricator.wikimedia.org/P46035 and previous config saved to /var/cache/conftool/dbconfig/20230405-073102-marostegui.json
[07:31:07] <stashbot>	 T329481: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481
[07:31:29] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10MoritzMuehlenhoff)
[07:31:54] <urbanecm>	 kart_: deployment.codfw.wmnet  should work fine
[07:32:07] <urbanecm>	 I'm confused as to `07:08:34 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild (ran as mwdeploy@deploy1002.eqiad.wmnet) returned [1]: Aborting: Scap is disabled on this host. If you really need to run Scap here, you can override by passing "-Dblock_execution:False" to the call `
[07:32:25] <urbanecm>	 maybe scap deploys _to_ deploy1002 and fails to, because scap's disabled there?
[07:33:22] <urbanecm>	 sounds plausible, as scap pull complains as well
[07:34:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:34:23] <hashar>	 good morning
[07:35:03] <urbanecm>	 good morning hashar. any idea what to do with the above mentioned problem? :-)
[07:35:31] <hashar>	 this job is never ending, I haven't drunk my coffee yet :D
[07:35:59] <hashar>	 the primary deployment server is `deploy2002.codfw.wmnet` for sure
[07:36:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[07:36:28] <urbanecm>	 hashar: apologies, and feel free to drink it before you help :))
[07:36:32] <hashar>	 scap has a `sync-master` step which rsyncs to the other deployment server(s), which include deploy1002.eqiad.wmnet
[07:36:50] <urbanecm>	 and should it continue to do so even when deploy1002's not primary?
[07:36:52] <elukey>	 kart_: didn't mean to upset you, from the logs it seemed as if you were deploying from 1002, apologies
[07:36:57] <hashar>	 that is, well, to keep the spare deployment server up to date in case we need to switch over or the primary magically disappears
[07:37:27] <hashar>	 one should not be able to deploy from the spare deploy1002.eqiad.wmnet
[07:37:41] <hashar>	 I can't remember how that is prevented, but a global lock sounds likely
[07:37:42] <icinga-wm_>	 RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:57] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff)
[07:38:03] <hashar>	 upon connecting to the spare deploy1002.eqiad.wmnet, a message of the day should show up in the prompt stating "DO NOT USE THIS SERVER"
[07:38:16] <kart_>	 elukey: ah, no issue :) 
[07:38:45] <hashar>	 so if you then deploy from the primary deploy2002.codfw.wmnet , I would expect it to be able to sync to the spare deploy1002
[07:39:07] <taavi>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/0a5edcbbb78b72e749c411192ea4c8e6912dde4c ("scap: block Scap execution on inactive deployment hosts") was committed last night
[07:39:17] <hashar>	 and whatever got done in /srv/deployment or /srv/mediawiki-staging on deploy1002 will be erased/restored to the state of the primary deploy2002
[07:40:01] <urbanecm>	 taavi: sounds like the cause to me. not sure if we should revert that patch or remove the lock temporarily. 
[07:41:31] <elukey>	 a revert seems probably the best thing, it is blocking deployments
[07:42:01] <wikibugs>	 (03PS1) 10Elukey: Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741
[07:42:45] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "jnuche: that broke scap cause we do a few scap operations on the spare deployment server for MediaWiki deployment notably `scap pull` or `" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (owner: 10Elukey)
[07:42:49] <hashar>	 elukey: +1 ed :)
[07:42:53] <hashar>	 well
[07:42:57] <hashar>	 should tag T330756
[07:42:57] <stashbot>	 T330756: Improve behavior around global Scap lock + communicate changes - https://phabricator.wikimedia.org/T330756
[07:43:17] <wikibugs>	 (03PS2) 10Hashar: Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey)
[07:43:19] <hashar>	 amended
[07:43:26] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey)
[07:43:32] <wikibugs>	 (03CR) 10Hashar: "Amended to attach this change to T330756" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey)
[07:43:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, 10Patch-For-Review: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) 05Open→03Resolved a:03ayounsi This has been rolled to all k8s clusters.
[07:44:23] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40543/console" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey)
[07:45:22] <elukey>	 from the pcc it seems super safe, https://puppet-compiler.wmflabs.org/output/905741/40543/
[07:45:33] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db1104 [puppet] - 10https://gerrit.wikimedia.org/r/905932 (https://phabricator.wikimedia.org/T329481)
[07:45:36] <elukey>	 merging, thanks for the reviews
[07:45:42] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jcrespo)
[07:45:55] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey)
[07:46:02] <hashar>	 in /etc/scap/scap.cfg that should remove the block_execution setting yeah
[07:46:14] <hashar>	 I don't know what the default is
[07:46:33] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jcrespo) >>! In T333377#8757686, @Marostegui wrote: > @jcrespo could you double check the backup-related hosts? Thanks!  Documented- minor to no disruption.
[07:46:35] <hashar>	 taavi: I still don't get how you manage to find the root cause commits so fast :]
[07:46:44] <hashar>	 scap/config.py:    "block_execution": (bool, False),
[07:46:44] <hashar>	 tests/scap/test_cli.py:    cmd.config = {"block_execution": False}
[07:46:45] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[07:47:00] <elukey>	 taavi++
[07:47:09] <urbanecm>	 ^^
[07:47:12] <hashar>	 so I guess no blocking by default :-]   jnuche will be able to follow up
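Editor's sketch of the point hashar makes above: once the `block_execution` line is gone from /etc/scap/scap.cfg, the value falls back to the declared default of `False`, so nothing is blocked. Only the `"block_execution": (bool, False)` entry and the `-Dblock_execution:False` override syntax come from the log; the function names, the merge order, and the accepted boolean spellings are assumptions for illustration and are not scap's actual implementation.

```python
# Hypothetical sketch, NOT scap's real code. Only the table entry below and the
# "key:value" override syntax are taken from the log; everything else is assumed.
DEFAULT_CONFIG = {
    "block_execution": (bool, False),  # (declared type, default value)
}


def coerce(value_type, raw):
    """Convert a raw string setting to its declared type."""
    if value_type is bool:
        lowered = raw.strip().lower()
        if lowered in ("true", "1", "yes"):
            return True
        if lowered in ("false", "0", "no"):
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    return value_type(raw)


def load_config(file_values=None, overrides=()):
    """Merge defaults, then config-file values, then 'key:value' overrides."""
    # Start from the declared defaults.
    config = {key: default for key, (_type, default) in DEFAULT_CONFIG.items()}
    # Values from the config file replace the defaults.
    for key, raw in (file_values or {}).items():
        config[key] = coerce(DEFAULT_CONFIG[key][0], raw)
    # Command-line style overrides (assumed highest priority) are applied last.
    for item in overrides:
        key, _, raw = item.partition(":")
        config[key] = coerce(DEFAULT_CONFIG[key][0], raw)
    return config
```

With no file value the setting stays at its default; with `block_execution` set to `True` in the file it blocks, unless an override such as `block_execution:False` is passed.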
[07:47:28] <elukey>	 ok running puppet on deploy1002 and 2002, kart_ gimme 2 mins and then you can retry
[07:47:35] <taavi>	 hashar: I just tend to lurk in this channel and have a good memory :-P so I saw the commit yesterday evening and the error message today and connected the dots
[07:47:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1104.eqiad.wmnet
[07:47:40] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jcrespo)
[07:49:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[07:49:49] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab: Fix listen_https typo [puppet] - 10https://gerrit.wikimedia.org/r/905653 (owner: 10BCornwall)
[07:50:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "That's a great cleanup! Happy to merge this myself, but would like to sort out a time when someone from the WMCS SREs is around just in ca" [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah)
[07:50:16] * urbanecm recorded the issue + error message at T330756
[07:50:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1104 [puppet] - 10https://gerrit.wikimedia.org/r/905932 (https://phabricator.wikimedia.org/T329481) (owner: 10Marostegui)
[07:51:21] <elukey>	 kart_: green light
[07:52:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[07:54:03] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::kubeadm: checker is a toolforge-specific feature [puppet] - 10https://gerrit.wikimedia.org/r/905933
[07:54:05] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499)
[07:54:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1104.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:54:42] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kafka-test1006.eqiad.wmnet with OS bullseye
[07:54:47] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye executed with errors: - kafka-test1006 (**FAIL**)   - Downtimed...
[07:56:06] <icinga-wm_>	 RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 117.45 ms
[07:56:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1006.eqiad.wmnet with OS bullseye
[07:56:30] <icinga-wm_>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:31] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye
[07:57:38] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40544/console" [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah)
[07:57:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Upgrade kafka-main to use PKI TLS certificates for brokers [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey)
[07:59:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1104.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:59:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:59:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1104.eqiad.wmnet
[07:59:09] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] Provision the revised Swift dashboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon)
[07:59:45] <wikibugs>	 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10Marostegui) This is ready for DC-Ops
[08:00:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) 05Open→03Stalled Marking it as stalled until the cookbook is reviewed/merged.
[08:00:04] <jouncebot>	 hashar and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T0800).
[08:00:13] <wikibugs>	 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10Marostegui) a:05Marostegui→03None
[08:00:16] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10Marostegui)
[08:01:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10ayounsi) a:03cmooney
[08:02:01] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui)
[08:02:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1002.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:02:52] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1002.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:02:57] <kart_>	 elukey: Oh, I was a bit away. Do I need to run backport again?
[08:03:33] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) @jcrespo kindly check backup servers needs. Thanks
[08:04:16] <elukey>	 kart_: ah ok maybe not, but others will probably have better/more info
[08:04:27] <kart_>	 hashar: ^
[08:04:33] <elukey>	 (I am fairly ignorant about scap)
[08:05:06] <kart_>	 Change seems deployed in akwiki, so it should be good IMHO.
[08:06:30] <elukey>	 ack super
[08:06:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1006.eqiad.wmnet with reason: host reimage
[08:07:11] <elukey>	 !log restart kafka on kafka-main1002 to pick up the new TLS certificate (PKI based) - T319372
[08:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:14] <stashbot>	 T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[08:09:57] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1006.eqiad.wmnet with reason: host reimage
[08:11:13] <wikibugs>	 (03PS26) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[08:11:29] <hashar>	 kart_: I have no idea :-]
[08:11:50] <hashar>	 I guess we can check both deployment servers and a random mw app server to verify
[08:13:39] <hashar>	 deploy1002 still has akwiki => true
[08:13:45] <hashar>	 grep -A6 wgContentTranslationAsBetaFeature /srv/mediawiki-staging/wmf-config/InitialiseSettings.php 
[08:13:49] <icinga-wm_>	 RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[08:14:38] <hashar>	 same on mw1473 (randomly picked host)
[08:14:42] <kart_>	 :/
[08:14:57] <hashar>	 and that is the same for deploy1002  ( in /srv/mediawiki )
[08:15:02] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10dcausse) >>! In T330693#8756120, @Ottomata wrote: > Generally implementers wo...
[08:15:02] <hashar>	 so I guess we need to redeploy it
[08:15:09] <hashar>	 `scap sync-file` should do it.
[08:15:09] <wikibugs>	 (03CR) 10David Caro: "The PCC looks weird no? there's less nodes now:" [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah)
[08:15:47] <hashar>	 even mwdebug1001 still has akwiki
[08:16:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Marostegui) Thank you Papaul, they look good!
[08:16:24] <hashar>	 kart_: want me to do the sync ?
[08:17:37] <kart_>	 hashar: go ahead. I guess the patch itself does not have the desired effect, but I'll follow up on that.
[08:17:52] <kart_>	 We need to disable CX on closed Wikis.
[08:18:13] <hashar>	 +1
[08:18:13] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah)
[08:19:09] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10jcrespo)
[08:19:16] <hashar>	 sorry for the mess kart_ !
[08:19:27] <hashar>	 I guess changes to scap config should require a verification
[08:20:17] <elukey>	 +100
[08:22:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 T326669', diff saved to https://phabricator.wikimedia.org/P46036 and previous config saved to /var/cache/conftool/dbconfig/20230405-082240-root.json
[08:22:45] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:22:51] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905933 (owner: 10Majavah)
[08:22:53] <kart_>	 hashar: no issue :)
[08:23:03] <elukey>	 hashar: I'd argue that changes in general would require a verification :D
[08:23:15] <wikibugs>	 (03PS1) 10KartikMistry: Disable ContentTranslation for Closed Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905935
[08:23:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Disable ContentTranslation for Closed Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905935 (owner: 10KartikMistry)
[08:25:06] <logmsgbot>	 !log hashar@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Remove akwiki from CX config (take 2, it was not fully deployed due to a scap lock issue on the spare server) (duration: 06m 06s)
[08:25:21] <wikibugs>	 (03CR) 10Majavah: P:wmcs::kubeadm: checker is a toolforge-specific feature (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905933 (owner: 10Majavah)
[08:25:42] <wikibugs>	 (03PS1) 10Slyngshede: partman: allow partitions to take up the whole disk on no-swap. [puppet] - 10https://gerrit.wikimedia.org/r/905936
[08:26:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:27:12] <wikibugs>	 (03CR) 10Slyngshede: "Follow up patch for addressing comments made on the merged patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/905160" [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede)
[08:27:16] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1006.eqiad.wmnet with OS bullseye
[08:27:21] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye completed: - kafka-test1006 (**PASS**)   - Removed from Puppet a...
[08:28:08] <wikibugs>	 (03CR) 10David Caro: ceph: Allow setting a crush location hook for the rack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[08:28:18] <wikibugs>	 (03Abandoned) 10KartikMistry: Disable ContentTranslation for Closed Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905935 (owner: 10KartikMistry)
[08:28:48] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye
[08:28:55] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye completed: - ms-be2067 (**PASS**)   - Downtim...
[08:31:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: add rsyslog-namespaced fields to syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/904597 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite)
[08:32:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) https://www.juniper.net/documentation/us/en/software/junos/system-mgmt-monitoring/topics/ref/statement/enhanced-hash-key-e...
[08:39:04] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1003.eqiad.wmnet,service=thanos-web
[08:43:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) ...it took about 10 minutes for sdx to start producing errors in the kernel log: ` Apr  5 08:21:22 ms-be2067 kernel: [   22.166159] Process accounting resumed Apr...
[08:43:45] <hashar>	 I am going to check the logs a bit then do group1
[08:45:03] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905938 (https://phabricator.wikimedia.org/T330209)
[08:45:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905938 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot)
[08:45:47] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905938 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot)
[08:46:58] <jnuche>	 hashar: back from the doctor, sorry about the issue with the inactive deployment server
[08:47:01] <jnuche>	 thanks for the revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/905741/
[08:50:39] <icinga-wm_>	 PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[08:51:24] <jnuche>	 that caused a real mess, really sorry about it :(
[08:52:05] <icinga-wm_>	 RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Docker
[08:52:17] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.3  refs T330209
[08:52:21] <stashbot>	 T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209
[08:53:00] <hashar>	 jnuche: as I get it the spare deployment server had a scap.cfg with `block_execution: true`, which prevents scap 2 from deploying mediawiki because we run a `scap pull` and a `scap rebuild-cdbs` on the spare server (in order to populate /srv/mediawiki).
[08:53:35] <hashar>	 then I am not sure whether we need a full deploy of mediawiki on the deployment server, maybe that is needed to run mwscripts
[08:53:51] <hashar>	 anyway, that was an easy fix :-]
[08:55:12] <jnuche>	 hashar: yes, the flag replaced another blocking mechanism we had, but apparently the old mechanism still allowed scap to run in some cases
[08:55:19] <jnuche>	 also, I thought all the sync from primary to secondary master was done via rsync, apparently not
[08:55:25] <jnuche>	 so I need to revisit
[08:55:56] <jnuche>	 and by sheer coincidence the puppet change was finally merged last night and this morning I was at the doctor and not available
[08:56:04] <jnuche>	 apologies again for the mess :(
[08:56:49] <wikibugs>	 (03PS1) 10Filippo Giunchedi: network: add LVS ranges for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949)
[08:57:01] <hashar>	 I think the issue is that the config change was applied but not verified after deployment, or surely we would have caught it by running a `scap sync-file`
[08:57:29] <hashar>	 anyway no worries, it was an easy find (well thanks to t.aavi) and an easy revert (thanks e.lukey) :]
[08:58:04] <logmsgbot>	 !log hashar@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.3  refs T330209 (duration: 05m 46s)
[08:58:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Peachey88)
[08:58:08] <stashbot>	 T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209
[08:58:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Peachey88)
[08:59:15] <wikibugs>	 (03CR) 10Jaime Nuche: "Really sorry about this affecting the deployments. Thanks for the revert." [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey)
[09:00:20] <wikibugs>	 (03PS9) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[09:00:49] <wikibugs>	 (03PS10) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[09:01:27] <wikibugs>	 (03CR) 10Jcrespo: mediabackups: Add static console port for easier remote management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo)
[09:02:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. It would be great if you could confirm these updated recipes to be working as expected by reimaging two of the testvm* hosts (w" [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede)
[09:03:59] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo)
[09:05:25] <wikibugs>	 (03CR) 10Ayounsi: "lgtm! thx for completing it!" [puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) (owner: 10Filippo Giunchedi)
[09:05:29] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] network: add LVS ranges for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) (owner: 10Filippo Giunchedi)
[09:07:18] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[09:09:19] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] partman: allow partitions to take up the whole disk on no-swap. [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede)
[09:09:27] <wikibugs>	 (03PS1) 10Clément Goubert: linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060)
[09:11:27] <wikibugs>	 (03PS1) 10Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061)
[09:12:00] <wikibugs>	 (03CR) 10Jaime Nuche: "Apparently `/var/lock/scap-global-lock` allowed some Scap commands to run. In particular `scap pull` and `scap cdb-rebuild` still need to " [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[09:12:55] <wikibugs>	 (03PS1) 10Clément Goubert: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062)
[09:15:08] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[09:15:29] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[09:15:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: hashar: some more git aliases [puppet] - 10https://gerrit.wikimedia.org/r/905715 (owner: 10Hashar)
[09:15:45] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye
[09:16:40] <hashar>	 group1 wikis look fine
[09:17:00] <wikibugs>	 (03PS1) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064)
[09:18:14] <wikibugs>	 (03CR) 10Muehlenhoff: Add an in place Debian upgrade script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[09:19:55] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10MoritzMuehlenhoff) >>! In T331706#8755038, @jhathaway wrote: >>>! In T331706#8753210, @Ladsgroup wrote: >> I'll try to take a look at the grants (it's a bit unusual...
[09:21:01] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1125 to test-cluster master [puppet] - 10https://gerrit.wikimedia.org/r/905945
[09:22:20] <wikibugs>	 (03PS1) 10Ayounsi: cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508)
[09:22:41] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[09:22:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) (owner: 10Ayounsi)
[09:23:23] <wikibugs>	 (03CR) 10Jbond: exim: fix hard-coded vrts hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[09:23:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1125 to test-cluster master [puppet] - 10https://gerrit.wikimedia.org/r/905945 (owner: 10Marostegui)
[09:25:23] <wikibugs>	 (03PS1) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[09:26:30] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[09:26:54] <wikibugs>	 (03CR) 10Ayounsi: "Also adds `enhanced-hash-key` on drmrs switches for consistency with the routers." [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) (owner: 10Ayounsi)
[09:28:14] <wikibugs>	 (03PS2) 10Ayounsi: cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508)
[09:29:23] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/905553 (https://phabricator.wikimedia.org/T334067)
[09:29:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T334067
[09:29:50] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[09:29:54] <stashbot>	 T334067: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T334067
[09:30:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/905637 (owner: 10Jbond)
[09:30:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1007.eqiad.wmnet
[09:30:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Also add component/pybal for pybaltest hosts [puppet] - 10https://gerrit.wikimedia.org/r/905543 (owner: 10Muehlenhoff)
[09:30:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T334067
[09:31:52] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update Java images to OpenJDK 11.0.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905592 (owner: 10Muehlenhoff)
[09:31:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1162 with weight 0 T334067', diff saved to https://phabricator.wikimedia.org/P46038 and previous config saved to /var/cache/conftool/dbconfig/20230405-093155-marostegui.json
[09:32:03] <wikibugs>	 (03CR) 10David Caro: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah)
[09:32:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/905553 (https://phabricator.wikimedia.org/T334067) (owner: 10Gerrit maintenance bot)
[09:33:31] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499)
[09:33:57] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff)
[09:34:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1003.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:34:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1003.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:34:22] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff)
[09:34:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1007.eqiad.wmnet
[09:35:31] <elukey>	 !log restart kafka on kafka-main1003 to pick up the new TLS certificate (PKI based) - T319372
[09:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:34] <stashbot>	 T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[09:35:39] <wikibugs>	 (03PS3) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499)
[09:36:23] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1007.eqiad.wmnet with OS bullseye
[09:36:27] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1007.eqiad.wmnet with OS bullseye
[09:36:37] <wikibugs>	 (03CR) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah)
[09:42:22] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks okay to me; I can help with testing it in #wikimedia-operations if you like." [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[09:42:37] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye
[09:43:41] <claime>	 Lucas_WMDE: 
[09:43:49] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey)
[09:43:50] <claime>	 I'd like that :)
[09:43:54] <Lucas_WMDE>	 ok :)
[09:44:10] <claime>	 It's very low traffic so if we can trigger requests to the backend, I'll take it :D
[09:44:31] <claime>	 (sorry for the no-message ping, ssh had a hiccup :P)
[09:44:36] <Lucas_WMDE>	 my main issue is that I don’t remember which… uh, slot? idk – is targeted by test.wikidata.org
[09:44:44] <Lucas_WMDE>	 we have staging/eqiad/codfw
[09:44:52] <Lucas_WMDE>	 and then I think there’s another thing with two options?
[09:45:03] <Lucas_WMDE>	 and test.wikidata.org goes to some combination of them but I don’t remember which one
[09:45:57] <Lucas_WMDE>	 (and www.wikidata.org presumably goes to the least staging/test-y one ^^)
[09:46:24] <claime>	 Ah yes wait
[09:46:36] <claime>	 There's another reference to the mw api that isn't through envoy
[09:46:39] <claime>	   10   │     WIKIBASE_REPO_HOSTNAME_ALIAS: api-ro.discovery.wmnet
[09:47:24] <claime>	 And there seem to be three values files: staging, test, and plain values.yaml
[09:48:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:48:18] <claime>	 So in the staging environment there are two releases deployed, test and staging
[09:48:39] <claime>	 And then in eqiad and codfw, just a production release
[09:48:40] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah)
[09:48:47] <Lucas_WMDE>	 ok
[09:49:06] <Lucas_WMDE>	 so eqiad and codfw are probably for www.wikidata.org, depending on which dc is active
[09:49:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 (owner: 10Jbond)
[09:49:14] <claime>	 So I'll change the WIKIBASE_REPO_HOSTNAME_ALIAS too maybe ?
[09:49:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond)
[09:49:22] <claime>	 (that's for test)
[09:49:33] <Lucas_WMDE>	 and testwikidatawiki goes to http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox if I found the right setting in IS.php
[09:49:45] <Lucas_WMDE>	 let me see
[09:50:06] <Lucas_WMDE>	 yeah probably change that too
[09:50:30] <claime>	 production release has this calling itself I think
[09:50:32] <Lucas_WMDE>	 I assume that means “connect to DNS api-ro.discovery but send HTTP Host: test.wikidata.org”
[09:50:32] <claime>	     WIKIBASE_REPO: http://www.wikidata.org:6500/w
[09:50:34] <claime>	     WIKIBASE_REPO_HOSTNAME_ALIAS: localhost
[09:50:51] <Lucas_WMDE>	 yeah, that’s some proxy running on the same system I think
[09:51:00] <claime>	 Yeah
[09:51:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] url_downloader: switch squid logs to hourly rotation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond)
[09:51:38] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1007.eqiad.wmnet with reason: host reimage
[09:51:42] <wikibugs>	 (03CR) 10Hashar: gerrit: replace Icinga with Prometheus monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[09:51:43] <claime>	 We'll change the test values and deploy staging and test first and see
[09:52:20] <wikibugs>	 (03PS2) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064)
[09:52:40] <Lucas_WMDE>	 sounds good
[09:52:44] <claime>	 back in a minute
[09:52:46] <Lucas_WMDE>	 I know how to trigger requests at least
[09:52:59] <Lucas_WMDE>	 and if you say the request volume is low, I assume you can also see that the requests were triggered successfully
[09:54:38] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1007.eqiad.wmnet with reason: host reimage
[09:55:38] <marostegui>	 !log Starting s2 eqiad failover from db1122 to db1162 - T334067
[09:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:42] <stashbot>	 T334067: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T334067
[09:56:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1162 to s2 primary T334067', diff saved to https://phabricator.wikimedia.org/P46039 and previous config saved to /var/cache/conftool/dbconfig/20230405-095600-root.json
[09:56:39] <claime>	 Lucas_WMDE: I'm basing myself on https://grafana.wikimedia.org/goto/AO_QnKYVk?orgId=1
[09:57:54] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye
[09:58:45] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[09:59:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122', diff saved to https://phabricator.wikimedia.org/P46040 and previous config saved to /var/cache/conftool/dbconfig/20230405-095954-marostegui.json
[10:00:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46041 and previous config saved to /var/cache/conftool/dbconfig/20230405-100003-root.json
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1000)
[10:00:55] <claime>	 Lucas_WMDE: So if I change to this https://grafana.wikimedia.org/goto/Gdyg4KLVk?orgId=1 I should see the requests go through to mw-api-int 
[10:01:13] <Lucas_WMDE>	 ok
[10:01:18] <Lucas_WMDE>	 sounds good
[10:01:39] <claime>	 Ok, merging and deploying staging and test then ?
[10:01:41] <Lucas_WMDE>	 it sounds like www and test wikidata might go through different paths to the API anyways?
[10:01:47] <Lucas_WMDE>	 yeah I think you can go ahead
[10:02:02] <Lucas_WMDE>	 even if real wikidata has unexpected issues that we don’t catch via test wikidata, it shouldn’t be a huge problem
[10:02:12] <wikibugs>	 (PS1) Sergio Gimeno: GrowthExperiments: enable add link frontend in 7th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/905950
[10:02:18] <Lucas_WMDE>	 jouncebot: now
[10:02:18] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1000)
[10:02:22] <claime>	 Yeah I'm a bit confused about the WIKIBASE_REPO setting
[10:02:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] openstack::nutcracker: Remove redis support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902074 (owner: Alexandros Kosiaris)
[10:02:58] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1207 [puppet] - https://gerrit.wikimedia.org/r/905951 (https://phabricator.wikimedia.org/T326669)
[10:03:03] <claime>	 I *think* it means it calls the local envoy proxy on port 6500
[10:03:33] <wikibugs>	 (PS2) Sergio Gimeno: GrowthExperiments: enable add link frontend in 7th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/905950
[10:03:36] <claime>	 Which means it'd call mw-api-int, because I used that same port for the new mw-api-int-asynclistener
[10:04:22] <wikibugs>	 (CR) Clément Goubert: [C: +2] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[10:05:37] <wikibugs>	 (CR) Volans: "Nice addition! I'll leave it to your team for the actual logic, I did a pass for general cookbook's related stuff." [cookbooks] - https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney)
[10:05:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] network: add LVS ranges for drmrs [puppet] - https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) (owner: Filippo Giunchedi)
[10:06:19] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1207 [puppet] - https://gerrit.wikimedia.org/r/905951 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[10:06:24] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[10:06:30] <Lucas_WMDE>	 ok
[10:07:32] <godog>	 akosiaris: thank you for the merge! FWIW I got a PCC running here as I wasn't sure about the exact implications https://puppet-compiler.wmflabs.org/output/905939/40546/
[10:07:33] <Lucas_WMDE>	 so it calls mw-api-int in mw-on-k8s, but the older api on non-k8s deployments, because they have different things running on port 6500?
[10:08:13] <godog>	 though should be safe AFAICS
[10:08:18] <wikibugs>	 (PS1) Marostegui: mariadb: db1207 remove from insetup [puppet] - https://gerrit.wikimedia.org/r/905952 (https://phabricator.wikimedia.org/T326669)
[10:08:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:08:45] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: db1207 remove from insetup [puppet] - https://gerrit.wikimedia.org/r/905952 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[10:09:07] <wikibugs>	 (PS1) Elukey: profile::kafka::broker: refactor TLS settings [puppet] - https://gerrit.wikimedia.org/r/905954
[10:09:10] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] thumbor: increase memory quota, per-container memory [deployment-charts] - https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan)
[10:09:13] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[10:09:31] <wikibugs>	 (PS1) Marostegui: Revert "Add new db nodes to site.pp" [puppet] - https://gerrit.wikimedia.org/r/905745
[10:09:41] <claime>	 Lucas_WMDE: It'll call the defined listener for port 6500 yes (which is what I changed in the values files) https://gerrit.wikimedia.org/r/c/operations/puppet/+/903595/
[10:09:44] <wikibugs>	 (PS2) Marostegui: Revert "Add new db nodes to site.pp" [puppet] - https://gerrit.wikimedia.org/r/905745
[10:10:05] <wikibugs>	 (Merged) jenkins-bot: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[10:10:11] <Lucas_WMDE>	 ok, nice
[10:10:16] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "Add new db nodes to site.pp" [puppet] - https://gerrit.wikimedia.org/r/905745 (owner: Marostegui)
[10:10:38] <godog>	 volans: FYI drmrs blackbox probes are now live as per T333949, in case they mis-page
[10:10:39] <stashbot>	 T333949: service::catalog probes are not deployed in drmrs - https://phabricator.wikimedia.org/T333949
[10:11:06] <volans>	 godog: ack
[10:11:27] <akosiaris>	 godog: prego. In reality, nothing. Just adding a couple more ferm macros (unused) and adding those nets to 2 used macros
[10:11:40] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1007.eqiad.wmnet with OS bullseye
[10:11:43] <wikibugs>	 SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1007.eqiad.wmnet with OS bullseye completed: - kafka-test1007 (**PASS**)   - Downtimed on Icinga/A...
[10:11:47] <akosiaris>	 there will be ferm restarts across the fleet, but that's going to be ok
[10:12:00] <akosiaris>	 at most we will discover something weird with the firewall of some host due to an alert.
[10:12:02] <claime>	 Hmm there's a big networkpolicy change that wasn't in CI
[10:12:06] <claime>	 I need an adult :p
[10:12:12] <akosiaris>	 there are none :P
[10:12:12] <godog>	 akosiaris: sweet! thank you that's informative
[10:12:21] <claime>	 No adults? :(
[10:12:26] <Lucas_WMDE>	 in the helmfile diff?
[10:12:31] <claime>	 Lucas_WMDE: yeah
[10:12:39] * Lucas_WMDE has also been confused by extra diffs in the past
[10:12:42] <elukey>	 !log restart purged on cp6015 to verify if connection to brokers failed are only temporary or not
[10:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:56] <akosiaris>	 claime: what's the diff? 
[10:13:20] <akosiaris>	 aka, how do I see it ? deploy2002 ; helmfile -e <what> diff --context=5 ? 
[10:13:25] <akosiaris>	 what is what here?
[10:14:06] <claime>	 akosiaris: deploy2002, cd /srv/deployment-charts/helmfile.d/services/termbox; helmfile -e staging -l name=staging diff --context=5
[10:14:41] <claime>	 Or I made a phaste https://phabricator.wikimedia.org/P46042
[10:14:44] <elukey>	 !log restart purged on cp5032, cp1082, cp6004, cp1090 - errors after restart of kafka main eqiad brokers
[10:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:49] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: increase memory quota, per-container memory [deployment-charts] - https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan)
[10:15:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46043 and previous config saved to /var/cache/conftool/dbconfig/20230405-101507-root.json
[10:15:26] <claime>	 There's also a chart version bump that apparently wasn't deployed
[10:16:30] <akosiaris>	 claime: that's where it comes from, I think
[10:16:30] <claime>	 IMO it's either the removal of default-network-policy-conf.yaml, or the upgrade to mesh 1.1 that wasn't deployed
[10:16:41] <akosiaris>	 it's the removal of default-network-policy
[10:16:46] <akosiaris>	 go ahead, I reviewed the diff
[10:16:49] <claime>	 ack
[10:16:49] <akosiaris>	 it's going to be fine
[10:17:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[10:17:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[10:18:07] <claime>	 Lucas_WMDE: test and staging releases updates
[10:18:11] <claime>	 updated*
[10:18:15] <Lucas_WMDE>	 ok
[10:19:09] <wikibugs>	 (CR) Majavah: [C: -1] kubernetes: set NO_HOME for bulidservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro)
[10:19:33] <wikibugs>	 (Merged) jenkins-bot: thumbor: increase memory quota, per-container memory [deployment-charts] - https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan)
[10:19:59] <wikibugs>	 (PS1) Elukey: istio: upgrade to upstream version 1.17.3 [docker-images/production-images] - https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068)
[10:20:03] <Lucas_WMDE>	 claime: hmm, without JS I don’t see a termbox at https://test.m.wikidata.org/wiki/Q229877
[10:20:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:20:17] <Lucas_WMDE>	 let’s see if I can find any errors in logstash…
[10:20:25] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye
[10:20:27] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/905554 (https://phabricator.wikimedia.org/T334077)
[10:20:29] <wikibugs>	 (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/istio/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) (owner: Elukey)
[10:22:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T326669', diff saved to https://phabricator.wikimedia.org/P46044 and previous config saved to /var/cache/conftool/dbconfig/20230405-102215-marostegui.json
[10:22:20] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[10:23:01] <Lucas_WMDE>	 claime: Failed to connect to termbox-test.staging.svc.eqiad.wmnet port 3031: Connection timed out
[10:23:08] <Lucas_WMDE>	 logstash _id eHnvUIcBtuN2AbPY_giz
[10:23:31] <claime>	 ok let me check the releases
[10:24:54] <wikibugs>	 (PS1) Marostegui: mariadb: Make db1179 candidate for x1 [puppet] - https://gerrit.wikimedia.org/r/905957
[10:24:59] <claime>	 Container is started and is supposed to be listening on 3031
[10:25:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:25:07] <akosiaris>	 curl http://termbox-test.staging.svc.eqiad.wmnet:3031
[10:25:07] <akosiaris>	 <!DOCTYPE html>
[10:25:28] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Make db1179 candidate for x1 [puppet] - https://gerrit.wikimedia.org/r/905957 (owner: Marostegui)
[10:26:00] <claime>	 Same from a random mediawiki server
[10:26:14] <Lucas_WMDE>	 hm, “cannot GET /” might just be because it’s not the right URL
[10:26:16] <Lucas_WMDE>	 let me dig up a proper one
[10:26:30] <claime>	 Yeah, but it means it is listening on 3031
[10:26:31] <wikibugs>	 (PS1) Slyngshede: partman: test updated flat-noswap.cfg [puppet] - https://gerrit.wikimedia.org/r/905958
[10:26:38] <Lucas_WMDE>	 yeah
[10:27:19] <wikibugs>	 (PS1) Elukey: Add upstream release 1.15.7 [debs/istioctl] - https://gerrit.wikimedia.org/r/905959 (https://phabricator.wikimedia.org/T334068)
[10:27:52] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm not tested" [puppet] - https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: JHathaway)
[10:27:57] <Lucas_WMDE>	 http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en gives me a 500 Internal Server Error
[10:28:16] <Lucas_WMDE>	 after 3.something seconds
[10:28:22] <wikibugs>	 (CR) Slyngshede: "Not really sure if this is the correct way to test the flat-noswap, no other hosts seems to  use the noswap only." [puppet] - https://gerrit.wikimedia.org/r/905958 (owner: Slyngshede)
[10:28:22] <Lucas_WMDE>	 I could imagine the timeout being configured shorter than that
[10:28:34] <claime>	 msg
[10:28:36] <claime>	  timeout of 3000ms exceeded
[10:28:43] <claime>	 (a bunch of them in termbox logstash)
[10:29:21] <Lucas_WMDE>	 so that’s the termbox service itself waiting for something else and timing out after 3 seconds?
[10:29:25] <Lucas_WMDE>	 (waiting for mediawiki, probably)
[10:29:32] <wikibugs>	 (Abandoned) Marostegui: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/905554 (https://phabricator.wikimedia.org/T334077) (owner: Gerrit maintenance bot)
[10:30:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46046 and previous config saved to /var/cache/conftool/dbconfig/20230405-103012-root.json
[10:30:20] <wikibugs>	 (CR) Daniel Kinzler: "Thank you for finding this, Lucas!" [mediawiki-config] - https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) (owner: Lucas Werkmeister (WMDE))
[10:31:05] <Lucas_WMDE>	 ok I found the logstash
[10:31:08] <Lucas_WMDE>	 not a lot of information in there :/
[10:31:18] <Lucas_WMDE>	 other than the message you posted
[10:32:35] <Lucas_WMDE>	 yeah pretty sure this is a timeout from termbox trying to talk to mediawiki
[10:34:42] <wikibugs>	 (PS3) Sergio Gimeno: GrowthExperiments: enable add link frontend and backend [mediawiki-config] - https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T304551)
[10:35:21] <claime>	 Lucas_WMDE: I'm trying to find out if it's failing in termbox, or in envoy
[10:37:24] <Lucas_WMDE>	 I tried to kubectl exec bash into the pod but apparently I’m not allowed. maybe that’s for the better ^^
[10:38:18] <claime>	 I can do that by going to the actual node, and docker exec in the container
[10:38:26] <Lucas_WMDE>	 heh
[10:38:41] <claime>	 But there's not much in terms of tools (which is normal, tbf)
[10:38:59] <Lucas_WMDE>	 ok
[10:39:56] <Lucas_WMDE>	 the pod has a HEALTHCHECK_QUERY in its environment, which looks similar to the URL I used above, but I don’t remember what that’s used for (it’s different from the k8s liveness and readiness probes, at least)
[10:40:32] <wikibugs>	 (CR) Muehlenhoff: partman: test updated flat-noswap.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905958 (owner: Slyngshede)
[10:40:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:40:53] <wikibugs>	 (PS1) Jcrespo: Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - https://gerrit.wikimedia.org/r/905966
[10:41:04] <wikibugs>	 (CR) CI reject: [V: -1] Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - https://gerrit.wikimedia.org/r/905966 (owner: Jcrespo)
[10:41:16] <wikibugs>	 (CR) Jbond: "lgtm small nit/q inline" [puppet] - https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: JHathaway)
[10:41:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:42:15] <claime>	 Lucas_WMDE: Hmm there's a big difference between the test deployment and the staging deployment
[10:42:29] <claime>	 staging has a tls-proxy (so envoy)
[10:42:32] <claime>	 test doesn't
[10:42:42] <wikibugs>	 (PS2) Jcrespo: Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - https://gerrit.wikimedia.org/r/905966
[10:42:50] <wikibugs>	 (PS2) Slyngshede: partman: test updated flat-noswap.cfg [puppet] - https://gerrit.wikimedia.org/r/905958
[10:43:02] <Lucas_WMDE>	 hm
[10:43:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:43:14] <claime>	 test has mesh_enabled: false
[10:43:18] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1130 to s5 master [puppet] - https://gerrit.wikimedia.org/r/894571 (https://phabricator.wikimedia.org/T331302) (owner: Gerrit maintenance bot)
[10:43:22] <Lucas_WMDE>	 but we’re still trying to talk to mw-api-int.discovery.wmnet over TLS?
[10:43:23] <wikibugs>	 (CR) Slyngshede: partman: test updated flat-noswap.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905958 (owner: Slyngshede)
[10:43:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T331302
[10:43:30] <stashbot>	 T331302: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T331302
[10:43:31] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:44:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T331302
[10:44:05] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/905958 (owner: Slyngshede)
[10:44:19] <wikibugs>	 (CR) Slyngshede: [C: +2] partman: test updated flat-noswap.cfg [puppet] - https://gerrit.wikimedia.org/r/905958 (owner: Slyngshede)
[10:44:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1130 with weight 0 T331302', diff saved to https://phabricator.wikimedia.org/P46047 and previous config saved to /var/cache/conftool/dbconfig/20230405-104422-marostegui.json
[10:44:41] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (Ladsgroup) Yeah, I was about to say from the application point of view, the more the better, like why not 400? But I don't know the limitations the infra so I can't...
[10:44:54] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1130 to s5 master [puppet] - https://gerrit.wikimedia.org/r/894571 (https://phabricator.wikimedia.org/T331302) (owner: Gerrit maintenance bot)
[10:45:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46048 and previous config saved to /var/cache/conftool/dbconfig/20230405-104517-root.json
[10:45:18] <wikibugs>	 (CR) Ladsgroup: [C: +1] "Thanks Jaime, Do you want me to deploy?" [puppet] - https://gerrit.wikimedia.org/r/905966 (owner: Jcrespo)
[10:46:56] <wikibugs>	 (PS2) Hnowlan: admin: update platform engineering approvers [puppet] - https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244)
[10:47:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46049 and previous config saved to /var/cache/conftool/dbconfig/20230405-104732-root.json
[10:47:55] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye
[10:48:32] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:48:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:49:17] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:50:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:50:32] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:50:39] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:50:56] <Lucas_WMDE>	 claime: I’ll probably be away or mostly unresponsive for the next two hours, sorry
[10:51:02] <claime>	 Lucas_WMDE: Do you have a way to test staging ?
[10:51:08] <claime>	 rather than test ?
[10:51:16] <Lucas_WMDE>	 not sure
[10:51:44] <Lucas_WMDE>	 hm, termbox-staging.staging.svc.eqiad.wmnet isn’t a real host apparently
[10:52:48] <wikibugs>	 (PS4) Sergio Gimeno: GrowthExperiments: enable add link frontend (7th) and backend (8,9th) [mediawiki-config] - https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T304551)
[10:52:51] <claime>	 Because the production deployment is actually using the tls proxy
[10:53:04] <claime>	 Anyways, I'll revert the change
[10:53:14] <claime>	 We'll see when you get back, or later, it's not urgent
[10:53:18] <Lucas_WMDE>	 ok
[10:53:28] <wikibugs>	 (CR) Jbond: partman: allow partitions to take up the whole disk on no-swap. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905936 (owner: Slyngshede)
[10:53:45] <Lucas_WMDE>	 I can tell you how to end-to-end test it from the wiki, at least
[10:53:47] <wikibugs>	 (PS1) Clément Goubert: Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/905967
[10:53:54] <wikibugs>	 (CR) Clément Goubert: [C: +2] Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/905967 (owner: Clément Goubert)
[10:54:13] <Lucas_WMDE>	 create a new item on test wikidata (Special:NewItem), load it on the mobile domain, check if you see anything above “statements” when javascript is not enabled
[10:54:39] <Lucas_WMDE>	 on real wikidata, creating a test item would be frowned upon ;) but you could get mostly the same effect by loading random items on the mobile site
[10:54:51] <Lucas_WMDE>	 (assuming that they won’t have a cached termbox already, this should still test the server-side rendering)
[10:55:09] <Lucas_WMDE>	 but I’m not so sure how to test the individual parts internally
[10:56:24] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[10:56:30] <Amir1>	 jouncebot: nowandnext
[10:56:31] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1000)
[10:56:31] <jouncebot>	 In 2 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1300)
[10:56:38] <Amir1>	 Shall I deploy a patch?
[10:56:45] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:57:05] * Lucas_WMDE afk
[10:58:22] <wikibugs>	 SRE, DBA, DiscussionTools, Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (Ladsgroup) I'm inclined to mark this as decline...
[10:58:31] <claime>	 q
[10:59:18] <Amir1>	 I can press F to pay respects if needed
[10:59:22] <wikibugs>	 (Merged) jenkins-bot: Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/905967 (owner: Clément Goubert)
[10:59:38] <claime>	 Amir1: SSH lockups make me type strange things.
[10:59:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:59:46] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[10:59:47] <claime>	 Amir1: You can go ahead for my part
[10:59:57] <Amir1>	 noted, merci
[11:00:02] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[11:00:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[11:00:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46050 and previous config saved to /var/cache/conftool/dbconfig/20230405-110022-root.json
[11:00:50] <wikibugs>	 (CR) Ladsgroup: "💔" [deployment-charts] - https://gerrit.wikimedia.org/r/905967 (owner: Clément Goubert)
[11:02:00] <claime>	 Oh I think I may know what's happening though
[11:02:20] <claime>	 the old api-ro listens on 443, so it doesn't need a port specified
[11:02:30] <claime>	 :lightbulb:
[11:02:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P46051 and previous config saved to /var/cache/conftool/dbconfig/20230405-110237-root.json
[11:04:53] <wikibugs>	 (PS2) Phuedx: VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168)
[11:05:11] <marostegui>	 !log Starting s5 eqiad failover from db1100 to db1130 - T331302
[11:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:15] <stashbot>	 T331302: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T331302
[11:05:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1130 to s5 primary T331302', diff saved to https://phabricator.wikimedia.org/P46052 and previous config saved to /var/cache/conftool/dbconfig/20230405-110530-root.json
[11:06:00] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) @wiki_willy that's because all those have `N/A` in the Accounting tab of the spreadsheet in the `Asset tag` column and so they don't match.
[11:06:17] <wikibugs>	 (CR) Ladsgroup: [V: +1 C: +2] Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) (owner: Ladsgroup)
[11:07:10] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) (owner: Ladsgroup)
[11:07:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1100 with 1% weight', diff saved to https://phabricator.wikimedia.org/P46053 and previous config saved to /var/cache/conftool/dbconfig/20230405-110717-root.json
[11:07:50] <Amir1>	 ugh, need to restart my pc
[11:09:12] <wikibugs>	 (CR) David Caro: P:ldap::client: split config and utils to a separate profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905600 (owner: Majavah)
[11:09:54] <marostegui>	 Amir1: you need to stop using Windows
[11:10:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:11:17] <urbanecm>	 marostegui: I feel that Windows can be used since WSL became a part of it.
[11:11:55] <wikibugs>	 (PS1) Superpes15: [mgwiki] Replace the wordmark on Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022)
[11:12:14] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye
[11:12:24] <moritzm>	 !log installing systemd security updates on buster
[11:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:36] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:12:37] <Amir1>	 marostegui: it's better than using Mac
[11:12:38] <marostegui>	 urbanecm: XDD
[11:12:59] <Amir1>	 but I respect everyone's flaws
[11:13:02] <wikibugs>	 (PS1) Slyngshede: Revert "partman: test updated flat-noswap.cfg" [puppet] - https://gerrit.wikimedia.org/r/905969
[11:13:04] <marostegui>	 Amir1: and Emacs?
[11:13:12] <wikibugs>	 (PS2) Majavah: hieradata: remove unused keys from labsdnsconfig [puppet] - https://gerrit.wikimedia.org/r/903258
[11:13:27] <Amir1>	 nope that is not respectable :P
[11:14:41] <wikibugs>	 (CR) Slyngshede: [C: +2] Revert "partman: test updated flat-noswap.cfg" [puppet] - https://gerrit.wikimedia.org/r/905969 (owner: Slyngshede)
[11:14:57] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:15:01] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:905609|Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" (T326800)]]
[11:15:05] <stashbot>	 T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800
[11:15:11] <wikibugs>	 (03CR) 10MVernon: "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon)
[11:15:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46054 and previous config saved to /var/cache/conftool/dbconfig/20230405-111527-root.json
[11:15:30] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Add db1220 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/905962 (https://phabricator.wikimedia.org/T326669)
[11:16:01] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] P:ldap::client: split config and utils to a separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah)
[11:16:11] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064)
[11:16:27] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:905609|Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" (T326800)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[11:16:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet-enc: added some tests for the api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875398 (owner: 10David Caro)
[11:17:04] <TheresNoTime>	 > Revert "Revert "Revert "Revert "mwscript: Switch to use run.php
[11:17:22] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:17:24] <TheresNoTime>	 10/10, no notes.
[11:17:26] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:17:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P46055 and previous config saved to /var/cache/conftool/dbconfig/20230405-111742-root.json
[11:17:44] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:18:11] <wikibugs>	 (03PS2) 10Clément Goubert: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064)
[11:19:34] <Amir1>	 :P
[11:19:36] <wikibugs>	 (03PS2) 10Clément Goubert: linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060)
[11:19:52] <wikibugs>	 (03PS2) 10Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061)
[11:20:14] <wikibugs>	 (03PS2) 10Clément Goubert: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062)
[11:20:38] <wikibugs>	 (03PS2) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[11:21:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet-enc: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 (owner: 10David Caro)
[11:22:08] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[11:22:33] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2004.codfw.wmnet with OS bullseye
[11:22:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P46056 and previous config saved to /var/cache/conftool/dbconfig/20230405-112240-root.json
[11:22:53] <TheresNoTime>	 Amir1: can you ping me when you're done? I just want to get `905764: Remove possibly significant whitespace from robots.txt | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905764` out
[11:23:08] <Amir1>	 sure
[11:23:18] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:23:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Add db1220 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/905962 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui)
[11:23:47] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:905609|Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" (T326800)]] (duration: 08m 45s)
[11:23:51] <stashbot>	 T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800
[11:24:18] <wikibugs>	 (03CR) 10Jcrespo: "No, thank you, I need to make sure I finish the transfer and setup first" [puppet] - 10https://gerrit.wikimedia.org/r/905966 (owner: 10Jcrespo)
[11:24:56] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:25:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875866 (owner: 10David Caro)
[11:25:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875825 (owner: 10David Caro)
[11:26:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:30] <wikibugs>	 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10jcrespo) > I'm inclined to mark this as decline...
[11:28:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:28:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:29:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mw1414.eqiad.wmnet
[11:30:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46057 and previous config saved to /var/cache/conftool/dbconfig/20230405-113031-root.json
[11:30:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46058 and previous config saved to /var/cache/conftool/dbconfig/20230405-113052-root.json
[11:31:00] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:31:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] P:ldap::client: split config and utils to a separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah)
[11:31:31] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage
[11:31:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:31:43] <Amir1>	 TheresNoTime: done now
[11:31:48] <TheresNoTime>	 ty!
[11:32:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) (owner: 10Legoktm)
[11:32:03] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/905986
[11:32:42] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10hnowlan)
[11:32:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P46059 and previous config saved to /var/cache/conftool/dbconfig/20230405-113246-root.json
[11:32:52] <wikibugs>	 (03Merged) 10jenkins-bot: Remove possibly significant whitespace from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) (owner: 10Legoktm)
[11:33:11] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:905764|Remove possibly significant whitespace from robots.txt (T334038)]]
[11:33:15] <stashbot>	 T334038: Excess whitespace in English Wikipedia robots.txt file could cause problems in some implementations - https://phabricator.wikimedia.org/T334038
[11:33:38] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10hnowlan)
[11:34:19] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage
[11:34:42] <logmsgbot>	 !log samtar@deploy2002 legoktm and samtar: Backport for [[gerrit:905764|Remove possibly significant whitespace from robots.txt (T334038)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[11:34:45] <TheresNoTime>	 (testing)
[11:35:14] <TheresNoTime>	 (syncing)
[11:35:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1414.eqiad.wmnet
[11:37:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P46060 and previous config saved to /var/cache/conftool/dbconfig/20230405-113745-root.json
[11:37:54] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/905966 (owner: 10Jcrespo)
[11:38:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/905986 (owner: 10Muehlenhoff)
[11:38:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:40:25] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:905764|Remove possibly significant whitespace from robots.txt (T334038)]] (duration: 07m 14s)
[11:40:29] <stashbot>	 T334038: Excess whitespace in English Wikipedia robots.txt file could cause problems in some implementations - https://phabricator.wikimedia.org/T334038
[11:41:00] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:41:09] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8724070, @MoritzMuehlenhoff wrote: > Looking at https://cassandra.apache.org/doc/latest/cassandra/getting_started/java11.html we should probably also continue to...
[11:43:42] <TheresNoTime>	 hm, how does one clear what I assume is a cached robots.txt? (for T334038)
[11:44:08] <taavi>	 purgeList.php would work here too I assume
[11:44:16] <TheresNoTime>	 ah, makes sense, thank you
[11:44:59] <TheresNoTime>	 yup :)
[11:45:15] <TheresNoTime>	 !log `[samtar@mwmaint2002 ~]$ echo 'https://en.wikipedia.org/robots.txt' | mwscript purgeList.php` T334038
[11:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46061 and previous config saved to /var/cache/conftool/dbconfig/20230405-114557-root.json
[11:47:35] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2004.codfw.wmnet with OS bullseye
[11:47:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46062 and previous config saved to /var/cache/conftool/dbconfig/20230405-114751-root.json
[11:52:16] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "Fix LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[11:52:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P46063 and previous config saved to /var/cache/conftool/dbconfig/20230405-115249-root.json
[11:53:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:54:45] <moritzm>	 !log installing apache2 security updates on buster
[11:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46064 and previous config saved to /var/cache/conftool/dbconfig/20230405-120101-root.json
[12:02:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46065 and previous config saved to /var/cache/conftool/dbconfig/20230405-120256-root.json
[12:04:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[12:06:16] <wikibugs>	 (03PS3) 10Samtar: Remove WikiEditor's Realtime Preview config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson)
[12:06:29] <TheresNoTime>	 jouncebot: nowandnext
[12:06:29] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 53 minute(s)
[12:06:29] <jouncebot>	 In 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1300)
[12:07:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ayounsi) a:03ayounsi
[12:07:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46066 and previous config saved to /var/cache/conftool/dbconfig/20230405-120754-root.json
[12:09:26] <TheresNoTime>	 o/ I intend to deploy `901553: Remove WikiEditor's Realtime Preview config vars | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/901553` — any reason not to?
[12:09:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10ayounsi) a:03ayounsi Taking that task, even if the current CR does the job, it could be refactored with @cmooney work to remove the duplicated co...
[12:10:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson)
[12:11:26] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi)
[12:12:00] <wikibugs>	 (03Merged) 10jenkins-bot: Remove WikiEditor's Realtime Preview config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson)
[12:12:26] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:901553|Remove WikiEditor's Realtime Preview config vars (T327515)]]
[12:12:30] <stashbot>	 T327515: Remove Realtime Preview's Beta Feature and Onboarding UI - https://phabricator.wikimedia.org/T327515
[12:13:51] <logmsgbot>	 !log samtar@deploy2002 samwilson and samtar: Backport for [[gerrit:901553|Remove WikiEditor's Realtime Preview config vars (T327515)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[12:13:52] <TheresNoTime>	 (testing)
[12:14:42] <TheresNoTime>	 (syncing)
[12:14:53] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission frdata1001.frack.eqiad.wmnet (WMF7292) - https://phabricator.wikimedia.org/T333971 (10Jgreen)
[12:15:09] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good, LGTM assuming that varnish tests are still happy (OoO today and I cannot run the tests)" [puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[12:16:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46067 and previous config saved to /var/cache/conftool/dbconfig/20230405-121606-root.json
[12:18:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46068 and previous config saved to /var/cache/conftool/dbconfig/20230405-121801-root.json
[12:18:36] <wikibugs>	 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) 05Open→03Declined Yeah, it was t...
[12:20:08] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901553|Remove WikiEditor's Realtime Preview config vars (T327515)]] (duration: 07m 41s)
[12:20:12] <stashbot>	 T327515: Remove Realtime Preview's Beta Feature and Onboarding UI - https://phabricator.wikimedia.org/T327515
[12:20:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "0 tests failed, 0 tests skipped, 17 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[12:20:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:23:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46069 and previous config saved to /var/cache/conftool/dbconfig/20230405-122259-root.json
[12:23:31] <wikibugs>	 (03CR) 10Jelto: "I'm excited about the new cookbook! I left some gitlab-specific comments in-line. I'm happy to take another look on future patchsets." [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney)
[12:25:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:27:16] <moritzm>	 !log installing xapian-core security updates
[12:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845 (10ayounsi) 05Open→03Resolved a:03ayounsi This is completed in drmrs, the same will be applied to the other sites when we bring L3 on the ToR switches as I don't think...
[12:28:41] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: move varnishkafka-exporter stats to counters [puppet] - 10https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085)
[12:31:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46070 and previous config saved to /var/cache/conftool/dbconfig/20230405-123111-root.json
[12:33:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46071 and previous config saved to /var/cache/conftool/dbconfig/20230405-123305-root.json
[12:38:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46072 and previous config saved to /var/cache/conftool/dbconfig/20230405-123804-root.json
[12:46:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46073 and previous config saved to /var/cache/conftool/dbconfig/20230405-124616-root.json
[12:48:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46074 and previous config saved to /var/cache/conftool/dbconfig/20230405-124810-root.json
[12:52:36] <wikibugs>	 (03PS1) 10Muehlenhoff: debdeploy-revdeps: Omit Breaks, Enhances, Conflicts, Replaces [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/906009
[12:53:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46075 and previous config saved to /var/cache/conftool/dbconfig/20230405-125308-root.json
[12:55:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Allow different port than default 22 (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi)
[12:56:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ottomata) @BTullis gave @Jgiannelos sql_role perms in T328457#8734396  I think we can close this.
[12:57:19] <wikibugs>	 (03Merged) 10jenkins-bot: Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi)
[12:58:32] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1008.eqiad.wmnet
[12:59:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: mute etcd-mirror pint promql checks [alerts] - 10https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1300)
[13:00:05] <jouncebot>	 sergi0 and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <sergi0>	 hi
[13:00:13] <phuedx>	 o/
[13:00:29] <Lucas_WMDE>	 o/ busy but probably available in 5mins or so
[13:00:45] <wikibugs>	 (03PS6) 10Sergio Gimeno: GrowthExperiments: enable add link backend in wiki rounds (8,9th) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T308133)
[13:01:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46076 and previous config saved to /var/cache/conftool/dbconfig/20230405-130121-root.json
[13:03:04] <wikibugs>	 (03PS3) 10Clément Goubert: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064)
[13:03:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46077 and previous config saved to /var/cache/conftool/dbconfig/20230405-130315-root.json
[13:03:24] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1008.eqiad.wmnet
[13:04:51] <wikibugs>	 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Aklapper)
[13:04:52] <Lucas_WMDE>	 ok, I can deploy!
[13:04:54] * Lucas_WMDE looks
[13:05:08] <wikibugs>	 (03PS1) 10Stevemunene: Decommission an-worker1132 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092)
[13:06:04] <sergi0>	 Lucas_WMDE: we won't be able to test much from my patch since the flag is only read on a maintenance script triggered by a periodic job. Should be safe though, we've been using it for a while.
[13:06:15] <Lucas_WMDE>	 ok
[13:07:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno)
[13:08:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46078 and previous config saved to /var/cache/conftool/dbconfig/20230405-130813-root.json
[13:08:35] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: enable add link backend in wiki rounds (8,9th) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno)
[13:08:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905950|GrowthExperiments: enable add link backend in wiki rounds (8,9th) (T308133 T308134)]]
[13:09:03] <stashbot>	 T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134
[13:09:03] <stashbot>	 T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133
[13:10:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and sgimeno: Backport for [[gerrit:905950|GrowthExperiments: enable add link backend in wiki rounds (8,9th) (T308133 T308134)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:11:15] * Lucas_WMDE quickly checks that fiwiki isn’t totally broken
[13:11:27] <Lucas_WMDE>	 heh, of course they have the NATO logo in the recent news section ^^
[13:11:32] <Lucas_WMDE>	 anyway, looks fine enough, syncing
[13:11:52] <taavi>	 of course we do, it's rather major news here :P
[13:12:09] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[13:12:28] <Lucas_WMDE>	 :D
[13:12:29] <sergi0>	 heh, thanks for checking Lucas_WMDE
[13:12:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/906009 (owner: 10Muehlenhoff)
[13:14:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:14:37] * Lucas_WMDE checks the enwiki and dewiki front pages for comparison and 🤮
[13:15:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:16:08] <wikibugs>	 (03CR) 10Jbond: "I'll merge these after Easter" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond)
[13:16:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:16:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905950|GrowthExperiments: enable add link backend in wiki rounds (8,9th) (T308133 T308134)]] (duration: 08m 00s)
[13:17:04] <stashbot>	 T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134
[13:17:05] <stashbot>	 T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133
[13:17:11] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): mediawiki.edit_attempt: Ignore events from PHP MPC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: 10Phuedx)
[13:17:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: 10Phuedx)
[13:17:39] <Lucas_WMDE>	 phuedx: can the edit_attempt change be tested on mwdebug?
[13:17:45] <wikibugs>	 (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019
[13:17:57] <phuedx>	 Lucas_WMDE: Yes
[13:18:02] <Lucas_WMDE>	 yay
[13:18:09] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki.edit_attempt: Ignore events from PHP MPC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: 10Phuedx)
[13:18:20] <wikibugs>	 (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019
[13:18:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905261|mediawiki.edit_attempt: Ignore events from PHP MPC (T309985)]]
[13:18:37] <stashbot>	 T309985: Migrate WikiEditor EditAttemptStep instrument to Metrics Platform - https://phabricator.wikimedia.org/T309985
[13:19:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:19:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: data-engineering: ignore 'status' label pint check [alerts] - 10https://gerrit.wikimedia.org/r/906020 (https://phabricator.wikimedia.org/T309182)
[13:19:50] <wikibugs>	 (03CR) 10Volans: "nits inline" [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:19:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:19:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:905261|mediawiki.edit_attempt: Ignore events from PHP MPC (T309985)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:20:11] <Lucas_WMDE>	 phuedx: then please test now :)
[13:21:42] <wikibugs>	 (03PS3) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019
[13:21:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:22:08] <wikibugs>	 (03CR) 10Ayounsi: "thx" [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:23:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46079 and previous config saved to /var/cache/conftool/dbconfig/20230405-132318-root.json
[13:23:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:23:59] <phuedx>	 Lucas_WMDE: LGTM. I tested the change by opening the editor on a random page on enwiki and typing a few characters (but not saving the change) and observing several analytics events being logged to the mediawiki.edit_attempt stream
[13:24:23] <Lucas_WMDE>	 ok!
[13:24:29] <Lucas_WMDE>	 thanks
[13:26:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:27:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:27:59] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] debdeploy-revdeps: Omit Breaks, Enhances, Conflicts, Replaces [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/906009 (owner: 10Muehlenhoff)
[13:28:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:28:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:29:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905261|mediawiki.edit_attempt: Ignore events from PHP MPC (T309985)]] (duration: 10m 52s)
[13:29:29] <stashbot>	 T309985: Migrate WikiEditor EditAttemptStep instrument to Metrics Platform - https://phabricator.wikimedia.org/T309985
[13:30:13] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) (owner: 10Phuedx)
[13:30:19] <wikibugs>	 (03PS4) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019
[13:30:49] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh)
[13:30:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) (owner: 10Phuedx)
[13:31:32] <wikibugs>	 (03Merged) 10jenkins-bot: VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) (owner: 10Phuedx)
[13:31:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905601|VisualEditorFeatureUse sampling rate to 1 everywhere (T333168)]]
[13:32:01] <stashbot>	 T333168: Increase VisualEditorFeatureUse sampling rate to 100% - https://phabricator.wikimedia.org/T333168
[13:32:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10ssingh) 05Open→03Resolved Oh, great. Thanks for sharing @Ottomata!
[13:33:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:905601|VisualEditorFeatureUse sampling rate to 1 everywhere (T333168)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:33:31] <Lucas_WMDE>	 phuedx: is this one testable too?
[13:34:08] <phuedx>	 Lucas_WMDE: Yeah. I can do a quick spot check on mwdebug2002. I'll be monitoring it after it rolls out too
[13:34:14] <Lucas_WMDE>	 ok thanks
[13:35:24] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Assign proper insetup Puppet roles to machines [puppet] - 10https://gerrit.wikimedia.org/r/906023
[13:35:43] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:35:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:36:05] <icinga-wm_>	 PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:36:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10ssingh) 05Open→03Resolved a:03ssingh @FNavas-foundation: Your access request has been merged. Please try again (in about 30 minutes from thi...
[13:36:40] <wikibugs>	 (03PS2) 10Elukey: istio: upgrade to upstream version 1.15.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068)
[13:38:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:39:19] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:40:35] <Lucas_WMDE>	 phuedx: are you doing the spot check?
[13:40:38] <Lucas_WMDE>	 just want to make sure we’re not both waiting for the other to say something ^^
[13:40:38] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi)
[13:40:47] <phuedx>	 Lucas_WMDE: Yes. Sorry. Thanks :)
[13:41:07] <Lucas_WMDE>	 ok, I’ll wait for your confirmation then ^^
[13:41:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Aligns with the role annotations we have in Hiera." [puppet] - 10https://gerrit.wikimedia.org/r/906023 (owner: 10Alexandros Kosiaris)
[13:41:38] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:41:39] <phuedx>	 Lucas_WMDE: LGTM. Thanks
[13:41:43] <Lucas_WMDE>	 ok thanks!
[13:41:47] <Lucas_WMDE>	 syncing
[13:42:36] <phuedx>	 I'll monitor the impact closely over at https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=VisualEditorFeatureUse
[13:43:57] <claime>	 Lucas_WMDE: If the port from WIKIBASE_REPO shows up for test, it'll show up for prod too where WIKIBASE_REPO: http://www.wikidata.org:6500/w
[13:44:09] <Lucas_WMDE>	 hm, good point ^^
[13:44:35] <claime>	 I'll wait until you're done with the deploy and merge the new patch, if that's all right with you?
[13:45:34] <Lucas_WMDE>	 sounds good!
[13:46:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905601|VisualEditorFeatureUse sampling rate to 1 everywhere (T333168)]] (duration: 14m 47s)
[13:46:49] <stashbot>	 T333168: Increase VisualEditorFeatureUse sampling rate to 100% - https://phabricator.wikimedia.org/T333168
[13:47:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[13:47:32] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[13:47:57] <Lucas_WMDE>	 claime: I’m done
[13:48:04] <claime>	 ack
[13:48:07] <claime>	 let's go :P
[13:48:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[13:48:32] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1009.eqiad.wmnet
[13:48:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:48:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1004.eqiad.wmnet with reason: restart kafka, switch to PKI
[13:49:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1004.eqiad.wmnet with reason: restart kafka, switch to PKI
[13:52:10] <elukey>	 !log restart kafka on kafka-main1004 to pick up the new TLS certificate (PKI based) - T319372
[13:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:14] <stashbot>	 T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[13:52:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1009.eqiad.wmnet
[13:52:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1010.eqiad.wmnet
[13:52:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[13:53:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[13:53:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[13:53:52] <claime>	 Lucas_WMDE: deployed the test release
[13:54:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[13:54:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[13:54:16] <Lucas_WMDE>	 ok, checking
[13:54:30] <claime>	 And the staging release for good measure, even if it doesn't seem used
[13:54:42] <wikibugs>	 (03PS1) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[13:54:55] <Lucas_WMDE>	 hm, no termbox on https://test.m.wikidata.org/wiki/Q229878 with noscript
[13:55:09] * claime grumbles
[13:55:25] <Lucas_WMDE>	 “timeout of 3000ms exceeded” in logstash :(
[13:55:29] <claime>	 Yeah
[13:55:57] <wikibugs>	 (03PS10) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669)
[13:56:04] <wikibugs>	 (03PS17) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[13:56:57] <claime>	 Lucas_WMDE: Can test.wikidata be switched to use the staging release and not the test release ?
[13:57:05] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1010.eqiad.wmnet
[13:57:07] <claime>	 (the one that's using the service mesh)
[13:57:52] <claime>	 On port 4004, not 3031
[13:57:58] <Lucas_WMDE>	 no idea tbh
[13:58:32] <wikibugs>	 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Ottomata) From a brief glance, those look like normal consumer reassignment messages.  Probably shouldn't be alerts.
[13:58:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1008.eqiad.wmnet with OS bullseye
[13:58:39] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1008.eqiad.wmnet with OS bullseye
[13:58:41] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) 05Open→03Resolved updated backplane firmware looks like errors have resolved
[13:59:24] <Lucas_WMDE>	 I tried curling termbox-test.staging.svc.eqiad.wmnet:4004 and got “empty reply from server”
[14:00:05] <claime>	 Lucas_WMDE: curl -k https://termbox-test.staging.svc.eqiad.wmnet:4004/?spec 
[14:00:26] <claime>	 curl -k https://termbox-test.staging.svc.eqiad.wmnet:3031/?spec 
[14:00:28] <claime>	 curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
[14:00:30] <claime>	 lol.
[14:00:32] <claime>	 awesome.
[14:00:48] <Lucas_WMDE>	 ok, but
[14:00:49] <Lucas_WMDE>	 curl -vk 'https://termbox-test.staging.svc.eqiad.wmnet:4004/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en'
[14:00:51] <elukey>	 !log powercycle an-worker1132
[14:00:52] <Lucas_WMDE>	 is a 500 Internal Server Error
[14:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Bird: remove anycast subnet filter [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[14:01:32] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Bird: remove anycast subnet filter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[14:03:15] <Lucas_WMDE>	 hm, but this is interesting
[14:03:26] <Lucas_WMDE>	 if I curl termbox-test:4004, there are different logstash errors
[14:03:31] <Lucas_WMDE>	 “Request failed with status code 503”
[14:03:43] <Lucas_WMDE>	 and it’s apparently talking to www.wikidata.org:6500 ?
[14:04:07] <Lucas_WMDE>	 so https://termbox-test.staging.svc.eqiad.wmnet:4004 is somehow a prod termbox instead of a testwikidatawiki one?
[14:04:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10nskaggs) As an update, this is now blocked on {T297596}. The previous implementation discussion led to a finalization of guidelines...
[14:04:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10nskaggs)
[14:04:44] <claime>	 Lucas_WMDE: Yeah, staging uses the same endpoints as prod
[14:04:45] <Lucas_WMDE>	 I guess that’s because values-staging.yaml doesn’t change the WIKIBASE_REPO etc. like values-test.yaml does
[14:04:48] <claime>	 yeah
[14:04:51] <claime>	 exactly
[14:04:53] <Lucas_WMDE>	 ok
[14:05:04] <Lucas_WMDE>	 but doesn’t that mean we can’t use it for test wikidata?
[14:05:08] <claime>	 I have a meeting, I'll roll back
[14:05:12] <Lucas_WMDE>	 ok thanks
[14:05:19] <claime>	 We'll figure that out :P
[14:05:47] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "Revert "Revert "termbox: Switch to mw-api-int-async on k8s""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905973
[14:05:59] <Lucas_WMDE>	 competing with Amir1 for the most reverts in a commit message, I see :P
[14:06:17] <claime>	 Yes :D
[14:06:34] <claime>	 I'll end up doing Revert^5
[14:08:37] <Amir1>	 you have a lot to catch up on. I'm not worried
[14:08:43] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "Revert "Revert "termbox: Switch to mw-api-int-async on k8s""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905973 (owner: 10Clément Goubert)
[14:10:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:10:30] <wikibugs>	 (03PS1) 10Jbond: P:netbox: add consumeres fo prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/906031
[14:11:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1008.eqiad.wmnet with reason: host reimage
[14:11:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[14:12:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:netbox: add consumeres fo prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/906031 (owner: 10Jbond)
[14:13:55] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Revert "termbox: Switch to mw-api-int-async on k8s""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905973 (owner: 10Clément Goubert)
[14:14:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[14:14:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[14:14:24] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1008.eqiad.wmnet with reason: host reimage
[14:14:29] <claime>	 Lucas_WMDE: rollback done
[14:15:15] <ottomata>	 phuedx, looks like VisualEditorFeatureUse validation errors are creeping up
[14:16:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10JArguello-WMF)
[14:16:18] <Lucas_WMDE>	 claime: thanks
[14:22:54] <wikibugs>	 (03CR) 10Muehlenhoff: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[14:23:59] <ottomata>	 phuedx: e.g. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-default-1-7.0.0-1-2023.04.05?id=E13NUYcBwEtI0jFYGROB
[14:24:37] <phuedx>	 ottomata: Looking. There appears to be a bunch of events with missing data. Happy to roll the change back and then investigate what's going on with the instrument
[14:24:48] * Lucas_WMDE still around if needed
[14:24:51] <ottomata>	 phuedx: no need to roll back, just as long as you know and are working on it.  
[14:25:06] <ottomata>	 it looks like maybe these errors were there before, it's just that now there are more of them
[14:25:23] <ottomata>	 the errors aren't hurting anything right now
[14:25:51] <ottomata>	 https://grafana.wikimedia.org/goto/qhs4YFLVz?orgId=1
[14:26:42] <wikibugs>	 (03PS2) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[14:27:25] <wikibugs>	 (03CR) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[14:30:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:30:32] <wikibugs>	 (03PS1) 10Majavah: cinderutils: stop provisioning old filename on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/906034
[14:30:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10elukey) 05Resolved→03Open @Jclark-ctr hi! I tried to reboot the node and it gets blocked when checking the hard drivers, telling me about possible preserved cache et...
[14:30:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1005.eqiad.wmnet with reason: restart kafka, switch to PKI
[14:31:05] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1005.eqiad.wmnet with reason: restart kafka, switch to PKI
[14:31:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1008.eqiad.wmnet with OS bullseye
[14:31:24] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1008.eqiad.wmnet with OS bullseye completed: - kafka-test1008 (**PASS**)   - Downtimed on Icinga/A...
[14:33:37] <elukey>	 !log restart kafka on kafka-main1005 to pick up the new TLS certificate (PKI based) - T319372
[14:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:41] <stashbot>	 T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[14:34:13] <wikibugs>	 (03CR) 10Volans: P:netbox: add consumeres fo prefixes and net devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906031 (owner: 10Jbond)
[14:34:38] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: increase mem overhead to 45% [deployment-charts] - 10https://gerrit.wikimedia.org/r/906035
[14:36:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1009.eqiad.wmnet with OS bullseye
[14:36:40] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1009.eqiad.wmnet with OS bullseye
[14:38:11] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: increase mem overhead to 45% [deployment-charts] - 10https://gerrit.wikimedia.org/r/906035 (owner: 10DCausse)
[14:43:30] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: increase mem overhead to 45% [deployment-charts] - 10https://gerrit.wikimedia.org/r/906035 (owner: 10DCausse)
[14:44:38] <wikibugs>	 (03CR) 10Muehlenhoff: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[14:48:12] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:48:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:48:22] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[14:48:28] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:49:27] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:50] <wikibugs>	 (03CR) 10AOkoth: exim: fix hard-coded vrts hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[14:51:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1009.eqiad.wmnet with reason: host reimage
[14:54:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1009.eqiad.wmnet with reason: host reimage
[14:55:09] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: increase jvm-overhead.max [deployment-charts] - 10https://gerrit.wikimedia.org/r/906040
[14:55:49] <wikibugs>	 (03CR) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[14:58:19] <wikibugs>	 (03PS3) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[14:58:22] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] scap: block Scap execution on inactive deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[14:59:35] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] scap: block Scap execution on inactive deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[14:59:43] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:59:50] <wikibugs>	 (03CR) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[15:00:08] <phuedx>	 ottomata: I should have looked at the error rate and nature of the errors before increasing the sampling rate. I'm OoO next week and am trying to close out a few dangling threads. I think it's best to revert for now and take a look at the instrument when I get back
[15:00:11] <phuedx>	 ^ Lucas_WMDE
[15:01:03] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:05] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:02:36] <wikibugs>	 (03PS1) 10Phuedx: Revert "VisualEditorFeatureUse sampling rate to 1 everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905979
[15:02:38] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki::scap: Ensure Exec['fetch_mediawiki'] resource always exists [puppet] - 10https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[15:03:25] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: 10Filippo Giunchedi)
[15:03:29] <Lucas_WMDE>	 o/
[15:03:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) @ssingh -- thank you. Now i can't get in but i think it is an ITS issue.
[15:03:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:03:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:04:15] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.376 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:04:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10ssingh) >>! In T331482#8759447, @FNavas-foundation wrote: > @ssingh -- thank you. Now i can't get in but i think it is an ITS issue.  Make sure you are logging in with...
[15:04:33] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:04:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:05:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:05:25] <Lucas_WMDE>	 phuedx: we’re reverting VisualEditorFeatureUse, not edit_attempt, right?
[15:05:33] <Lucas_WMDE>	 ah, I see the revert already exists :)
[15:05:42] <phuedx>	 Lucas_WMDE: Yes. That's correct. Revert exists: https://gerrit.wikimedia.org/r/905979
[15:05:54] <Lucas_WMDE>	 jouncebot: now
[15:05:54] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 54 minute(s)
[15:05:56] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10colewhite)
[15:06:26] <Lucas_WMDE>	 don’t see anything else going on that looks like I shouldn’t deploy right now
[15:06:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905979 (owner: 10Phuedx)
[15:06:32] <Lucas_WMDE>	 let’s go
[15:07:00] <wikibugs>	 (03PS1) 10Mazevedo: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481)
[15:07:22] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "VisualEditorFeatureUse sampling rate to 1 everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905979 (owner: 10Phuedx)
[15:07:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905979|Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"]]
[15:07:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo)
[15:08:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[15:09:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[15:09:18] <moritzm>	 !log installing nodejs security updates on buster
[15:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:905979|Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[15:09:34] <Lucas_WMDE>	 phuedx: anything to test, or should I just deploy right away?
[15:10:01] <phuedx>	 Lucas_WMDE: Deploy right away I think. I'll monitor error rate and event rate
[15:10:09] <Lucas_WMDE>	 ok, doing
[15:10:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1009.eqiad.wmnet with OS bullseye
[15:10:31] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1009.eqiad.wmnet with OS bullseye completed: - kafka-test1009 (**PASS**)   - Downtimed on Icinga/A...
[15:11:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:13:08] <phuedx>	 Thanks, Lucas_WMDE 
[15:13:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:13:35] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] rsyslog: add rsyslog-namespaced fields to syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/904597 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite)
[15:14:20] <wikibugs>	 (03PS1) 10Eevans: cassandra: create aqs cluster user for 'fgoodwin' [puppet] - 10https://gerrit.wikimedia.org/r/906044 (https://phabricator.wikimedia.org/T334099)
[15:14:56] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:15:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905979|Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"]] (duration: 07m 42s)
[15:15:40] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra: create aqs cluster user for 'fgoodwin' [puppet] - 10https://gerrit.wikimedia.org/r/906044 (https://phabricator.wikimedia.org/T334099) (owner: 10Eevans)
[15:16:09] * Lucas_WMDE done
[15:16:24] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:17:00] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) You missed a part of this conversation that involved ITS removing fnavas-foundation in favor of the verified WMF ITS created SUL wiki account FNavas-...
[15:21:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=7; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:21:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10ssingh) >>! In T331482#8759537, @FNavas-foundation wrote: > You missed a part of this conversation that involved ITS > removing fnavas-foundation in favor of the verifi...
[15:21:53] <moritzm>	 !log installing pcre2 security updates on buster
[15:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:00] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:24:37] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:25:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:26:11] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 5.530 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:27:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1010.eqiad.wmnet with OS bullseye
[15:27:19] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1010.eqiad.wmnet with OS bullseye
[15:28:00] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:30:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=8; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:31:25] <moritzm>	 !log restarting FPM on mediawiki canaries to pick up pcre security update
[15:31:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:24] <wikibugs>	 (03CR) 10Andrew Bogott: "Hello everyone!  This is still a useful patch, still in need of review." [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott)
[15:37:38] <wikibugs>	 (03CR) 10Andrew Bogott: "Hello everyone!  This is still a useful patch, still in need of review." [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott)
[15:37:44] <wikibugs>	 (03PS1) 10Ahmon Dancy: mediawiki::scap: force creation of the symlink when enabled [puppet] - 10https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857)
[15:38:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki::scap: force creation of the symlink when enabled [puppet] - 10https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[15:38:48] <wikibugs>	 (03CR) 10Dzahn: "Yes, Arnold, that's correct. Change would be just in DNS then." [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[15:39:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:39:40] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[15:40:07] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] "Good catch, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: 10Filippo Giunchedi)
[15:40:19] <wikibugs>	 (03PS2) 10Ahmon Dancy: mediawiki::scap: force creation of the symlink when enabled [puppet] - 10https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857)
[15:41:21] <wikibugs>	 (03PS1) 10Andrew Bogott: wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project [puppet] - 10https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471)
[15:41:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1010.eqiad.wmnet with reason: host reimage
[15:42:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[15:42:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[15:42:57] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40548/console" [puppet] - 10https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: 10Filippo Giunchedi)
[15:44:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10ssingh) @KFrancis: Hi! @MarcoAurelio needs an NDA for this request to proceed. Thank you!
[15:44:19] <wikibugs>	 (03CR) 10Andrew Bogott: "To be merged during migration window" [puppet] - 10https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471) (owner: 10Andrew Bogott)
[15:44:48] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1010.eqiad.wmnet with reason: host reimage
[15:47:36] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki::scap: force creation of the symlink when enabled [puppet] - 10https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[15:47:53] <brett>	 !log Disable Puppet/PyBal on lvs4008 in preparation for reimaging - T321309
[15:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:57] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[15:50:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[15:51:09] <herzog>	 Amir1: got a minute, may I PM?
[15:51:24] <Amir1>	 sure
[15:51:25] <Amir1>	 what's up
[15:51:33] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:51:37] <icinga-wm_>	 PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:51:50] <sukhe>	 ^ expected, brett is reimaging lvs4008
[15:52:23] <icinga-wm_>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:52:43] <icinga-wm_>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:54:47] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[15:55:31] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:02:44] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1010.eqiad.wmnet with OS bullseye
[16:02:50] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1010.eqiad.wmnet with OS bullseye completed: - kafka-test1010 (**PASS**)   - Downtimed on Icinga/A...
[16:04:01] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:04:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:04:45] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:11:19] <wikibugs>	 (03PS1) 10BCornwall: hiera: lvs/interfaces: update lvs4008 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309)
[16:12:46] <wikibugs>	 (03CR) 10Ssingh: hiera: lvs/interfaces: update lvs4008 iface name (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[16:13:43] <wikibugs>	 (03PS2) 10BCornwall: hiera: lvs/interfaces: update lvs4008 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309)
[16:13:45] <wikibugs>	 (03CR) 10BCornwall: hiera: lvs/interfaces: update lvs4008 iface name (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[16:16:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update lvs4008 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[16:18:08] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:18:55] <wikibugs>	 (03CR) 10Tchanders: [C: 03+1] Undeploy SimilarEditors from Beta (2 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala)
[16:19:38] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update lvs4008 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[16:20:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bullseye
[16:20:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye
[16:22:01] <wikibugs>	 (03CR) 10JHathaway: exim: fix hard-coded vrts hostname (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[16:24:29] <wikibugs>	 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10elukey) Open→Resolved a:elukey All nodes migrated to Bullseye!  To keep archives happy - I didn't preserve any data when reimaging the VMs, Kafka's data was not a lot and the brokers were abl...
[16:24:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10elukey)
[16:28:19] <wikibugs>	 (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065
[16:29:22] <wikibugs>	 (03PS1) 10Jbond: spicerack: install python3-aiohttp [puppet] - 10https://gerrit.wikimedia.org/r/906066
[16:30:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:30:14] <wikibugs>	 (03PS2) 10Jbond: spicerack: install python3-aiohttp [puppet] - 10https://gerrit.wikimedia.org/r/906066
[16:30:43] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:31:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond)
[16:31:04] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "-1: re volans comment" [puppet] - 10https://gerrit.wikimedia.org/r/906031 (owner: 10Jbond)
[16:31:09] <wikibugs>	 (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065
[16:33:15] <wikibugs>	 (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond)
[16:34:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond)
[16:34:14] <wikibugs>	 (03PS3) 10JHathaway: Add an in place Debian upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706)
[16:35:18] <wikibugs>	 (03CR) 10JHathaway: Add an in place Debian upgrade script (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[16:36:35] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:37:11] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[16:40:29] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[16:45:10] <wikibugs>	 (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio (1 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond)
[16:46:29] <wikibugs>	 (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio (1 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond)
[16:47:08] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in codfw: Depool from primary DC following network maintenance
[16:47:09] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors
[16:47:12] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors
[16:47:43] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs4008.ulsfo.wmnet with OS bullseye
[16:47:50] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye executed with errors: - lvs4008 (**FAIL**)   - Downtimed on Icinga/Alertmanager...
[16:47:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bullseye
[16:48:02] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye
[16:52:11] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: Depool from primary DC following network maintenance
[16:54:11] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:54:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:56:31] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lists1003.wikimedia.org with reason: Moar CPUs!
[16:56:46] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists1003.wikimedia.org with reason: Moar CPUs!
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1700)
[17:00:20] <wikibugs>	 (03PS1) 10Ahmon Dancy: Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983
[17:00:32] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10jhathaway) I bumped the CPU count to four and as @MoritzMuehlenhoff mentioned we can always bump higher if the need arises.
[17:02:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983 (owner: 10Ahmon Dancy)
[17:03:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[17:04:40] <wikibugs>	 (03PS1) 10Jdlrobson: ReadingLists: Show previews on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906069
[17:06:27] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[17:08:11] <cjming>	 jouncebot: now
[17:08:12] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1700)
[17:18:45] <wikibugs>	 10SRE-swift-storage: Bring ms-fe101[3-4] into service - https://phabricator.wikimedia.org/T334122 (10Eevans)
[17:19:31] <wikibugs>	 10SRE-swift-storage: Bring ms-fe101[3-4] into service - https://phabricator.wikimedia.org/T334122 (10Eevans) p:Triage→Medium
[17:19:31] <wikibugs>	 (03CR) 10Volans: Add an in place Debian upgrade script (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[17:22:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bullseye
[17:22:37] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed: - lvs4008 (**WARN**)   - Downtimed on Icinga/Alertmanager   - //Unable...
[17:23:47] <icinga-wm_>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:27:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906069 (owner: 10Jdlrobson)
[17:28:35] <wikibugs>	 (03Merged) 10jenkins-bot: ReadingLists: Show previews on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906069 (owner: 10Jdlrobson)
[17:28:35] <icinga-wm_>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:28:40] <cjming>	 !log deploying labs-only change
[17:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:27] <wikibugs>	 (03Abandoned) 10David Caro: buildservice: use /app as workingdir [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/906005 (owner: 10David Caro)
[17:32:23] <brett>	 !log Disable Puppet/PyBal on lvs4009 in preparation for reimaging - T321309
[17:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:27] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[17:34:36] <wikibugs>	 (03PS1) 10BCornwall: hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309)
[17:35:02] <wikibugs>	 (03PS2) 10BCornwall: hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309)
[17:35:47] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[17:35:59] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[17:36:05] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[17:36:07] <icinga-wm_>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:36:14] <sukhe>	 ^ expected
[17:36:17] <icinga-wm_>	 PROBLEM - pybal on lvs4009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:36:27] <icinga-wm_>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:36:40] <wikibugs>	 (03PS1) 10Eevans: swift: add ms-fe101[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122)
[17:38:39] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[17:41:53] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:46:13] <wikibugs>	 (03CR) 10David Caro: "This is going to help a lot testing stuff \o/" [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[17:50:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS bullseye
[17:51:06] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye
[17:51:22] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[17:54:36] <wikibugs>	 (03PS2) 10Mazevedo: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481)
[17:54:42] <wikibugs>	 (03CR) 10Dzahn: "I am going through the users of this role at https://openstack-browser.toolforge.org/puppetclass/role::simplelamp2 to check what their cur" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn)
[17:55:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo)
[17:57:25] <wikibugs>	 (03PS3) 10Mazevedo: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481)
[18:00:05] <jouncebot>	 hashar and dduvall: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1800). Please do the needful.
[18:00:05] <jouncebot>	 hashar and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1800).
[18:03:16] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (5 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[18:06:33] <wikibugs>	 (03PS2) 10Ahmon Dancy: Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983 (https://phabricator.wikimedia.org/T329857)
[18:09:43] <wikibugs>	 (03CR) 10Dzahn: "example compile on cloud VPS host name: https://puppet-compiler.wmflabs.org/output/888800/40549/signwriting-swis-2022.signwriting.eqiad1.w" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn)
[18:11:33] <wikibugs>	 (03CR) 10Dzahn: "I think now that I have to go through each existing project, check their data dir and whether they have restarted, then those that actuall" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn)
[18:16:53] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Let's deploy this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896837 (owner: 10Legoktm)
[18:22:08] <wikibugs>	 (03PS1) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085
[18:22:10] <wikibugs>	 (03PS1) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127)
[18:22:12] <wikibugs>	 (03PS1) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127)
[18:31:46] <wikibugs>	 (03PS2) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085
[18:31:48] <wikibugs>	 (03PS2) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127)
[18:31:50] <wikibugs>	 (03PS2) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127)
[18:37:32] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs4009.ulsfo.wmnet with OS bullseye
[18:37:39] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye executed with errors: - lvs4009 (**FAIL**)   - Downtimed on Icinga/Alertmanager...
[18:37:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS bullseye
[18:38:03] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye
[18:38:13] <wikibugs>	 (03CR) 10JHathaway: Add an in place Debian upgrade script (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[18:48:49] <wikibugs>	 (03CR) 10Herron: [C: 03+1] sre: mute etcd-mirror pint promql checks [alerts] - 10https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[18:52:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage
[18:56:57] <wikibugs>	 (03PS4) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[18:58:25] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage
[19:12:16] <icinga-wm_>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:12:57] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4009.ulsfo.wmnet with OS bullseye
[19:13:03] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye completed: - lvs4009 (**PASS**)   - Removed from Puppet and PuppetDB if present...
[19:19:01] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@944a995]: Regular analytics weekly train [analytics/refinery@944a995]
[19:19:50] <icinga-wm_>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:24:46] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[19:25:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) ITS should have removed fnavas-foundation entirely. Do you still see it? Or is the only issue that FNavas-WMF is not LDAP?
[19:25:33] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@944a995]: Regular analytics weekly train [analytics/refinery@944a995] (duration: 06m 31s)
[19:25:42] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@944a995] (thin): Regular analytics weekly train THIN [analytics/refinery@944a995]
[19:25:43] <wikibugs>	 (03PS2) 10Dzahn: simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571)
[19:25:51] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@944a995] (thin): Regular analytics weekly train THIN [analytics/refinery@944a995] (duration: 00m 08s)
[19:25:59] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@944a995] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@944a995]
[19:26:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:27:28] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@944a995] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@944a995] (duration: 01m 29s)
[19:27:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn)
[19:29:33] <wikibugs>	 (03PS1) 10BCornwall: hiera: lvs/interfaces: update 5004 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906098 (https://phabricator.wikimedia.org/T321309)
[19:30:35] <brett>	 !log Disable Puppet/PyBal on lvs5004 in preparation for reimaging - T321309
[19:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:39] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[19:31:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:32:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) @FNavas-foundation What matters here is what login you are using on the wikitech wiki ( https://wikitech.wikimedia.org/wiki/Main_Page). If the user works there t...
[19:32:22] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10MarcoAurelio)
[19:32:27] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10MarcoAurelio)
[19:34:20] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[19:34:30] <herzog>	 hmm - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1 <-- mailman queue got somewhat backlogged again today
[19:35:34] <icinga-wm_>	 PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:35:44] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[19:35:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Now, when checking LDAP we can see there are 2 users:  - uid: fnavas (43544) - uid: fnavas-foundation (43670)  Both are using the same -ctr@wikimedia email addre...
[19:39:05] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update 5004 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906098 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[19:41:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10MarcoAurelio) Not sure if there's anything actionable here left to do. Lo...
[19:41:36] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update 5004 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906098 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[19:42:24] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:42:36] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:48:03] <brett>	 sukhe: Just making sure this isn't possibly my fault
[19:50:16] <sukhe>	 brett: no, not related to us
[19:50:18] <sukhe>	 all good
[19:52:53] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5004.eqsin.wmnet with OS bullseye
[19:53:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5004.eqsin.wmnet with OS bullseye
[19:54:14] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10MarcoAurelio)
[19:54:56] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@eb4c2b2]: Regular analytics weekly train [analytics/refinery@eb4c2b2]
[19:55:44] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:55:58] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:56:59] <urandom>	 If no one objects, I will depool sessionstore in eqiad in the next 30 minutes or so to conduct some experiments (see: T327954)
[19:56:59] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T2000). Please do the needful.
[20:00:05] <jouncebot>	 tsepoThoabala, nray, and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:01] <nray>	 o/
[20:01:07] <Superpes>	 Hello :)
[20:01:22] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@eb4c2b2]: Regular analytics weekly train [analytics/refinery@eb4c2b2] (duration: 06m 26s)
[20:01:32] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@eb4c2b2] (thin): Regular analytics weekly train THIN [analytics/refinery@eb4c2b2]
[20:01:41] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@eb4c2b2] (thin): Regular analytics weekly train THIN [analytics/refinery@eb4c2b2] (duration: 00m 08s)
[20:01:59] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@eb4c2b2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eb4c2b2]
[20:02:20] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:02:36] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:03:33] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@eb4c2b2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eb4c2b2] (duration: 01m 34s)
[20:06:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) @elukey so the foreign drives have affected both OS drives; it will need to be reimaged and is not letting me clear it. I did open the box and did find a...
[20:09:33] <wikibugs>	 (03CR) 10Ottomata: Updates to kafka-dev chart for running in minikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 (owner: 10Ottomata)
[20:12:07] <wikibugs>	 (03PS2) 10AOkoth: exim: fix hard-coded vrts hostname [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515)
[20:13:54] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:14:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) @elukey I was able to clear the foreign status, but it will still need to be reimaged.
[20:14:12] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:15:19] <tsepoThoabala>	 Is anyone around to help deploy?
[20:16:41] <Superpes>	 Uhm... no one replied to the ping :(
[20:17:18] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@2192f15]: (no justification provided)
[20:17:31] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@2192f15]: (no justification provided) (duration: 00m 12s)
[20:18:26] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Updates to kafka-dev chart for running in minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 (owner: 10Ottomata)
[20:18:53] <cjming>	 hi - sorry to be late - i can help deploy if there's still a need
[20:19:07] <nray>	 \o/
[20:19:35] <cjming>	 ok - i'll start with the top of the queue
[20:19:37] <tsepoThoabala>	 yes please
[20:19:49] <wikibugs>	 (03PS2) 10Clare Ming: Undeploy SimilarEditors from Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala)
[20:19:52] <nray>	 thank you cjming :) 
[20:20:16] <cjming>	 np!
[20:21:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage
[20:21:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala)
[20:21:58] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy SimilarEditors from Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala)
[20:22:22] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:896936|Undeploy SimilarEditors from Beta (T331718)]]
[20:22:28] <stashbot>	 T331718: Undeploy SimilarEditors from Beta - https://phabricator.wikimedia.org/T331718
[20:22:41] <cjming>	 tsepoThoabala: are your changes testable? on any debug server if so
[20:23:00] <tsepoThoabala>	 no they are not.
[20:23:12] <cjming>	 so i'll just sync then
[20:23:24] <tsepoThoabala>	 cool thanks.
[20:24:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage
[20:24:36] <wikibugs>	 (03Merged) 10jenkins-bot: Updates to kafka-dev chart for running in minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 (owner: 10Ottomata)
[20:26:16] <cjming>	 scap is hanging a bit
[20:29:07] <wikibugs>	 (03PS1) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007)
[20:33:20] <wikibugs>	 (03PS1) 10JHathaway: aux: Update jaeger templates to match upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/906104 (https://phabricator.wikimedia.org/T320554)
[20:35:22] <wikibugs>	 (03PS2) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007)
[20:35:24] <icinga-wm_>	 RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[20:38:42] <wikibugs>	 (03PS3) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007)
[20:43:54] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:44:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[20:44:02] <logmsgbot>	 !log cjming@deploy2002 tsepothoabala and cjming: Backport for [[gerrit:896936|Undeploy SimilarEditors from Beta (T331718)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:44:06] <stashbot>	 T331718: Undeploy SimilarEditors from Beta - https://phabricator.wikimedia.org/T331718
[20:44:36] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[20:44:53] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5004.eqsin.wmnet with OS bullseye
[20:44:59] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5004.eqsin.wmnet with OS bullseye completed: - lvs5004 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled...
[20:45:24] <cjming>	 tsepoThoabala: syncing now -- it's been a while since i last deployed -- i don't recall scap taking so long but i guess that's the new normal these days
[20:45:28] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:45:52] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:45:55] <tsepoThoabala>	 yes, this seems to have gone a bit long, thanks
[20:46:29] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[20:49:21] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[20:51:59] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux: Update jaeger templates to match upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/906104 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway)
[20:57:43] <brett>	 !log Disable Puppet/PyBal on lvs5005 in preparation for reimaging - T321309
[20:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:47] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[20:58:03] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:896936|Undeploy SimilarEditors from Beta (T331718)]] (duration: 35m 41s)
[20:58:07] <stashbot>	 T331718: Undeploy SimilarEditors from Beta - https://phabricator.wikimedia.org/T331718
[20:58:12] <cjming>	 tsepoThoabala: your changes should be live!
[20:58:37] <tsepoThoabala>	 cjming thank you.
[20:58:56] <cjming>	 hi nray! shall we move on to your patch?
[20:59:03] <nray>	 cjming: sounds good!
[20:59:05] <wikibugs>	 (03PS3) 10Dzahn: simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571)
[20:59:25] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] add haproxy ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902611 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[20:59:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray)
[21:00:19] <wikibugs>	 (03Merged) 10jenkins-bot: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray)
[21:00:40] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:905769|Add static mobile United_States page to facilitate synthetic testing of T331681 (T331681)]]
[21:00:44] <stashbot>	 T331681: Measure performance of cookie-based anonymous client preferences - https://phabricator.wikimedia.org/T331681
[21:01:31] <wikibugs>	 (03PS1) 10BCornwall: hiera: lvs/interfaces: update 5005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906126 (https://phabricator.wikimedia.org/T321309)
[21:01:52] <cjming>	 !log UTC late backport & config window continuing
[21:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:06] <logmsgbot>	 !log cjming@deploy2002 cjming and nray: Backport for [[gerrit:905769|Add static mobile United_States page to facilitate synthetic testing of T331681 (T331681)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[21:02:25] <cjming>	 nray: can you test on a debug server?
[21:02:31] <nray>	 cjming: yes
[21:02:39] <nray>	 cjming: testing now, thank you!
[21:02:45] <cjming>	 that went way faster
[21:04:00] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[21:04:56] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[21:05:02] <icinga-wm_>	 PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[21:05:16] <nray>	 cjming: tested and things look good
[21:05:29] <cjming>	 nray: great - syncing
[21:05:34] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update 5005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906126 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[21:05:42] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update 5005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906126 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[21:07:12] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:08:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:09:29] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001"
[21:09:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/888800/40550/signwriting-swis-2022.signwriting.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn)
[21:10:34] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001"
[21:10:34] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:10:46] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:905769|Add static mobile United_States page to facilitate synthetic testing of T331681 (T331681)]] (duration: 10m 06s)
[21:10:50] <stashbot>	 T331681: Measure performance of cookie-based anonymous client preferences - https://phabricator.wikimedia.org/T331681
[21:10:58] <cjming>	 nray: your changes are live! nice to see you :)
[21:11:33] <nray>	 cjming: \o/ thank you! Great to see you too!
[21:11:47] <cjming>	 Superpes: if you're still around, happy to do your patch
[21:12:23] <Superpes>	 Hi cjming :D Yep I'm here! Many thanks :)
[21:12:24] <wikibugs>	 (03PS1) 10Andrew Bogott: Added dummy ldap_os_system_pass [labs/private] - 10https://gerrit.wikimedia.org/r/906128 (https://phabricator.wikimedia.org/T330759)
[21:12:35] <wikibugs>	 (03PS2) 10Clare Ming: [mgwiki] Replace the wordmark on Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) (owner: 10Superpes15)
[21:12:46] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added dummy ldap_os_system_pass [labs/private] - 10https://gerrit.wikimedia.org/r/906128 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[21:13:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:14:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) (owner: 10Superpes15)
[21:14:45] <wikibugs>	 (03Merged) 10jenkins-bot: [mgwiki] Replace the wordmark on Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) (owner: 10Superpes15)
[21:15:09] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:905960|[mgwiki] Replace the wordmark on Vector 2022 (T334022)]]
[21:15:13] <stashbot>	 T334022: Word mark for Malagasy Wikipedia mobile site is in Guarani - https://phabricator.wikimedia.org/T334022
[21:16:22] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5005.eqsin.wmnet with OS bullseye
[21:16:28] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5005.eqsin.wmnet with OS bullseye
[21:16:34] <logmsgbot>	 !log cjming@deploy2002 superpes and cjming: Backport for [[gerrit:905960|[mgwiki] Replace the wordmark on Vector 2022 (T334022)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:16:53] <cjming>	 Superpes: can you test on a debug server?
[21:17:09] <Superpes>	 Sure! Looking :)
[21:17:51] <Superpes>	 It's fine cjming :)
[21:17:59] <cjming>	 cool - syncing
[21:19:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "I double-checked on every instance that uses this - noop everywhere - after taavi added Hiera keys for me in those projects. Now this is f" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn)
[21:21:08] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:21:28] <cjming>	 Superpes: once it's sync'd I believe I need to purge that file - not sure where to run purgeList from these days
[21:23:07] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:905960|[mgwiki] Replace the wordmark on Vector 2022 (T334022)]] (duration: 07m 58s)
[21:23:11] <stashbot>	 T334022: Word mark for Malagasy Wikipedia mobile site is in Guarani - https://phabricator.wikimedia.org/T334022
[21:23:57] <mutante>	 cjming: that would currently be mwmaint2002.codfw.wmnet (whatever the mwmaint.discovery.wmnet DNS entry points to)
[21:24:18] <cjming>	 mutante: thanks!
[21:24:21] <mutante>	 yw
[21:25:56] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}}
[21:26:20] <cjming>	 Superpes: your change should be live - i also purged the file so hopefully you see the new wordmark
[21:26:50] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Hi @volans - thanks for the details on S/N #7S5LMH3, 7S5MMH3, 7S5NMH3, 7S5PMH3, and 5BF90C3.  The first four were deleted in error, which @RobH just fixed...and...
[21:27:56] <Superpes>	 Oh wonderful! I confirm that I see it live :) Many thanks for your time cjming :)
[21:28:09] <cjming>	 ur welcome!
[21:28:33] <cjming>	 !log end of UTC late backport window
[21:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001"
[21:31:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001"
[21:31:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:41:47] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage
[21:41:53] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:45:22] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage
[21:51:56] <wikibugs>	 (03PS1) 10Eevans: sessionstore: make native transport (intentionally) unreachable [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954)
[21:52:32] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:52:35] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans)
[21:52:56] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:55:01] <wikibugs>	 SRE, LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (KFrancis) Hi @MarcoAurelio, please send your email address to kfrancis@wikimedia.org and I'll process this request.  Thanks!
[22:03:13] <wikibugs>	 (PS2) Eevans: sessionstore: make native transport (intentionally) unreachable [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954)
[22:03:33] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[22:04:18] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[22:05:45] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5005.eqsin.wmnet with OS bullseye
[22:05:52] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5005.eqsin.wmnet with OS bullseye completed: - lvs5005 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled...
[22:08:26] <wikibugs>	 (PS3) Eevans: sessionstore: make native transport (intentionally) unreachable [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954)
[22:12:17] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[22:12:42] <wikibugs>	 (CR) BBlack: [C: +1] "Looks right to me - moves listener from the normal port 9042 to 9043, making it unavailable to clients!" [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[22:14:16] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[22:17:16] <wikibugs>	 (CR) Eevans: [C: +2] sessionstore: make native transport (intentionally) unreachable [puppet] - https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[22:20:59] <urandom>	 !log restarting Cassandra on sessionstore1001 to apply (intentionally) unreachable native transport — T327954
[22:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:04] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[22:23:01] <urandom>	 bblack: oh, this is going to create a service alert (port 9042)
[22:23:40] <bblack>	 yeah, probably :)
[22:23:44] <bblack>	 can downtime it!
[22:24:22] <urandom>	 ha, got it before it paged anyway
[22:24:39] * urandom spikes the ball
[22:25:02] * brett writhes in pain on the ground
[22:33:30] <urandom>	 !log rebooting Cassandra on sessionstore1001 — T327954
[22:33:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:35] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[22:36:32] <topranks>	 !log enabling lsw1-e1-eqiad port et-0/0/51 to ssw1-e1-eqiad et-0/0/80 T322937
[22:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:36] <stashbot>	 T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937
[22:42:46] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:44:24] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:53:54] <topranks>	 ^^^ these alerts are related to an issue our transport provider is having on the path from eqiad to codfw.
[22:54:03] <topranks>	 emails to noc are about the same thing
[22:54:05] <topranks>	 currently link is up and stable for ~11min
[22:59:36] <wikibugs>	 SRE-Access-Requests, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (Dwisehaupt)
[23:02:40] <wikibugs>	 (CR) Cwhite: [C: +1] sre: mute etcd-mirror pint promql checks [alerts] - https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[23:03:17] <wikibugs>	 (CR) Cwhite: [C: +1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[23:33:13] <wikibugs>	 (PS5) Legoktm: Add <link rel="me"> to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[23:35:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by legoktm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm)
[23:36:30] <wikibugs>	 (Merged) jenkins-bot: Add <link rel="me"> to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm)
[23:36:56] <logmsgbot>	 !log legoktm@deploy2002 Started scap: Backport for [[gerrit:896837|Add <link rel="me"> to verify Mastodon account on mediawiki.org]]
[23:38:22] <logmsgbot>	 !log legoktm@deploy2002 legoktm: Backport for [[gerrit:896837|Add <link rel="me"> to verify Mastodon account on mediawiki.org]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[23:39:05] <legoktm>	 > <link rel="me" href="https://wikis.world/@mediawiki"/>
[23:44:43] <logmsgbot>	 !log legoktm@deploy2002 Finished scap: Backport for [[gerrit:896837|Add <link rel="me"> to verify Mastodon account on mediawiki.org]] (duration: 07m 47s)
[23:46:04] <legoktm>	 got the verified tick :D
[23:46:49] <wikibugs>	 (PS2) Legoktm: Remove misleading "disable" of Special:Mostlinkedcategories [mediawiki-config] - https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456)
[23:49:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by legoktm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) (owner: Legoktm)
[23:50:22] <wikibugs>	 (Merged) jenkins-bot: Remove misleading "disable" of Special:Mostlinkedcategories [mediawiki-config] - https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) (owner: Legoktm)
[23:50:46] <logmsgbot>	 !log legoktm@deploy2002 Started scap: Backport for [[gerrit:804805|Remove misleading "disable" of Special:Mostlinkedcategories (T310456)]]
[23:50:50] <stashbot>	 T310456: Re-enable daily updates of formerly slow enwiki QueryPages - https://phabricator.wikimedia.org/T310456
[23:52:08] <logmsgbot>	 !log legoktm@deploy2002 legoktm: Backport for [[gerrit:804805|Remove misleading "disable" of Special:Mostlinkedcategories (T310456)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[23:53:21] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (wiki_willy) Thanks for trying out the reimage solution @MatthewVernon.  It helps us progress things along further with the Dell support request.  The latest note from Dell is that...
[23:53:26] <icinga-wm_>	 PROBLEM - zuul_merger_service_running on contint2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[23:54:18] <icinga-wm_>	 PROBLEM - Check systemd state on contint2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_zuul-merger.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:55:50] <urandom>	 !log rebooting Cassandra on sessionstore1001 — T327954
[23:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:54] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[23:58:41] <logmsgbot>	 !log legoktm@deploy2002 Finished scap: Backport for [[gerrit:804805|Remove misleading "disable" of Special:Mostlinkedcategories (T310456)]] (duration: 07m 55s)
[23:58:45] <stashbot>	 T310456: Re-enable daily updates of formerly slow enwiki QueryPages - https://phabricator.wikimedia.org/T310456