[00:00:50] unlocked: *quadruple revert achievement* [00:01:32] nothing to hand-over, no alerts. going afk [00:06:23] (03PS4) 10Ladsgroup: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) [00:07:15] (03CR) 10Ladsgroup: [V: 03+1] "Tested as many ways as possible in mwdebug, I'm about to go to sleep, otherwise I would have deployed it now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [00:23:27] (03CR) 10Tim Starling: "I have no idea why you did this." [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon) [00:30:31] I mean, if you're going to revert my changes, it seems like it would be courteous to at least add me as a reviewer or add a comment to the change you're reverting [00:35:20] (03PS1) 10Legoktm: Remove possibly significant whitespace from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 [00:39:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905550 [00:39:28] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905550 (owner: 10TrainBranchBot) [00:40:53] (03PS2) 10Legoktm: Remove possibly significant whitespace from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) [00:51:58] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) There was no isolation or resolution of root causes, so we can expect the issue to recur peri... 
[00:58:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905550 (owner: 10TrainBranchBot) [01:03:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10ssingh) Apologies for the wrong commits attached to this; those were for T333456. @Milimetric @JAllemandou, sorry for the ping but per the above comment, this nee... [01:08:12] (03CR) 10BBlack: [C: 03+1] "LGTM, we'll see if it does anything useful :)" [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [01:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:38] PROBLEM - PHP opcache health on mw2430 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:27:36] (03PS1) 10Ssingh: admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) [01:34:03] (03PS2) 10Ssingh: admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) [01:41:01] (03PS1) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) [01:41:39] (03PS2) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) [01:50:00] (03PS3) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) [01:51:42] (03PS1) 10Andrew Bogott: OpenStack Designate: role back codfw1dev change to default policies [puppet] - 10https://gerrit.wikimedia.org/r/905770 (https://phabricator.wikimedia.org/T330759) [01:55:29] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Designate: role back codfw1dev change to default policies [puppet] - 10https://gerrit.wikimedia.org/r/905770 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [01:56:18] (03PS4) 10Nray: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:44] (03CR) 10RLazarus: [C: 03+1] admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh) [02:57:23] (03PS6) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [03:05:39] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) I would love to see the HTTP 
error response body. FileOperation logs show 502 errors, but the... [04:12:16] (03CR) 10Krinkle: [C: 03+1] "LGTM. Deploy any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) (owner: 10Legoktm) [04:17:22] !log restarted swift-proxy on ms-fe* T328872 [04:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:27] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [05:08:17] (03PS3) 10KartikMistry: Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952 [05:22:12] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:42] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [05:54:30] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T0600) [06:04:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh) [06:06:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10MoritzMuehlenhoff) >>! In T331899#8731595, @taavi wrote: >> To be able to access deplyed Wiki instances and ensure that wikibase (namely wikibas... 
[06:17:50] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:40] RECOVERY - PHP opcache health on mw2430 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:26:14] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (10ayounsi) [06:39:07] (03PS1) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905893 (https://phabricator.wikimedia.org/T333961) [06:39:38] (03CR) 10Marostegui: [C: 03+2] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905893 (https://phabricator.wikimedia.org/T333961) (owner: 10Marostegui) [06:41:16] RECOVERY - MariaDB Replica SQL: es4 on es1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:41:17] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) Restarting the proxy servers temporarily fixed it again. The restart caused a doubling of the... 
[06:41:22] RECOVERY - MariaDB read only es4 on es1022 is OK: Version 10.6.12-MariaDB-log, Uptime 42s, read_only: True, event_scheduler: True, 58.07 QPS, connection latency: 0.004521s, query latency: 0.000601s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [06:41:26] RECOVERY - MariaDB Replica IO: es4 on es1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:41:40] RECOVERY - mysqld processes on es1022 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [06:50:22] RECOVERY - MariaDB Replica Lag: es4 on es1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:56:39] (03CR) 10KartikMistry: [C: 03+2] Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952 (owner: 10KartikMistry) [06:57:28] (03Merged) 10jenkins-bot: Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952 (owner: 10KartikMistry) [06:58:34] I accidently merged change instead of scap backport a few minutes back :/ [06:59:57] kart_: scap backport can operate with manually merged changes just fine [07:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:35] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) [07:00:55] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) [07:01:35] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/905895 (https://phabricator.wikimedia.org/T333377) [07:01:57] taavi: good to know :) [07:01:57] It's time! [07:02:18] (03CR) 10Slyngshede: [C: 03+1] Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede) [07:02:24] (03CR) 10Slyngshede: [V: 03+2 C: 03+1] Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede) [07:02:26] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede) [07:03:19] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/905895 (https://phabricator.wikimedia.org/T333377) (owner: 10Marostegui) [07:03:45] !log Failover m3-master T333377 [07:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:50] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [07:03:58] (03CR) 10Slyngshede: "Looks good, that will allow me to re-enable the log shipping to logstash." 
[puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [07:04:02] !log kartik@deploy2002 Started scap: Backport for [[gerrit:904952|Remove akwiki from CX config]] [07:04:09] (03CR) 10Slyngshede: [C: 03+1] url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [07:04:19] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33 [07:04:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33 [07:05:21] !log kartik@deploy2002 kartik: Backport for [[gerrit:904952|Remove akwiki from CX config]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:08:22] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) [07:08:34] (03PS2) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) [07:08:46] (03CR) 10CI reject: [V: 04-1] P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [07:10:13] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) @jcrespo could you double check the backup-related hosts? Thanks! 
[07:10:45] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/905927 (https://phabricator.wikimedia.org/T333377) [07:11:19] !log Failover m5-master T333377 [07:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:23] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [07:11:25] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:904952|Remove akwiki from CX config]] (duration: 07m 22s) [07:11:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:11:43] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/905927 (https://phabricator.wikimedia.org/T333377) (owner: 10Marostegui) [07:12:36] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) [07:13:14] (03PS3) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) [07:13:22] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) m3-master and m5-master have been failed over. [07:15:33] I saw errors in scap backport. 
[07:15:57] `07:08:06 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2300.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw2289.codfw.wmnet', 'mw1486.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1398.eqiad.wmnet', 'mw2259.codfw.wmnet', 'mw1366.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1404.eqiad.wmnet'] (ran as mwdeploy@deploy1002.eqiad.wmnet) returned [1]: Aborting: Scap is disabled on this [07:15:58] host. If you really need to run Scap here, you can override by passing "-Dblock_execution:False" to the call` [07:16:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:16:57] (03PS3) 10Elukey: Upgrade kafka-main to use PKI TLS certificates for brokers [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T319372) [07:17:35] scap error I got: https://pastebin.com/YWxNsMRJ @Amir1 @urbanecm @taavi [07:18:27] (03PS1) 10Marostegui: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905928 (https://phabricator.wikimedia.org/T331381) [07:18:54] kart_: o/ did you deploy from deploy1002? [07:18:56] kart_: afaik we're still on codfw? Are you on the correct host? [07:19:01] if so please use 2002 [07:19:48] ^ [07:19:49] kart_: "Aborting: Scap is disabled on this host." 
[07:20:27] !log Stop mariadb on db1101 T331381 [07:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:32] T331381: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 [07:22:07] (03CR) 10Marostegui: [C: 03+2] db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/905928 (https://phabricator.wikimedia.org/T331381) (owner: 10Marostegui) [07:24:10] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1006.eqiad.wmnet with OS bullseye [07:24:15] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye [07:26:02] elukey: no. Used 2002. [07:27:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [07:27:28] elukey: `kartik@deploy2002:~$ scap backport 904952` [07:29:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:29:39] (03PS1) 10Marostegui: instances.yaml: Remove db1104 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/905930 (https://phabricator.wikimedia.org/T329481) [07:29:44] I use ssh to deployment.codfw.wmnet - which automatically points to current dc. Is that changed? 
:/ [07:30:06] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1104 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/905930 (https://phabricator.wikimedia.org/T329481) (owner: 10Marostegui) [07:30:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye [07:30:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye [07:30:43] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10akosiaris) Yes, we 'll have to depool codfw. [07:31:01] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) >>! In T334049#8757732, @Marostegui wrote: > @ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ? That's my under... [07:31:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1104 from dbctl T329481', diff saved to https://phabricator.wikimedia.org/P46035 and previous config saved to /var/cache/conftool/dbconfig/20230405-073102-marostegui.json [07:31:07] T329481: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 [07:31:29] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10MoritzMuehlenhoff) [07:31:54] kart_: deployment.codfw.wmnet should work fine [07:32:07] I'm confused as to `07:08:34 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild (ran as mwdeploy@deploy1002.eqiad.wmnet) returned [1]: Aborting: Scap is disabled on this host. 
If you really need to run Scap here, you can override by passing "-Dblock_execution:False" to the call ` [07:32:25] maybe scap deploys _to_ deploy1002 and fails to, because scap's disabled there? [07:33:22] sounds plausible, as scap pull complains as well [07:34:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:34:23] good morning [07:35:03] good morning hashar. any idea what to do with the above mentioned problem? :-) [07:35:31] this job is never ending, I haven't drink my coffee yet :D [07:35:59] the primary deployment server is `deploy2002.codfw.wmnet` for sure [07:36:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [07:36:28] hashar: apologies, and feel free to drink it before you help :)) [07:36:32] scap has a `sync-master` step which rsync to the other(s) deployment server which includes deploy1002.eqiad.wmnet [07:36:50] and should it continue to do so even when deploy1002's not primary? 
[07:36:52] kart_: didn't mean to upset you, from the logs it seemed as if you were deploying from 1002, apologies [07:36:57] that is, well, to keep the spare deployment server up-to-date in case we need to switch over or the primary magically disappears [07:37:27] one should not be able to deploy from the spare deploy1002.eqiad.wmnet [07:37:41] I can't remember how that is prevented, but a global lock sounds likely [07:37:42] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:37:57] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10MoritzMuehlenhoff) [07:38:03] upon connecting to the spare deploy1002.eqiad.wmnet, a message of the day should show up in the prompt stating "DO NOT USE THIS SERVER" [07:38:16] elukey: ah, no issue :) [07:38:45] so if you then deploy from the primary deploy2002.codfw.wmnet, I would expect it to be able to sync to the spare deploy1002 [07:39:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/0a5edcbbb78b72e749c411192ea4c8e6912dde4c ("scap: block Scap execution on inactive deployment hosts") was committed last night [07:39:17] and whatever got done in /srv/deployment or /srv/mediawiki-staging on deploy1002 will be erased/restored to the state of the primary deploy2002 [07:40:01] taavi: sounds like the cause to me. not sure if we should revert that patch or remove the lock temporarily. 
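[editor's note] The failure discussed above can be illustrated with a small sketch. It is not Scap's actual code: only the setting name `block_execution`, the `-Dblock_execution:False` override, and the abort text are taken from the error message quoted in this log. The function names, the override parser, and the assumption that the setting is off unless configured are hypothetical. The point it shows is that a per-host guard cannot tell a human deploying from the spare apart from the primary's own sync running `scap pull` on the spare, so both are refused.

```python
# Hypothetical sketch of a per-host "block execution" guard.
# Only the setting name, the -D override syntax and the abort text come from
# the log; every function and class name here is made up for illustration.

ABORT_MESSAGE = (
    "Aborting: Scap is disabled on this host. If you really need to run Scap "
    'here, you can override by passing "-Dblock_execution:False" to the call'
)


class ScapBlockedError(RuntimeError):
    """Raised when the guard refuses to run a command on this host."""


def parse_overrides(argv):
    """Collect -Dkey:value overrides from a command line (assumed syntax)."""
    overrides = {}
    for arg in argv:
        if arg.startswith("-D") and ":" in arg:
            key, value = arg[2:].split(":", 1)
            overrides[key] = value
    return overrides


def to_bool(value):
    """Interpret config values such as True, 'True' or 'False' as booleans."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


def ensure_allowed(host_config, argv):
    """Abort if this host blocks execution and no override was passed."""
    effective = dict(host_config)
    effective.update(parse_overrides(argv))
    # Assumption: the guard is off unless the host config turns it on.
    if to_bool(effective.get("block_execution", False)):
        raise ScapBlockedError(ABORT_MESSAGE)


if __name__ == "__main__":
    spare = {"block_execution": True}  # e.g. the inactive deployment host
    try:
        # The primary's sync step runs this on the spare and is refused too.
        ensure_allowed(spare, ["pull", "--no-php-restart"])
    except ScapBlockedError as err:
        print(err)
    # An explicit override lets the same command through.
    ensure_allowed(spare, ["pull", "-Dblock_execution:False"])
```

Under these assumptions, a guard keyed only on the host needs either an override passed by the sync step or a way to recognise calls that originate from the primary.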
[07:41:31] a revert seems probably the best thing, it is blocking deployments [07:42:01] (03PS1) 10Elukey: Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 [07:42:45] (03CR) 10Hashar: [C: 03+1] "jnuche: that broke scap cause we do a few scap operations on the spare deployment server for MediaWiki deployment notably `scap pull` or `" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (owner: 10Elukey) [07:42:49] elukey: +1 ed :) [07:42:53] well [07:42:57] should tag T330756 [07:42:57] T330756: Improve behavior around global Scap lock + communicate changes - https://phabricator.wikimedia.org/T330756 [07:43:17] (03PS2) 10Hashar: Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey) [07:43:19] amended [07:43:26] (03CR) 10Urbanecm: [C: 03+1] Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey) [07:43:32] (03CR) 10Hashar: "Amended to attach this change to T330756" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey) [07:43:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, 10Patch-For-Review: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) 05Open→03Resolved a:03ayounsi This has been rolled to all k8s clusters. 
[07:44:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40543/console" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey) [07:45:22] from the pcc it seems super safe, https://puppet-compiler.wmflabs.org/output/905741/40543/ [07:45:33] (03PS1) 10Marostegui: mariadb: Decommission db1104 [puppet] - 10https://gerrit.wikimedia.org/r/905932 (https://phabricator.wikimedia.org/T329481) [07:45:36] merging, thanks for the reviews [07:45:42] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jcrespo) [07:45:55] (03CR) 10Elukey: [C: 03+2] Revert "scap: block Scap execution on inactive deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey) [07:46:02] in /etc/scap/scap.cfg that should remove the block_execution setting yeah [07:46:14] I don't know what the default is [07:46:33] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jcrespo) >>! In T333377#8757686, @Marostegui wrote: > @jcrespo could you double check the backup-related hosts? Thanks! Documented- minor to no disruption. 
[07:46:35] taavi: I still don't get how you manage to find the root cause commits so fast :] [07:46:44] scap/config.py: "block_execution": (bool, False), [07:46:44] tests/scap/test_cli.py: cmd.config = {"block_execution": False} [07:46:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [07:47:00] taavi++ [07:47:09] ^^ [07:47:12] so I guess no blocking by default :-] jnuche will be able to follow up [07:47:28] ok running puppet on deploy1002 and 2002, kart_ gimme 2 mins and then you can retry [07:47:35] hashar: I just tend to lurk in this channel and have a good memory :-P so I saw the commit yesterday evening and the error message today and connected the dots [07:47:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1104.eqiad.wmnet [07:47:40] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10jcrespo) [07:49:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [07:49:49] (03CR) 10Jelto: [C: 03+2] gitlab: Fix listen_https typo [puppet] - 10https://gerrit.wikimedia.org/r/905653 (owner: 10BCornwall) [07:50:01] (03CR) 10Muehlenhoff: [C: 03+1] "That's a great cleanup! 
Happy to merge this myself, but would like to sort out a time when someone from the WMCS SREs is around just in ca" [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [07:50:16] * urbanecm recorded the issue + error message at T330756 [07:50:51] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1104 [puppet] - 10https://gerrit.wikimedia.org/r/905932 (https://phabricator.wikimedia.org/T329481) (owner: 10Marostegui) [07:51:21] kart_: green light [07:52:26] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:54:03] (03PS1) 10Majavah: P:wmcs::kubeadm: checker is a toolforge-specific feature [puppet] - 10https://gerrit.wikimedia.org/r/905933 [07:54:05] (03PS1) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) [07:54:35] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1104.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:54:42] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kafka-test1006.eqiad.wmnet with OS bullseye [07:54:47] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye executed with errors: - kafka-test1006 (**FAIL**) - Downtimed... 
[07:56:06] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 117.45 ms [07:56:26] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1006.eqiad.wmnet with OS bullseye [07:56:30] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:31] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye [07:57:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40544/console" [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah) [07:57:46] (03CR) 10Elukey: [C: 03+2] Upgrade kafka-main to use PKI TLS certificates for brokers [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [07:59:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1104.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:59:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:59:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1104.eqiad.wmnet [07:59:09] (03CR) 10MVernon: [V: 03+2 C: 03+2] Provision the revised Swift dashboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon) [07:59:45] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10Marostegui) This is ready for DC-Ops 
[08:00:03] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) 05Open→03Stalled Marking it as stalled until the cookbook is reviewed/merged. [08:00:04] hashar and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T0800). [08:00:13] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10Marostegui) a:05Marostegui→03None [08:00:16] 10ops-eqiad, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10Marostegui) [08:01:08] 10SRE, 10Infrastructure-Foundations, 10netops: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10ayounsi) a:03cmooney [08:02:01] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) [08:02:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1002.eqiad.wmnet with reason: restart kafka, switch to PKI [08:02:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1002.eqiad.wmnet with reason: restart kafka, switch to PKI [08:02:57] elukey: Oh, I was a bit away. Do I need to run backport again? [08:03:33] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) @jcrespo kindly check backup servers needs. Thanks [08:04:16] kart_: ah ok maybe not, but others will probably have better/more info [08:04:27] hashar: ^ [08:04:33] (I am fairly ignorant about scap) [08:05:06] Change seems deployed in akwiki, so it should be good IMHO. 
[08:06:30] ack super [08:06:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1006.eqiad.wmnet with reason: host reimage [08:07:11] !log restart kafka on kafka-main1002 to pick up the new TLS certificate (PKI based) - T319372 [08:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:14] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [08:09:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1006.eqiad.wmnet with reason: host reimage [08:11:13] (03PS26) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [08:11:29] kart_: I have no idea :-] [08:11:50] I guess we can check both deployment servers and a random mw app server to verify [08:13:39] deploy1002 still has akwiki => true [08:13:45] grep -A6 wgContentTranslationAsBetaFeature /srv/mediawiki-staging/wmf-config/InitialiseSettings.php [08:13:49] RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [08:14:38] same on mw1473 (randomly picked up host) [08:14:42] :/ [08:14:57] and that is the same for deploy1002 ( in /srv/mediawiki ) [08:15:02] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10dcausse) >>! In T330693#8756120, @Ottomata wrote: > Generally implementers wo... [08:15:02] so I guess we need to redeploy it [08:15:09] `scap sync-file` should do it. [08:15:09] (03CR) 10David Caro: "The PCC looks weird no? 
there's less nodes now:" [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah) [08:15:47] even mwdebug1001 still has akwiki [08:16:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Marostegui) Thank you Papaul, they look good! [08:16:24] kart_: want me to do the sync ? [08:17:37] hashar: go ahead. I guess, patch itself has not desired effect, but I'll followup on that. [08:17:52] We need to disable CX on closed Wikis. [08:18:13] +1 [08:18:13] (03CR) 10Majavah: [V: 03+1] P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah) [08:19:09] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10jcrespo) [08:19:16] sorry for the mess kart_ ! 
[08:19:27] I guess changes to scap config should require a verification [08:20:17] +100 [08:22:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 T326669', diff saved to https://phabricator.wikimedia.org/P46036 and previous config saved to /var/cache/conftool/dbconfig/20230405-082240-root.json [08:22:45] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:22:51] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905933 (owner: 10Majavah) [08:22:53] hashar: no issue :) [08:23:03] hashar: I'd argue that changes in general would require a verification :D [08:23:15] (03PS1) 10KartikMistry: Disable ContentTranslation for Closed Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905935 [08:23:56] (03CR) 10CI reject: [V: 04-1] Disable ContentTranslation for Closed Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905935 (owner: 10KartikMistry) [08:25:06] !log hashar@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Remove akwiki from CX config (take 2, it was not fully deployed due to a scap lock issue on the spare server) (duration: 06m 06s) [08:25:21] (03CR) 10Majavah: P:wmcs::kubeadm: checker is a toolforge-specific feature (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905933 (owner: 10Majavah) [08:25:42] (03PS1) 10Slyngshede: partman: allow partitions to take up the whole disk on no-swap. 
[puppet] - 10https://gerrit.wikimedia.org/r/905936 [08:26:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [08:27:12] (03CR) 10Slyngshede: "Follow up patch for addressing comments made on the merged patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/905160" [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede) [08:27:16] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1006.eqiad.wmnet with OS bullseye [08:27:21] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1006.eqiad.wmnet with OS bullseye completed: - kafka-test1006 (**PASS**) - Removed from Puppet a... [08:28:08] (03CR) 10David Caro: ceph: Allow setting a crush location hook for the rack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [08:28:18] (03Abandoned) 10KartikMistry: Disable ContentTranslation for Closed Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905935 (owner: 10KartikMistry) [08:28:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye [08:28:55] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye completed: - ms-be2067 (**PASS**) - Downtim... 
[08:31:13] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: add rsyslog-namespaced fields to syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/904597 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [08:32:00] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) https://www.juniper.net/documentation/us/en/software/junos/system-mgmt-monitoring/topics/ref/statement/enhanced-hash-key-e... [08:39:04] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1003.eqiad.wmnet,service=thanos-web [08:43:19] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) ...it took about 10 minutes for sdx to start producing errors in the kernel log: ` Apr 5 08:21:22 ms-be2067 kernel: [ 22.166159] Process accounting resumed Apr... [08:43:45] I am going to check the logs a bit then do group1 [08:45:03] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905938 (https://phabricator.wikimedia.org/T330209) [08:45:05] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905938 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:45:47] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905938 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:46:58] hashar: back from the doctor, sorry about the issue with the inactive deployment server [08:47:01] thanks for the revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/905741/ [08:50:39] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [08:51:24] that caused 
a real mess, really sorry about it :( [08:52:05] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Docker [08:52:17] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.3 refs T330209 [08:52:21] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209 [08:53:00] jnuche: as I get it the spare deployment server had a scap.cfg with `block_exection: true` which prevents scap 2 from deploying mediawiki cause we run a `scap pull` and a `scap rebuild-cdbs` on the spare server in order to populate /srv/mediawiki). [08:53:35] then I am not sure whether we need a full deploy of mediawiki on the deployment server, maybe that is needed to run mwscripts [08:53:51] anyway, that was an easy fix :-] [08:55:12] hashar: yes, the flag replaced another blocking mechanism we had, but apparently the old mechanism still allowed scap to run in some cases [08:55:19] also, I thought all the sync from primary to secondary master was done via rsync, apparently not [08:55:25] so I need to revisit [08:55:56] and by sheer coincidence the puppet change was finally merged last night and this morning I was at the doctor and not available [08:56:04] apologies again for the mess :( [08:56:49] (03PS1) 10Filippo Giunchedi: network: add LVS ranges for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) [08:57:01] I think the issue is that the config change was applied but not verified after deployment or surely we would have caught it by running a `scap sync-file` [08:57:29] anyway no worries, it was an easy find (well thanks to t.aavi) and an easy revert (thanks e.lukey) :] [08:58:04] !log hashar@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.3 refs T330209 (duration: 05m 46s) [08:58:06] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 
- https://phabricator.wikimedia.org/T333960 (10Peachey88) [08:58:08] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209 [08:58:09] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Peachey88) [08:59:15] (03CR) 10Jaime Nuche: "Really sorry about this affecting the deployments. Thanks for the revert." [puppet] - 10https://gerrit.wikimedia.org/r/905741 (https://phabricator.wikimedia.org/T330756) (owner: 10Elukey) [09:00:20] (03PS9) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [09:00:49] (03PS10) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [09:01:27] (03CR) 10Jcrespo: mediabackups: Add static console port for easier remote management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [09:02:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I would be great if you could confirm these updated recipes to be working as expected by reimaging two of the testvm* hosts (w" [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede) [09:03:59] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [09:05:25] (03CR) 10Ayounsi: "lgtm! thx for completing it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) (owner: 10Filippo Giunchedi) [09:05:29] (03CR) 10Ayounsi: [C: 03+1] network: add LVS ranges for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) (owner: 10Filippo Giunchedi) [09:07:18] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [09:09:19] (03CR) 10Slyngshede: [C: 03+2] partman: allow partitions to take up the whole disk on no-swap. [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede) [09:09:27] (03PS1) 10Clément Goubert: linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) [09:11:27] (03PS1) 10Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) [09:12:00] (03CR) 10Jaime Nuche: "Apparently `/var/lock/scap-global-lock` allowed some Scap commands to run. 
In particular `scap pull` and `scap cdb-rebuild` still need to " [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [09:12:55] (03PS1) 10Clément Goubert: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) [09:15:08] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [09:15:29] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [09:15:39] (03CR) 10Jbond: [C: 03+2] admin: hashar: some more git aliases [puppet] - 10https://gerrit.wikimedia.org/r/905715 (owner: 10Hashar) [09:15:45] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye [09:16:40] group1 wikis look fine [09:17:00] (03PS1) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) [09:18:14] (03CR) 10Muehlenhoff: Add an in place Debian upgrade script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [09:19:55] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10MoritzMuehlenhoff) >>! In T331706#8755038, @jhathaway wrote: >>>! In T331706#8753210, @Ladsgroup wrote: >> I'll try to take a look at the grants (it's a bit unusual... 
[09:21:01] (03PS1) 10Marostegui: mariadb: Promote db1125 to test-cluster master [puppet] - 10https://gerrit.wikimedia.org/r/905945 [09:22:20] (03PS1) 10Ayounsi: cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) [09:22:41] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [09:22:53] (03CR) 10CI reject: [V: 04-1] cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) (owner: 10Ayounsi) [09:23:23] (03CR) 10Jbond: exim: fix hard-coded vrts hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [09:23:55] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1125 to test-cluster master [puppet] - 10https://gerrit.wikimedia.org/r/905945 (owner: 10Marostegui) [09:25:23] (03PS1) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) [09:26:30] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [09:26:54] (03CR) 10Ayounsi: "Also adds `enhanced-hash-key` on drmrs switches for consistency with the routers." 
[homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) (owner: 10Ayounsi) [09:28:14] (03PS2) 10Ayounsi: cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) [09:29:23] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/905553 (https://phabricator.wikimedia.org/T334067) [09:29:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T334067 [09:29:50] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [09:29:54] T334067: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T334067 [09:30:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/905637 (owner: 10Jbond) [09:30:14] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1007.eqiad.wmnet [09:30:16] (03CR) 10Muehlenhoff: [C: 03+2] Also add component/pybal for pybaltest hosts [puppet] - 10https://gerrit.wikimedia.org/r/905543 (owner: 10Muehlenhoff) [09:30:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T334067 [09:31:52] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update Java images to OpenJDK 11.0.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905592 (owner: 10Muehlenhoff) [09:31:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1162 with weight 0 T334067', diff saved to https://phabricator.wikimedia.org/P46038 and previous config saved to /var/cache/conftool/dbconfig/20230405-093155-marostegui.json [09:32:03] (03CR) 10David Caro: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah) [09:32:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/905553 (https://phabricator.wikimedia.org/T334067) (owner: 10Gerrit maintenance bot) [09:33:31] (03PS2) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) [09:33:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) [09:34:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1003.eqiad.wmnet with reason: restart kafka, switch to PKI [09:34:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1003.eqiad.wmnet with reason: restart kafka, switch to PKI [09:34:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) [09:34:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1007.eqiad.wmnet [09:35:31] !log restart kafka on kafka-main1003 to pick up the new TLS certificate (PKI based) - T319372 [09:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:34] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [09:35:39] (03PS3) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) [09:36:23] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1007.eqiad.wmnet with OS bullseye [09:36:27] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage 
was started by elukey@cumin1001 for host kafka-test1007.eqiad.wmnet with OS bullseye [09:36:37] (03CR) 10Majavah: P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah) [09:42:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks okay to me; I can help with testing it in #wikimedia-operations if you like." [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [09:42:37] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [09:43:41] Lucas_WMDE: [09:43:49] 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey) [09:43:50] I'd like that :) [09:43:54] ok :) [09:44:10] It's very low traffic so if we can trigger requests to the backend, I'll take it :D [09:44:31] (sorry for the no-message ping, ssh had a hiccup :P) [09:44:36] my main issue is that I don’t remember which… uh, slot? idk – is targeted by test.wikidata.org [09:44:44] we have staging/eqiad/codfw [09:44:52] and then I think there’s another thing with two options? 
[09:45:03] and test.wikidata.org goes to some combination of them but I don’t remember which one [09:45:57] (and www.wikidata.org presumably goes to the least staging/test-y one ^^) [09:46:24] Ah yes wait [09:46:36] There's another reference to the mw api that isn't through envoy [09:46:39] 10 │ WIKIBASE_REPO_HOSTNAME_ALIAS: api-ro.discovery.wmnet [09:47:24] And there seem to be three values files, staging, test, and plain values.yaml [09:48:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:48:18] So in the staging environment there are two releases deployed, test and staging [09:48:39] And then in eqiad and codfw, just a production release [09:48:40] (03CR) 10David Caro: [C: 03+2] P:toolforge::k8s::etcd: load list of control nodes from PuppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905934 (https://phabricator.wikimedia.org/T274499) (owner: 10Majavah) [09:48:47] ok [09:49:06] so eqiad and codfw are probably for www.wikidata.org, depending on which dc is active [09:49:12] (03CR) 10Jbond: [C: 03+2] squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 (owner: 10Jbond) [09:49:14] So I'll change the WIKIBASE_REPO_HOSTNAME_ALIAS too maybe ? 
[09:49:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [09:49:22] (that's for test) [09:49:33] and testwikidatawiki goes to http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox if I found the right setting in IS.php [09:49:45] let me see [09:50:06] yeah probably change that too [09:50:30] production release has this calling itself I think [09:50:32] I assume that means “connect to DNS api-ro.discovery but send HTTP Host: test.wikidata.org” [09:50:32] WIKIBASE_REPO: http://www.wikidata.org:6500/w [09:50:34] WIKIBASE_REPO_HOSTNAME_ALIAS: localhost [09:50:51] yeah, that’s some proxy running on the same system I think [09:51:00] Yeah [09:51:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] url_downloader: switch squid logs to hourly rotation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [09:51:38] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1007.eqiad.wmnet with reason: host reimage [09:51:42] (03CR) 10Hashar: gerrit: replace Icinga with Prometheus monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [09:51:43] We'll change the test values and deploy staging and test first and see [09:52:20] (03PS2) 10Clément Goubert: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) [09:52:40] sounds good [09:52:44] back in a minute [09:52:46] I know how to trigger requests at least [09:52:59] and if you say the request volume is low, I assume you can also see that the requests were triggered successfully [09:54:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1007.eqiad.wmnet with 
reason: host reimage [09:55:38] !log Starting s2 eqiad failover from db1122 to db1162 - T334067 [09:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:42] T334067: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T334067 [09:56:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1162 to s2 primary T334067', diff saved to https://phabricator.wikimedia.org/P46039 and previous config saved to /var/cache/conftool/dbconfig/20230405-095600-root.json [09:56:39] Lucas_WMDE: I'm basing myself on https://grafana.wikimedia.org/goto/AO_QnKYVk?orgId=1 [09:57:54] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye [09:58:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [09:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122', diff saved to https://phabricator.wikimedia.org/P46040 and previous config saved to /var/cache/conftool/dbconfig/20230405-095954-marostegui.json [10:00:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46041 and previous config saved to /var/cache/conftool/dbconfig/20230405-100003-root.json [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1000) [10:00:55] Lucas_WMDE: So if I change to this https://grafana.wikimedia.org/goto/Gdyg4KLVk?orgId=1 I should see the requests go through to mw-api-int [10:01:13] ok [10:01:18] sounds good [10:01:39] Ok, merging and deploying staging and test then ? [10:01:41] it sounds like www and test wikidata might go through different paths to the API anyways? 
[10:01:47] yeah I think you can go ahead [10:02:02] even if real wikidata has unexpected issues that we don’t catch via test wikidata, it shouldn’t be a huge problem [10:02:12] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 7th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 [10:02:18] jouncebot: now [10:02:18] For the next 0 hour(s) and 57 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1000) [10:02:22] Yeah I'm a bit confused about the WIKIBASE_REPO setting [10:02:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] openstack::nutcracker: Remove redis support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [10:02:58] (03PS1) 10Marostegui: mariadb: Productionize db1207 [puppet] - 10https://gerrit.wikimedia.org/r/905951 (https://phabricator.wikimedia.org/T326669) [10:03:03] I *think* it means it calls the local envoy proxy on port 6500 [10:03:33] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 7th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 [10:03:36] Which means it'd call mw-api-int, because I used that same port for the new mw-api-int-async listener [10:04:22] (03CR) 10Clément Goubert: [C: 03+2] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [10:05:37] (03CR) 10Volans: "Nice addition! I'll leave it to your team for the actual logic, I did a pass for general cookbook's related stuff." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [10:05:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] network: add LVS ranges for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/905939 (https://phabricator.wikimedia.org/T333949) (owner: 10Filippo Giunchedi) [10:06:19] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1207 [puppet] - 10https://gerrit.wikimedia.org/r/905951 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:06:24] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [10:06:30] ok [10:07:32] akosiaris: thank you for the merge! FWIW I got a PCC running here as I wasn't sure about the exact implications https://puppet-compiler.wmflabs.org/output/905939/40546/ [10:07:33] so it calls mw-api-int in mw-on-k8s, but the older api on non-k8s deployments, because they have different things running on port 6500? [10:08:13] though should be safe AFAICS [10:08:18] (03PS1) 10Marostegui: mariadb: db1207 remove from insetup [puppet] - 10https://gerrit.wikimedia.org/r/905952 (https://phabricator.wikimedia.org/T326669) [10:08:32] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:08:45] (03CR) 10Marostegui: [C: 03+2] mariadb: db1207 remove from insetup [puppet] - 10https://gerrit.wikimedia.org/r/905952 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:09:07] (03PS1) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [10:09:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] thumbor: increase memory quota, 
per-container memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [10:09:13] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [10:09:31] (03PS1) 10Marostegui: Revert "Add new db nodes to site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/905745 [10:09:41] Lucas_WMDE: It'll call the defined listener for port 6500 yes (which is what I changed in the values files) https://gerrit.wikimedia.org/r/c/operations/puppet/+/903595/ [10:09:44] (03PS2) 10Marostegui: Revert "Add new db nodes to site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/905745 [10:10:05] (03Merged) 10jenkins-bot: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905944 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [10:10:11] ok, nice [10:10:16] (03CR) 10Marostegui: [C: 03+2] Revert "Add new db nodes to site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/905745 (owner: 10Marostegui) [10:10:38] volans: FYI drmrs blackbox probes are now live as per T333949, in case they mis-page [10:10:39] T333949: service::catalog probes are not deployed in drmrs - https://phabricator.wikimedia.org/T333949 [10:11:06] godog: ack [10:11:27] godog: prego. In reality, nothing. Just adding a couple of more ferm macros (unused) an adding those nets to 2 used macros [10:11:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1007.eqiad.wmnet with OS bullseye [10:11:43] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1007.eqiad.wmnet with OS bullseye completed: - kafka-test1007 (**PASS**) - Downtimed on Icinga/A... 
[10:11:47] there will be ferm restarts across the fleet, but that's going to be ok [10:12:00] at most we will discover something weird with the firewall of some host due to an alert. [10:12:02] Hmm there's a big networkpolicy change that wasn't in CI [10:12:06] I need an adult :p [10:12:12] there are none :P [10:12:12] akosiaris: sweet! thank you that's informative [10:12:21] No adults? :( [10:12:26] in the helmfile diff? [10:12:31] Lucas_WMDE: yeah [10:12:39] * Lucas_WMDE has also been confused by extra diffs in the past [10:12:42] !log restart purged on cp6015 to verify if connection to brokers failed are only temporary or not [10:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:56] claime: what's the diff? [10:13:20] aka, how do I see it ? deploy2002 ; helmfile -e diff --context=5 ? [10:13:25] what is what here? [10:14:06] akosiaris: deploy2002, cd /srv/deployment-charts/helmfile.d/services/termbox; helmfile -e staging -l name=staging diff --context=5 [10:14:41] Or I made a phaste https://phabricator.wikimedia.org/P46042 [10:14:44] !log restart purged on cp5032, cp1082, cp6004, cp1090 - errors after restart of kafka main eqiad brokers [10:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:49] (03CR) 10Hnowlan: [C: 03+2] thumbor: increase memory quota, per-container memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [10:15:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46043 and previous config saved to /var/cache/conftool/dbconfig/20230405-101507-root.json [10:15:26] There's also a chart version bump that apparently wasn't deployed [10:16:30] claime: that's where it comes from, I think [10:16:30] IMO it's either the removal of default-network-policy-conf.yaml, or the upgrade to mesh 1.1 that wasn't deployed 
[10:16:41] it's the removal of default-network-policy [10:16:46] go ahead, I reviewed the diff [10:16:49] ack [10:16:49] it's going to be fine [10:17:14] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [10:17:52] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [10:18:07] Lucas_WMDE: test and staging releases updates [10:18:11] updated* [10:18:15] ok [10:19:09] (03CR) 10Majavah: [C: 04-1] kubernetes: set NO_HOME for bulidservice (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [10:19:33] (03Merged) 10jenkins-bot: thumbor: increase memory quota, per-container memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [10:19:59] (03PS1) 10Elukey: istio: upgrade to upstream version 1.17.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) [10:20:03] claime: hmm, without JS I don’t see a termbox at https://test.m.wikidata.org/wiki/Q229877 [10:20:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:20:17] let’s see if I can find any errors in logstash… [10:20:25] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [10:20:27] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/905554 (https://phabricator.wikimedia.org/T334077) [10:20:29] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/istio/ ==" [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [10:22:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T326669', diff saved to https://phabricator.wikimedia.org/P46044 and previous config saved to /var/cache/conftool/dbconfig/20230405-102215-marostegui.json [10:22:20] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [10:23:01] claime: Failed to connect to termbox-test.staging.svc.eqiad.wmnet port 3031: Connection timed out [10:23:08] logstash _id eHnvUIcBtuN2AbPY_giz [10:23:31] ok let me check the releases [10:24:54] (03PS1) 10Marostegui: mariadb: Make db1179 candidate for x1 [puppet] - 10https://gerrit.wikimedia.org/r/905957 [10:24:59] Container is started and is supposed to be listening on 3031 [10:25:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:07] curl http://termbox-test.staging.svc.eqiad.wmnet:3031 [10:25:07] [10:25:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Make db1179 candidate for x1 [puppet] - 10https://gerrit.wikimedia.org/r/905957 (owner: 10Marostegui) [10:26:00] Same from a random mediawiki server [10:26:14] hm, “cannot GET /” might just be because it’s not the right URL [10:26:16] let me dig up a proper one [10:26:30] Yeah, but it means it is listening on 3031 [10:26:31] (03PS1) 10Slyngshede: partman: test updated flat-noswap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/905958 [10:26:38] yeah [10:27:19] (03PS1) 10Elukey: Add upstream release 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/905959 (https://phabricator.wikimedia.org/T334068) [10:27:52] (03CR) 10Jbond: [C: 03+1] "lgtm not tested" [puppet] - 10https://gerrit.wikimedia.org/r/902808 
(https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [10:27:57] http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en gives me a 500 Internal Server Error [10:28:16] after 3.something seconds [10:28:22] (03CR) 10Slyngshede: "Not really sure if this is the correct way to test the flat-noswap, no other hosts seems to use the noswap only." [puppet] - 10https://gerrit.wikimedia.org/r/905958 (owner: 10Slyngshede) [10:28:22] I could imagine the timeout being configured shorter than that [10:28:34] msg [10:28:36] timeout of 3000ms exceeded [10:28:43] (a bunch of them in termbox logstash) [10:29:21] so that’s the termbox service itself waiting for something else and timing out after 3 seconds? [10:29:25] (waiting for mediawiki, probably) [10:29:32] (03Abandoned) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/905554 (https://phabricator.wikimedia.org/T334077) (owner: 10Gerrit maintenance bot) [10:30:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46046 and previous config saved to /var/cache/conftool/dbconfig/20230405-103012-root.json [10:30:20] (03CR) 10Daniel Kinzler: "Thank you for finding this, Lucas!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) (owner: 10Lucas Werkmeister (WMDE)) [10:31:05] ok I found the logstash [10:31:08] not a lot of information in there :/ [10:31:18] other than the message you posted [10:32:35] yeah pretty sure this is a timeout from termbox trying to talk to mediawiki [10:34:42] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable add link frontend and backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T304551) [10:35:21] Lucas_WMDE: I'm trying to find out if it's failing in termbox, or in envoy [10:37:24] I tried to kubectl exec bash into the pod but apparently I’m not allowed. maybe that’s for the better ^^ [10:38:18] I can do that by going to the actual node, and docker exec in the container [10:38:26] heh [10:38:41] But there's not much in terms of tools (which is normal, tbf) [10:38:59] ok [10:39:56] the pod has a HEALTHCHECK_QUERY in its environment, which looks similar to the URL I used above, but I don’t remember what that’s used for (it’s different from the k8s liveness and readiness probes, at least) [10:40:32] (03CR) 10Muehlenhoff: partman: test updated flat-noswap.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905958 (owner: 10Slyngshede) [10:40:47] !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:40:53] (03PS1) 10Jcrespo: Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/905966 [10:41:04] (03CR) 10CI reject: [V: 04-1] Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/905966 (owner: 10Jcrespo) [10:41:16] (03CR) 10Jbond: "lgtm small nit/q inline" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [10:41:35] !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[10:42:15] Lucas_WMDE: Hmm there's a big difference between the test deployment and the staging deployment [10:42:29] staging has a tls-proxy (so envoy) [10:42:32] test doesn't [10:42:42] (03PS2) 10Jcrespo: Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/905966 [10:42:50] (03PS2) 10Slyngshede: partman: test updated flat-noswap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/905958 [10:43:02] hm [10:43:13] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:43:14] test has mesh_enabled: false [10:43:18] (03PS2) 10Marostegui: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/894571 (https://phabricator.wikimedia.org/T331302) (owner: 10Gerrit maintenance bot) [10:43:22] but we’re still trying to talk to mw-api-int.discovery.wmnet over TLS? [10:43:23] (03CR) 10Slyngshede: partman: test updated flat-noswap.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905958 (owner: 10Slyngshede) [10:43:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T331302 [10:43:30] T331302: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T331302 [10:43:31] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:44:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T331302 [10:44:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905958 (owner: 10Slyngshede) [10:44:19] (03CR) 10Slyngshede: [C: 03+2] partman: test updated flat-noswap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/905958 (owner: 10Slyngshede) [10:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1130 with weight 0 T331302', diff saved to https://phabricator.wikimedia.org/P46047 and previous config saved to /var/cache/conftool/dbconfig/20230405-104422-marostegui.json [10:44:41] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10Ladsgroup) Yeah, I was about to say from the application point of view, the more the better, like why not 400? But I don't know the limitations the infra so I can't... [10:44:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/894571 (https://phabricator.wikimedia.org/T331302) (owner: 10Gerrit maintenance bot) [10:45:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46048 and previous config saved to /var/cache/conftool/dbconfig/20230405-104517-root.json [10:45:18] (03CR) 10Ladsgroup: [C: 03+1] "Thanks Jaime, Do you want me to deploy?" 
[puppet] - 10https://gerrit.wikimedia.org/r/905966 (owner: 10Jcrespo) [10:46:56] (03PS2) 10Hnowlan: admin: update platform engineering approvers [puppet] - 10https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244) [10:47:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46049 and previous config saved to /var/cache/conftool/dbconfig/20230405-104732-root.json [10:47:55] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye [10:48:32] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:48:48] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:49:17] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:50:06] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:50:32] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:50:39] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:50:56] claime: I’ll probably be away or mostly unresponsive for the next two hours, sorry [10:51:02] Lucas_WMDE: Do you have a way to test staging ? [10:51:08] rather than test ? [10:51:16] not sure [10:51:44] hm, termbox-staging.staging.svc.eqiad.wmnet isn’t a real host apparently [10:52:48] (03PS4) 10Sergio Gimeno: GrowthExperiments: enable add link frontend (7th) and backend (8,9th) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T304551) [10:52:51] Because the production deployment is actually using the tls proxy [10:53:04] Anyways, I'll revert the change [10:53:14] We'll see when you get back, or later, it's not urgent [10:53:18] ok [10:53:28] (03CR) 10Jbond: partman: allow partitions to take up the whole disk on no-swap. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905936 (owner: 10Slyngshede) [10:53:45] I can tell you how to end-to-end test it from the wiki, at least [10:53:47] (03PS1) 10Clément Goubert: Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905967 [10:53:54] (03CR) 10Clément Goubert: [C: 03+2] Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905967 (owner: 10Clément Goubert) [10:54:13] create a new item on test wikidata (Special:NewItem), load it on the mobile domain, check if you see anything above “statements” when javascript is not enabled [10:54:39] on real wikidata, creating a test item would be frowned upon ;) but you could get mostly the same effect by loading random items on the mobile site [10:54:51] (assuming that they won’t have a cached termbox already, this should still test the server-side rendering) [10:55:09] but I’m not so sure how to test the individual parts internally [10:56:24] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [10:56:30] jouncebot: nowandnext [10:56:31] For the next 0 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1000) [10:56:31] In 2 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1300) [10:56:38] Shall I deploy a patch? [10:56:45] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:57:05] * Lucas_WMDE afk [10:58:22] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) I'm inclined to mark this as decline... 
[10:58:31] q [10:59:18] I can press F to pay respects if needed [10:59:22] (03Merged) 10jenkins-bot: Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905967 (owner: 10Clément Goubert) [10:59:38] Amir1: SSH lockups make me type strange things. [10:59:46] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:59:46] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [10:59:47] Amir1: You can go ahead for my part [10:59:57] noted, merci [11:00:02] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [11:00:16] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [11:00:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46050 and previous config saved to /var/cache/conftool/dbconfig/20230405-110022-root.json [11:00:50] (03CR) 10Ladsgroup: "💔" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905967 (owner: 10Clément Goubert) [11:02:00] Oh I think I may know what's happening though [11:02:20] the old api-ro listens on 443, so it doesn't need a port specified [11:02:30] :lightbulb: [11:02:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P46051 and previous config saved to /var/cache/conftool/dbconfig/20230405-110237-root.json [11:04:53] (03PS2) 10Phuedx: VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) [11:05:11] !log Starting s5 eqiad failover from db1100 to db1130 - T331302 [11:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:15] T331302: Switchover s5 master (db1100 -> db1130) - 
https://phabricator.wikimedia.org/T331302 [11:05:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1130 to s5 primary T331302', diff saved to https://phabricator.wikimedia.org/P46052 and previous config saved to /var/cache/conftool/dbconfig/20230405-110530-root.json [11:06:00] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) @wiki_willy that's because all those have `N/A` in the Accounting tab of the spreadsheet in the `Asset tag` column and so they don't match. [11:06:17] (03CR) 10Ladsgroup: [V: 03+1 C: 03+2] Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [11:07:10] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [11:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1100 with 1% weight', diff saved to https://phabricator.wikimedia.org/P46053 and previous config saved to /var/cache/conftool/dbconfig/20230405-110717-root.json [11:07:50] ugh, need to restart my pc [11:09:12] (03CR) 10David Caro: P:ldap::client: split config and utils to a separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:09:54] Amir1: you need to stop using Windows [11:10:02] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:11:17] marostegui: I feel that Windows can be used since WSL became a part of it. 
[11:11:55] (03PS1) 10Superpes15: [mgwiki] Replace the wordmark on Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) [11:12:14] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [11:12:24] !log installing systemd security updates on buster [11:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:36] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:12:37] marostegui: it's better than using Mac [11:12:38] urbanecm: XDD [11:12:59] but I respect everyone's flaws [11:13:02] (03PS1) 10Slyngshede: Revert "partman: test updated flat-noswap.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/905969 [11:13:04] Amir1: and Emacs? [11:13:12] (03PS2) 10Majavah: hieradata: remove unused keys from labsdnsconfig [puppet] - 10https://gerrit.wikimedia.org/r/903258 [11:13:27] nope that is not respectable :P [11:14:41] (03CR) 10Slyngshede: [C: 03+2] Revert "partman: test updated flat-noswap.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/905969 (owner: 10Slyngshede) [11:14:57] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:15:01] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:905609|Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" (T326800)]] [11:15:05] T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800 [11:15:11] (03CR) 10MVernon: "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [11:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46054 and previous config saved to /var/cache/conftool/dbconfig/20230405-111527-root.json [11:15:30] (03PS1) 10Marostegui: mariadb: Add db1220 
to x1 [puppet] - 10https://gerrit.wikimedia.org/r/905962 (https://phabricator.wikimedia.org/T326669) [11:16:01] (03CR) 10Majavah: [V: 03+1] P:ldap::client: split config and utils to a separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:16:11] (03PS1) 10Clément Goubert: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) [11:16:27] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:905609|Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" (T326800)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [11:16:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet-enc: added some tests for the api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875398 (owner: 10David Caro) [11:17:04] > Revert "Revert "Revert "Revert "mwscript: Switch to use run.php [11:17:22] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:17:24] 10/10, no notes. 
[11:17:26] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P46055 and previous config saved to /var/cache/conftool/dbconfig/20230405-111742-root.json [11:17:44] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:18:11] (03PS2) 10Clément Goubert: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) [11:19:34] :P [11:19:36] (03PS2) 10Clément Goubert: linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) [11:19:52] (03PS2) 10Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) [11:20:14] (03PS2) 10Clément Goubert: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) [11:20:38] (03PS2) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) [11:21:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet-enc: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 (owner: 10David Caro) [11:22:08] (03CR) 10Hnowlan: [C: 03+1] api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [11:22:33] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host testvm2004.codfw.wmnet with OS bullseye [11:22:41] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1100 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P46056 and previous config saved to /var/cache/conftool/dbconfig/20230405-112240-root.json [11:22:53] Amir1: can you ping me when you're done? I just want to get `905764: Remove possibly significant whitespace from robots.txt | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905764` out [11:23:08] sure [11:23:18] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:23:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Add db1220 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/905962 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [11:23:47] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:905609|Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" (T326800)]] (duration: 08m 45s) [11:23:51] T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800 [11:24:18] (03CR) 10Jcrespo: "No, thank you, I need to make sure I finish the transfer and setup first" [puppet] - 10https://gerrit.wikimedia.org/r/905966 (owner: 10Jcrespo) [11:24:56] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:25:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875866 (owner: 10David Caro) [11:25:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875825 (owner: 10David Caro) [11:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:27:30] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average 
number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10jcrespo) > I'm inclined to mark this as decline... [11:28:03] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:28:29] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:29:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mw1414.eqiad.wmnet [11:30:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46057 and previous config saved to /var/cache/conftool/dbconfig/20230405-113031-root.json [11:30:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46058 and previous config saved to /var/cache/conftool/dbconfig/20230405-113052-root.json [11:31:00] (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:31:31] (03CR) 10Jbond: [C: 03+1] P:ldap::client: split config and utils to a separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:31:31] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage [11:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:31:43] TheresNoTime: done now [11:31:48] ty! 
[11:32:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) (owner: 10Legoktm) [11:32:03] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/905986 [11:32:42] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10hnowlan) [11:32:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P46059 and previous config saved to /var/cache/conftool/dbconfig/20230405-113246-root.json [11:32:52] (03Merged) 10jenkins-bot: Remove possibly significant whitespace from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905764 (https://phabricator.wikimedia.org/T334038) (owner: 10Legoktm) [11:33:11] !log samtar@deploy2002 Started scap: Backport for [[gerrit:905764|Remove possibly significant whitespace from robots.txt (T334038)]] [11:33:15] T334038: Excess whitespace in English Wikipedia robots.txt file could cause problems in some implementations - https://phabricator.wikimedia.org/T334038 [11:33:38] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10hnowlan) [11:34:19] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage [11:34:42] !log samtar@deploy2002 legoktm and samtar: Backport for [[gerrit:905764|Remove possibly significant whitespace from robots.txt (T334038)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [11:34:45] (testing) [11:35:14] (syncing) [11:35:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
mw1414.eqiad.wmnet [11:37:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P46060 and previous config saved to /var/cache/conftool/dbconfig/20230405-113745-root.json [11:37:54] (03CR) 10Jcrespo: [C: 03+2] Revert "monitoring: Disable notifications for db1150 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/905966 (owner: 10Jcrespo) [11:38:24] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/905986 (owner: 10Muehlenhoff) [11:38:48] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:40:25] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:905764|Remove possibly significant whitespace from robots.txt (T334038)]] (duration: 07m 14s) [11:40:29] T334038: Excess whitespace in English Wikipedia robots.txt file could cause problems in some implementations - https://phabricator.wikimedia.org/T334038 [11:41:00] (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:41:09] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8724070, @MoritzMuehlenhoff wrote: > Looking at https://cassandra.apache.org/doc/latest/cassandra/getting_started/java11.html we should probably also continue to... [11:43:42] hm, how does one clear what I assume is a cached robots.txt? 
(for T334038) [11:44:08] purgeList.php would work here too I assume [11:44:16] ah, makes sense, thank you [11:44:59] yup :) [11:45:15] !log `[samtar@mwmaint2002 ~]$ echo 'https://en.wikipedia.org/robots.txt' | mwscript purgeList.php` T334038 [11:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46061 and previous config saved to /var/cache/conftool/dbconfig/20230405-114557-root.json [11:47:35] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2004.codfw.wmnet with OS bullseye [11:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46062 and previous config saved to /var/cache/conftool/dbconfig/20230405-114751-root.json [11:52:16] (03CR) 10Jaime Nuche: [C: 03+1] "Fix LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [11:52:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P46063 and previous config saved to /var/cache/conftool/dbconfig/20230405-115249-root.json [11:53:59] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:54:45] !log installing apache2 security updates on buster [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46064 and previous config saved to /var/cache/conftool/dbconfig/20230405-120101-root.json [12:02:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46065 and previous 
config saved to /var/cache/conftool/dbconfig/20230405-120256-root.json [12:04:21] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:06:16] (03PS3) 10Samtar: Remove WikiEditor's Realtime Preview config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson) [12:06:29] jouncebot: nowandnext [12:06:29] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [12:06:29] In 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1300) [12:07:06] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ayounsi) a:03ayounsi [12:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46066 and previous config saved to /var/cache/conftool/dbconfig/20230405-120754-root.json [12:09:26] o/ I intend to deploy `901553: Remove WikiEditor's Realtime Preview config vars | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/901553` — any reason not to? [12:09:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10ayounsi) a:03ayounsi Taking that task, even if the current CR does the job, it could be refactored with @cmooney work to remove the duplicated co... 
[12:10:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson) [12:11:26] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [12:12:00] (03Merged) 10jenkins-bot: Remove WikiEditor's Realtime Preview config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson) [12:12:26] !log samtar@deploy2002 Started scap: Backport for [[gerrit:901553|Remove WikiEditor's Realtime Preview config vars (T327515)]] [12:12:30] T327515: Remove Realtime Preview's Beta Feature and Onboarding UI - https://phabricator.wikimedia.org/T327515 [12:13:51] !log samtar@deploy2002 samwilson and samtar: Backport for [[gerrit:901553|Remove WikiEditor's Realtime Preview config vars (T327515)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [12:13:52] (testing) [12:14:42] (syncing) [12:14:53] 10ops-eqiad, 10decommission-hardware: decommission frdata1001.frack.eqiad.wmnet (WMF7292) - https://phabricator.wikimedia.org/T333971 (10Jgreen) [12:15:09] (03CR) 10Vgutierrez: [C: 03+1] "looks good, LGTM assuming that varnish tests are still happy (OoO today and I cannot run the tests)" [puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [12:16:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46067 and previous config saved to /var/cache/conftool/dbconfig/20230405-121606-root.json [12:18:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46068 and 
previous config saved to /var/cache/conftool/dbconfig/20230405-121801-root.json [12:18:36] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) 05Open→03Declined Yeah, it was t... [12:20:08] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901553|Remove WikiEditor's Realtime Preview config vars (T327515)]] (duration: 07m 41s) [12:20:12] T327515: Remove Realtime Preview's Beta Feature and Onboarding UI - https://phabricator.wikimedia.org/T327515 [12:20:19] (03CR) 10Ayounsi: [C: 03+2] "0 tests failed, 0 tests skipped, 17 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [12:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46069 and previous config saved to /var/cache/conftool/dbconfig/20230405-122259-root.json [12:23:31] (03CR) 10Jelto: "I'm excited about the new cookbook! I left some gitlab-specific comments in-line. I'm happy to take another look at future patchsets." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:25:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:27:16] !log installing xapian-core security updates [12:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:00] 10SRE, 10Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845 (10ayounsi) 05Open→03Resolved a:03ayounsi This is completed in drmrs, the same will be applied to the other sites when we bring L3 on the ToR switches as I don't think... [12:28:41] (03PS1) 10Filippo Giunchedi: hieradata: move varnishkafka-exporter stats to counters [puppet] - 10https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) [12:31:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46070 and previous config saved to /var/cache/conftool/dbconfig/20230405-123111-root.json [12:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46071 and previous config saved to /var/cache/conftool/dbconfig/20230405-123305-root.json [12:38:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46072 and previous config saved to /var/cache/conftool/dbconfig/20230405-123804-root.json [12:46:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46073 and previous config saved to 
/var/cache/conftool/dbconfig/20230405-124616-root.json [12:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46074 and previous config saved to /var/cache/conftool/dbconfig/20230405-124810-root.json [12:52:36] (03PS1) 10Muehlenhoff: debdeploy-revdeps: Omit Breaks, Enhances, Conflicts, Replaces [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/906009 [12:53:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46075 and previous config saved to /var/cache/conftool/dbconfig/20230405-125308-root.json [12:55:15] (03CR) 10Ayounsi: [C: 03+2] Allow different port than default 22 (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:56:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ottomata) @BTullis gave @Jgiannelos sql_role perms in T328457#8734396 I think we can close this. [12:57:19] (03Merged) 10jenkins-bot: Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:58:32] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1008.eqiad.wmnet [12:59:27] (03PS1) 10Filippo Giunchedi: sre: mute etcd-mirror pint promql checks [alerts] - 10https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1300) [13:00:05] sergi0 and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] hi [13:00:13] o/ [13:00:29] o/ busy but probably available in 5mins or so [13:00:45] (03PS6) 10Sergio Gimeno: GrowthExperiments: enable add link backend in wiki rounds (8,9th) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T308133) [13:01:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46076 and previous config saved to /var/cache/conftool/dbconfig/20230405-130121-root.json [13:03:04] (03PS3) 10Clément Goubert: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) [13:03:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46077 and previous config saved to /var/cache/conftool/dbconfig/20230405-130315-root.json [13:03:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1008.eqiad.wmnet [13:04:51] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Aklapper) [13:04:52] ok, I can deploy! [13:04:54] * Lucas_WMDE looks [13:05:08] (03PS1) 10Stevemunene: Decommission an-worker1132 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092) [13:06:04] Lucas_WMDE: we won't be able to test much from my patch since the flag is only read on a maintenance script triggered by a periodic job. Should be safe though, we've been using it for a while. 
[13:06:15] ok [13:07:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno) [13:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46078 and previous config saved to /var/cache/conftool/dbconfig/20230405-130813-root.json [13:08:35] (03Merged) 10jenkins-bot: GrowthExperiments: enable add link backend in wiki rounds (8,9th) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905950 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno) [13:08:58] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905950|GrowthExperiments: enable add link backend in wiki rounds (8,9th) (T308133 T308134)]] [13:09:03] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [13:09:03] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [13:10:25] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and sgimeno: Backport for [[gerrit:905950|GrowthExperiments: enable add link backend in wiki rounds (8,9th) (T308133 T308134)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:11:15] * Lucas_WMDE quickly checks that fiwiki isn’t totally broken [13:11:27] heh, of course they have the NATO logo in the recent news section ^^ [13:11:32] anyway, looks fine enough, syncing [13:11:52] of course we do, it's rather major news here :P [13:12:09] (03CR) 10Lucas Werkmeister (WMDE): Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:12:28] :D [13:12:29] heh, thanks for 
checking Lucas_WMDE [13:12:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/906009 (owner: 10Muehlenhoff) [13:14:17] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:14:37] * Lucas_WMDE checks the enwiki and dewiki front pages for comparison and 🤮 [13:15:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:08] (03CR) 10Jbond: "ill merge theses after easter" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [13:16:35] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:16:58] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905950|GrowthExperiments: enable add link backend in wiki rounds (8,9th) (T308133 T308134)]] (duration: 08m 00s) [13:17:04] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [13:17:05] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [13:17:11] (03PS2) 10Lucas Werkmeister (WMDE): mediawiki.edit_attempt: Ignore events from PHP MPC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: 10Phuedx) [13:17:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: 10Phuedx) [13:17:39] phuedx: can the edit_attempt change be tested on mwdebug? 
[13:17:45] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 [13:17:57] Lucas_WMDE: Yes [13:18:02] yay [13:18:09] (03Merged) 10jenkins-bot: mediawiki.edit_attempt: Ignore events from PHP MPC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: 10Phuedx) [13:18:20] (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 [13:18:33] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905261|mediawiki.edit_attempt: Ignore events from PHP MPC (T309985)]] [13:18:37] T309985: Migrate WikiEditor EditAttemptStep instrument to Metrics Platform - https://phabricator.wikimedia.org/T309985 [13:19:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:19:48] (03PS1) 10Filippo Giunchedi: data-engineering: ignore 'status' label pint check [alerts] - 10https://gerrit.wikimedia.org/r/906020 (https://phabricator.wikimedia.org/T309182) [13:19:50] (03CR) 10Volans: "nits inline" [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:19:55] (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:19:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:905261|mediawiki.edit_attempt: Ignore events from PHP MPC (T309985)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:20:11] phuedx: then please test now :) [13:21:42] (03PS3) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 [13:21:45] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host 
cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:22:08] (03CR) 10Ayounsi: "thx" [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46079 and previous config saved to /var/cache/conftool/dbconfig/20230405-132318-root.json [13:23:22] (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:23:59] Lucas_WMDE: LGTM. I tested the change by opening the editor on a random page on enwiki and typing a few characters (but not saving the change) and observing several analytics events being logged to the mediawiki.edit_attempt stream [13:24:23] ok! [13:24:29] thanks [13:26:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:27:06] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:27:59] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] debdeploy-revdeps: Omit Breaks, Enhances, Conflicts, Replaces [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/906009 (owner: 10Muehlenhoff) [13:28:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:29:25] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905261|mediawiki.edit_attempt: Ignore events from PHP MPC (T309985)]] (duration: 10m 52s) [13:29:29] T309985: Migrate WikiEditor EditAttemptStep instrument to Metrics Platform - https://phabricator.wikimedia.org/T309985 [13:30:13] (03PS3) 10Lucas Werkmeister (WMDE): VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 
(https://phabricator.wikimedia.org/T333168) (owner: 10Phuedx) [13:30:19] (03PS4) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 [13:30:49] (03CR) 10Ssingh: [C: 03+2] admin: add fnavas-foundation to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh) [13:30:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) (owner: 10Phuedx) [13:31:32] (03Merged) 10jenkins-bot: VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) (owner: 10Phuedx) [13:31:57] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905601|VisualEditorFeatureUse sampling rate to 1 everywhere (T333168)]] [13:32:01] T333168: Increase VisualEditorFeatureUse sampling rate to 100% - https://phabricator.wikimedia.org/T333168 [13:32:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10ssingh) 05Open→03Resolved Oh, great. Thanks for sharing @Ottomata! [13:33:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:905601|VisualEditorFeatureUse sampling rate to 1 everywhere (T333168)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:33:31] phuedx: is this one testable too? [13:34:08] Lucas_WMDE: Yeah. I can do a quick spot check on mwdebug2002. 
I'll be monitoring it after it rolls out too [13:34:14] ok thanks [13:35:24] (03PS1) 10Alexandros Kosiaris: Assign proper insetup Puppet roles to machines [puppet] - 10https://gerrit.wikimedia.org/r/906023 [13:35:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:44] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:36:05] PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10ssingh) 05Open→03Resolved a:03ssingh @FNavas-foundation: Your access request has been merged. Please try again (in about 30 minutes from thi... [13:36:40] (03PS2) 10Elukey: istio: upgrade to upstream version 1.15.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) [13:38:46] (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:39:19] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:35] phuedx: are you doing the spot check? [13:40:38] just want to make sure we’re not both waiting for the other to say something ^^ [13:40:38] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.2 [software/homer] - 10https://gerrit.wikimedia.org/r/906019 (owner: 10Ayounsi) [13:40:47] Lucas_WMDE: Yes. Sorry. 
Thanks :) [13:41:07] ok, I’ll wait for your confirmation then ^^ [13:41:37] (03CR) 10Muehlenhoff: [C: 03+1] "Aligns with the role annotations we have in Hiera." [puppet] - 10https://gerrit.wikimedia.org/r/906023 (owner: 10Alexandros Kosiaris) [13:41:38] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:41:39] Lucas_WMDE: LGTM. Thanks [13:41:43] ok thanks! [13:41:47] syncing [13:42:36] I'll monitor the impact closely over at https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=VisualEditorFeatureUse [13:43:57] Lucas_WMDE: If the port from WIKIBASE_REPO shows up for test, it'll show up for prod too where WIKIBASE_REPO: http://www.wikidata.org:6500/w [13:44:09] hm, good point ^^ [13:44:35] I'll wait until you're done with the deploy and merge the new patch, if that's all right with you? [13:45:34] sounds good! 
[13:46:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905601|VisualEditorFeatureUse sampling rate to 1 everywhere (T333168)]] (duration: 14m 47s) [13:46:49] T333168: Increase VisualEditorFeatureUse sampling rate to 100% - https://phabricator.wikimedia.org/T333168 [13:47:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:47:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:47:57] claime: I’m done [13:48:04] ack [13:48:07] let's go :P [13:48:27] (03CR) 10Clément Goubert: [C: 03+2] Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:48:32] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1009.eqiad.wmnet [13:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:48:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1004.eqiad.wmnet with reason: restart kafka, switch to PKI [13:49:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1004.eqiad.wmnet with reason: restart kafka, switch to PKI [13:52:10] !log restart kafka on kafka-main1004 to pick up the new TLS certificate (PKI based) - T319372 [13:52:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:14] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [13:52:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1009.eqiad.wmnet [13:52:55] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1010.eqiad.wmnet [13:52:59] (03Merged) 10jenkins-bot: Revert "Revert "termbox: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905963 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:53:30] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [13:53:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:39] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [13:53:52] Lucas_WMDE: deployed the test release [13:54:03] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [13:54:15] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [13:54:16] ok, checking [13:54:30] And the staging release for good measure, even if it doesn't seem used [13:54:42] (03PS1) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) [13:54:55] hm, no termbox on https://test.m.wikidata.org/wiki/Q229878 with noscript [13:55:09] * claime grumbles [13:55:25] “timeout of 3000ms exceeded” in logstash :( [13:55:29] Yeah [13:55:57] (03PS10) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 
(https://phabricator.wikimedia.org/T329669) [13:56:04] (03PS17) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [13:56:57] Lucas_WMDE: Can test.wikidata be switched to use the staging release and not the test release ? [13:57:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1010.eqiad.wmnet [13:57:07] (the one that's using the service mesh) [13:57:52] On port 4004, not 3031 [13:57:58] no idea tbh [13:58:32] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Ottomata) From a brief glance, those look like normal consumer reassignment messages. Probably shouldn't be alerts. [13:58:34] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1008.eqiad.wmnet with OS bullseye [13:58:39] 10SRE, 10SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1008.eqiad.wmnet with OS bullseye [13:58:41] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) 05Open→03Resolved updated backplane firmware looks like errors have resolved [13:59:24] I tried curling termbox-test.staging.svc.eqiad.wmnet:4004 and got “empty reply from server” [14:00:05] Lucas_WMDE: curl -k https://termbox-test.staging.svc.eqiad.wmnet:4004/?spec [14:00:26] curl -k https://termbox-test.staging.svc.eqiad.wmnet:3031/?spec [14:00:28] curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number [14:00:30] lol. [14:00:32] awesome. 
[14:00:48] ok, but [14:00:49] curl -vk 'https://termbox-test.staging.svc.eqiad.wmnet:4004/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en' [14:00:51] !log powercycle an-worker1132 [14:00:52] is a 500 Internal Server Error [14:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:17] (03CR) 10Ayounsi: [C: 03+2] Bird: remove anycast subnet filter [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [14:01:32] (03CR) 10Ayounsi: [C: 03+2] Bird: remove anycast subnet filter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [14:03:15] hm, but this is interesting [14:03:26] if I curl termbox-test:4004, there are different logstash errors [14:03:31] “Request failed with status code 503” [14:03:43] and it’s apparently talking to www.wikidata.org:6500 ? [14:04:07] so https://termbox-test.staging.svc.eqiad.wmnet:4004 is somehow a prod termbox instead of a testwikidatawiki one? [14:04:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10nskaggs) As an update, this is now blocked on {T297596}. The previous implementation discussion led to a finalization of guidelines... [14:04:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10nskaggs) [14:04:44] Lucas_WMDE: Yeah, staging uses the same endpoints as prod [14:04:45] I guess that’s because values-staging.yaml doesn’t change the WIKIBASE_REPO etc. like values-test.yaml does [14:04:48] yeah [14:04:51] exactly [14:04:53] ok [14:05:04] but doesn’t that mean we can’t use it for test wikidata? 
[14:05:08] I have a meeting, I'll roll back
[14:05:12] ok thanks
[14:05:19] We'll figure that out :P
[14:05:47] (PS1) Clément Goubert: Revert "Revert "Revert "termbox: Switch to mw-api-int-async on k8s""" [deployment-charts] - https://gerrit.wikimedia.org/r/905973
[14:05:59] competing with Amir1 for the most reverts in a commit message, I see :P
[14:06:17] Yes :D
[14:06:34] I'll end up doing Revert^5
[14:08:37] you have a lot to catch up on. I'm not worried
[14:08:43] (CR) Clément Goubert: [C: +2] Revert "Revert "Revert "termbox: Switch to mw-api-int-async on k8s""" [deployment-charts] - https://gerrit.wikimedia.org/r/905973 (owner: Clément Goubert)
[14:10:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:10:30] (PS1) Jbond: P:netbox: add consumeres fo prefixes and net devices [puppet] - https://gerrit.wikimedia.org/r/906031
[14:11:34] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1008.eqiad.wmnet with reason: host reimage
[14:11:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[14:12:06] (CR) CI reject: [V: -1] P:netbox: add consumeres fo prefixes and net devices [puppet] - https://gerrit.wikimedia.org/r/906031 (owner: Jbond)
[14:13:55] (Merged) jenkins-bot: Revert "Revert "Revert "termbox: Switch to mw-api-int-async on k8s""" [deployment-charts] - https://gerrit.wikimedia.org/r/905973 (owner: Clément Goubert)
[14:14:04] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[14:14:19] !log cgoubert@deploy2002 helmfile [staging] DONE 
helmfile.d/services/termbox: apply
[14:14:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1008.eqiad.wmnet with reason: host reimage
[14:14:29] Lucas_WMDE: rollback done
[14:15:15] phuedx, looks like VisualEditorFeatureUse validation errors are creeping up
[14:16:02] SRE, SRE-Access-Requests, API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (JArguello-WMF)
[14:16:18] claime: thanks
[14:22:54] (CR) Muehlenhoff: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[14:23:59] phuedx: e.g. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-default-1-7.0.0-1-2023.04.05?id=E13NUYcBwEtI0jFYGROB
[14:24:37] ottomata: Looking. There appears to be a bunch of events with missing data. Happy to roll the change back and then investigate what's going on with the instrument
[14:24:48] * Lucas_WMDE still around if needed
[14:24:51] phuedx: no need to roll back, just as long as you know and are working on it.
[14:25:06] it looks like maybe these errors were there before, it's just now there are more of them
[14:25:23] the errors aren't hurting anything right now
[14:25:51] https://grafana.wikimedia.org/goto/qhs4YFLVz?orgId=1
[14:26:42] (PS2) Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[14:27:25] (CR) Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[14:30:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:30:32] (PS1) Majavah: cinderutils: stop provisioning old filename on bookworm [puppet] - https://gerrit.wikimedia.org/r/906034
[14:30:36] SRE, ops-eqiad, Data-Engineering, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (elukey) Resolved→Open @Jclark-ctr hi! I tried to reboot the node and it gets blocked when checking the hard drivers, telling me about possible preserved cache et...
[14:30:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1005.eqiad.wmnet with reason: restart kafka, switch to PKI
[14:31:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1005.eqiad.wmnet with reason: restart kafka, switch to PKI
[14:31:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1008.eqiad.wmnet with OS bullseye
[14:31:24] SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1008.eqiad.wmnet with OS bullseye completed: - kafka-test1008 (**PASS**) - Downtimed on Icinga/A...
[14:33:37] !log restart kafka on kafka-main1005 to pick up the new TLS certificate (PKI based) - T319372
[14:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:41] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[14:34:13] (CR) Volans: P:netbox: add consumeres fo prefixes and net devices (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906031 (owner: Jbond)
[14:34:38] (PS1) DCausse: rdf-streaming-updater: increase mem overhead to 45% [deployment-charts] - https://gerrit.wikimedia.org/r/906035
[14:36:33] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1009.eqiad.wmnet with OS bullseye
[14:36:40] SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1009.eqiad.wmnet with OS bullseye
[14:38:11] (CR) Bking: [C: +2] rdf-streaming-updater: increase mem overhead to 45% [deployment-charts] - https://gerrit.wikimedia.org/r/906035 (owner: DCausse)
[14:43:30] (Merged) jenkins-bot: 
rdf-streaming-updater: increase mem overhead to 45% [deployment-charts] - https://gerrit.wikimedia.org/r/906035 (owner: DCausse)
[14:44:38] (CR) Muehlenhoff: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[14:48:12] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:48:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:48:22] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[14:48:28] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:49:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:50] (CR) AOkoth: exim: fix hard-coded vrts hostname (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[14:51:36] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1009.eqiad.wmnet with reason: host reimage
[14:54:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1009.eqiad.wmnet with reason: host reimage
[14:55:09] (PS1) DCausse: rdf-streaming-updater: increase jvm-overhead.max [deployment-charts] - https://gerrit.wikimedia.org/r/906040
[14:55:49] (CR) Hnowlan: rest-gateway: add helmfile, enable mobileapps 
(1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: Hnowlan)
[14:58:19] (PS3) Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[14:58:22] (CR) Ahmon Dancy: [C: +1] scap: block Scap execution on inactive deployment hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[14:59:35] (CR) Ahmon Dancy: [C: +1] scap: block Scap execution on inactive deployment hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[14:59:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:59:50] (CR) Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[15:00:08] ottomata: I should have looked at the error rate and nature of the errors before increasing the sampling rate. I'm OoO next week and am trying to close out a few dangling threads. 
I think it's best to revert for now and take a look at the instrument when I get back
[15:00:11] ^ Lucas_WMDE
[15:01:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:02:36] (PS1) Phuedx: Revert "VisualEditorFeatureUse sampling rate to 1 everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/905979
[15:02:38] (CR) Clément Goubert: [C: +2] mediawiki::scap: Ensure Exec['fetch_mediawiki'] resource always exists [puppet] - https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy)
[15:03:25] (CR) Cwhite: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: Filippo Giunchedi)
[15:03:29] o/
[15:03:30] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) @ssingh -- thank you. Now i can't get in but i think it is an ITS issue.
[15:03:47] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:03:51] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:04:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.376 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:04:21] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (ssingh) >>! In T331482#8759447, @FNavas-foundation wrote: > @ssingh -- thank you. Now i can't get in but i think it is an ITS issue. Make sure you are logging in with...
[15:04:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:04:43] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:05:00] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:05:25] phuedx: we’re reverting VisualEditorFeatureUse, not edit_attempt, right?
[15:05:33] ah, I see the revert already exists :)
[15:05:42] Lucas_WMDE: Yes. That's correct. Revert exists: https://gerrit.wikimedia.org/r/905979
[15:05:54] jouncebot: now
[15:05:54] No deployments scheduled for the next 1 hour(s) and 54 minute(s)
[15:05:56] SRE, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite)
[15:06:26] don’t see anything else going on that looks like I shouldn’t deploy right now
[15:06:30] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/905979 (owner: Phuedx)
[15:06:32] let’s go
[15:07:00] (PS1) Mazevedo: Add session schema config for mobile apps [mediawiki-config] - https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481)
[15:07:22] (Merged) jenkins-bot: Revert "VisualEditorFeatureUse sampling rate to 1 everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/905979 (owner: Phuedx)
[15:07:50] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905979|Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"]]
[15:07:52] (CR) CI reject: [V: -1] Add session schema config for mobile apps [mediawiki-config] - https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[15:08:45] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[15:09:13] 
!log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[15:09:18] !log installing nodejs security updates on buster
[15:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:905979|Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[15:09:34] phuedx: anything to test, or should I just deploy right away?
[15:10:01] Lucas_WMDE: Deploy right away I think. I'll monitor error rate and event rate
[15:10:09] ok, doing
[15:10:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1009.eqiad.wmnet with OS bullseye
[15:10:31] SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1009.eqiad.wmnet with OS bullseye completed: - kafka-test1009 (**PASS**) - Downtimed on Icinga/A...
[15:11:00] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:13:08] Thanks, Lucas_WMDE
[15:13:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:13:35] (CR) Cwhite: [C: +2] rsyslog: add rsyslog-namespaced fields to syslog_json [puppet] - https://gerrit.wikimedia.org/r/904597 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[15:14:20] (PS1) Eevans: cassandra: create aqs cluster user for 'fgoodwin' [puppet] - https://gerrit.wikimedia.org/r/906044 (https://phabricator.wikimedia.org/T334099)
[15:14:56] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:15:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905979|Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"]] (duration: 07m 42s)
[15:15:40] (CR) Eevans: [C: +2] cassandra: create aqs cluster user for 'fgoodwin' [puppet] - https://gerrit.wikimedia.org/r/906044 (https://phabricator.wikimedia.org/T334099) (owner: Eevans)
[15:16:09] * Lucas_WMDE done
[15:16:24] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:17:00] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) You missed a part of this conversation that involved ITS removing fnavas-foundation in favor of the verified WMF ITS created SUL wiki account FNavas-...
[15:21:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=7; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:21:17] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (ssingh) >>! In T331482#8759537, @FNavas-foundation wrote: > You missed a part of this conversation that involved ITS > removing fnavas-foundation in favor of the verifi...
[15:21:53] !log installing pcre2 security updates on buster
[15:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:00] (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:24:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:25:05] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:26:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 5.530 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:27:14] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host kafka-test1010.eqiad.wmnet with OS bullseye
[15:27:19] SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host kafka-test1010.eqiad.wmnet with OS bullseye
[15:28:00] (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:30:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=8; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:31:25] !log restarting FPM on mediawiki canaries to pick up pcre security update
[15:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:24] (CR) Andrew Bogott: "Hello everyone! This is still a useful patch, still in need of review." [software/cumin] - https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: Andrew Bogott)
[15:37:38] (CR) Andrew Bogott: "Hello everyone! This is still a useful patch, still in need of review." [software/cumin] - https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: Andrew Bogott)
[15:37:44] (PS1) Ahmon Dancy: mediawiki::scap: force creation of the symlink when enabled [puppet] - https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857)
[15:38:08] (CR) CI reject: [V: -1] mediawiki::scap: force creation of the symlink when enabled [puppet] - https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy)
[15:38:48] (CR) Dzahn: "Yes, Arnold, that's correct. Change would be just in DNS then." 
[puppet] - https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[15:39:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:39:40] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[15:40:07] (CR) BCornwall: [C: +1] "Good catch, thanks!" [puppet] - https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: Filippo Giunchedi)
[15:40:19] (PS2) Ahmon Dancy: mediawiki::scap: force creation of the symlink when enabled [puppet] - https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857)
[15:41:21] (PS1) Andrew Bogott: wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project [puppet] - https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471)
[15:41:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1010.eqiad.wmnet with reason: host reimage
[15:42:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[15:42:20] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[15:42:57] (CR) BCornwall: [V: +1 C: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40548/console" [puppet] - https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: Filippo Giunchedi)
[15:44:00] SRE, LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (ssingh) @KFrancis: Hi! @MarcoAurelio needs an NDA for this request to proceed. Thank you!
[15:44:19] (CR) Andrew Bogott: "To be merged during migration window" [puppet] - https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471) (owner: Andrew Bogott)
[15:44:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1010.eqiad.wmnet with reason: host reimage
[15:47:36] (CR) Clément Goubert: [C: +2] mediawiki::scap: force creation of the symlink when enabled [puppet] - https://gerrit.wikimedia.org/r/906051 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy)
[15:47:53] !log Disable Puppet/PyBal on lvs4008 in preparation for reimaging - T321309
[15:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:57] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[15:50:59] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[15:51:09] Amir1: got a minute, may I PM?
[15:51:24] sure
[15:51:25] what's up
[15:51:33] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:51:37] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:51:50] ^ expected, brett is reimaging lvs4008
[15:52:23] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:52:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:54:47] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[15:55:31] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:02:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafka-test1010.eqiad.wmnet with OS bullseye
[16:02:50] SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host kafka-test1010.eqiad.wmnet with OS bullseye completed: - kafka-test1010 (**PASS**) - Downtimed on Icinga/A...
[16:04:01] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:04:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:04:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:11:19] (PS1) BCornwall: hiera: lvs/interfaces: update lvs4008 iface name [puppet] - https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309)
[16:12:46] (CR) Ssingh: hiera: lvs/interfaces: update lvs4008 iface name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[16:13:43] (PS2) BCornwall: hiera: lvs/interfaces: update lvs4008 iface name [puppet] - https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309)
[16:13:45] (CR) BCornwall: hiera: lvs/interfaces: update lvs4008 iface name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[16:16:01] (CR) Ssingh: [C: +1] hiera: lvs/interfaces: 
update lvs4008 iface name [puppet] - https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[16:18:08] !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:18:55] (CR) Tchanders: [C: +1] Undeploy SimilarEditors from Beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: TsepoThoabala)
[16:19:38] (CR) BCornwall: [C: +2] hiera: lvs/interfaces: update lvs4008 iface name [puppet] - https://gerrit.wikimedia.org/r/906057 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[16:20:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bullseye
[16:20:55] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye
[16:22:01] (CR) JHathaway: exim: fix hard-coded vrts hostname (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[16:24:29] SRE, SRE-Unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (elukey) Open→Resolved a:elukey All nodes migrated to Bullseye! To keep archives happy - I didn't preserve any data when reimaging the VMs, Kafka's data was not a lot and the brokers were abl...
[16:24:33] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (elukey)
[16:28:19] (PS1) Jbond: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - https://gerrit.wikimedia.org/r/906065
[16:29:22] (PS1) Jbond: spicerack: install python3-aiohttp [puppet] - https://gerrit.wikimedia.org/r/906066
[16:30:01] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:30:14] (PS2) Jbond: spicerack: install python3-aiohttp [puppet] - https://gerrit.wikimedia.org/r/906066
[16:30:43] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:31:02] (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - https://gerrit.wikimedia.org/r/906065 (owner: Jbond)
[16:31:04] (CR) Jbond: [C: -1] "-1: re volans comment" [puppet] - https://gerrit.wikimedia.org/r/906031 (owner: Jbond)
[16:31:09] (PS2) Jbond: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - https://gerrit.wikimedia.org/r/906065
[16:33:15] (CR) Volans: "question inline" [cookbooks] - https://gerrit.wikimedia.org/r/906065 (owner: Jbond)
[16:34:03] (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - https://gerrit.wikimedia.org/r/906065 (owner: Jbond)
[16:34:14] (PS3) JHathaway: Add an in place Debian upgrade script [puppet] - https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706)
[16:35:18] (CR) JHathaway: Add an in place Debian upgrade script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: JHathaway)
[16:36:35] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:37:11] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host 
reimage
[16:40:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[16:45:10] (CR) Jbond: sre.puppet.sync-netbox-hiera: add asincio (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/906065 (owner: Jbond)
[16:46:29] (CR) Jbond: sre.puppet.sync-netbox-hiera: add asincio (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/906065 (owner: Jbond)
[16:47:08] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in codfw: Depool from primary DC following network maintenance
[16:47:09] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors
[16:47:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors
[16:47:43] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs4008.ulsfo.wmnet with OS bullseye
[16:47:50] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye executed with errors: - lvs4008 (**FAIL**) - Downtimed on Icinga/Alertmanager...
[16:47:55] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bullseye
[16:48:02] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye
[16:52:11] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: Depool from primary DC following network maintenance
[16:54:11] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:54:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:56:31] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lists1003.wikimedia.org with reason: Moar CPUs!
[16:56:46] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists1003.wikimedia.org with reason: Moar CPUs!
[17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1700)
[17:00:20] (PS1) Ahmon Dancy: Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - https://gerrit.wikimedia.org/r/905983
[17:00:32] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (jhathaway) I bumped the CPU count to four and as @MoritzMuehlenhoff mentioned we can always bump higher if the need arises.
[17:02:29] (03CR) 10CI reject: [V: 04-1] Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983 (owner: 10Ahmon Dancy) [17:03:06] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [17:04:40] (03PS1) 10Jdlrobson: ReadingLists: Show previews on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906069 [17:06:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [17:08:11] jouncebot: now [17:08:12] For the next 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1700) [17:18:45] 10SRE-swift-storage: Bring ms-fe101[3-4] into service - https://phabricator.wikimedia.org/T334122 (10Eevans) [17:19:31] 10SRE-swift-storage: Bring ms-fe101[3-4] into service - https://phabricator.wikimedia.org/T334122 (10Eevans) p:05Triage→03Medium [17:19:31] (03CR) 10Volans: Add an in place Debian upgrade script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [17:22:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bullseye [17:22:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed: - lvs4008 (**WARN**) - Downtimed on Icinga/Alertmanager - //Unable... 
[17:23:47] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:27:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906069 (owner: 10Jdlrobson) [17:28:35] (03Merged) 10jenkins-bot: ReadingLists: Show previews on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906069 (owner: 10Jdlrobson) [17:28:35] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:28:40] !log deploying labs-only change [17:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:27] (03Abandoned) 10David Caro: buildservice: use /app as workingdir [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/906005 (owner: 10David Caro) [17:32:23] !log Disable Puppet/PyBal on lvs4009 in preparation for reimaging - T321309 [17:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:27] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [17:34:36] (03PS1) 10BCornwall: hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309) [17:35:02] (03PS2) 10BCornwall: hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309) [17:35:47] (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:35:59] PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal 
[17:36:05] (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update lvs4009 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906076 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [17:36:07] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:14] ^ expected [17:36:17] PROBLEM - pybal on lvs4009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:36:27] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:40] (03PS1) 10Eevans: swift: add ms-fe101[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122) [17:38:39] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [17:41:53] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:46:13] (03CR) 10David Caro: "This is going to help a lot testing stuff \o/" [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [17:50:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS bullseye [17:51:06] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye [17:51:22] 10SRE, 
10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:54:36] (03PS2) 10Mazevedo: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) [17:54:42] (03CR) 10Dzahn: "I am going through the users of this role at https://openstack-browser.toolforge.org/puppetclass/role::simplelamp2 to check what their cur" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [17:55:18] (03CR) 10CI reject: [V: 04-1] Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo) [17:57:25] (03PS3) 10Mazevedo: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) [18:00:05] hashar and dduvall: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1800). Please do the needful. [18:00:05] hashar and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T1800). 
[18:03:16] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [18:06:33] (03PS2) 10Ahmon Dancy: Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983 (https://phabricator.wikimedia.org/T329857) [18:09:43] (03CR) 10Dzahn: "example compile on cloud VPS host name: https://puppet-compiler.wmflabs.org/output/888800/40549/signwriting-swis-2022.signwriting.eqiad1.w" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [18:11:33] (03CR) 10Dzahn: "I think now that I have to go through each existing project, check their data dir and whether they have restarted, then those that actuall" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [18:16:53] (03CR) 10Jforrester: [C: 03+1] "Let's deploy this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/896837 (owner: 10Legoktm) [18:22:08] (03PS1) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 [18:22:10] (03PS1) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) [18:22:12] (03PS1) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) [18:31:46] (03PS2) 10Majavah: openstack: puppet-enc: add foreign keys for hiera/role tables [puppet] - 10https://gerrit.wikimedia.org/r/906085 [18:31:48] (03PS2) 10Majavah: openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) [18:31:50] (03PS2) 10Majavah: openstack: admin_scripts: properly remove old projects from enc [puppet] - 10https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) [18:37:32] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs4009.ulsfo.wmnet with OS bullseye [18:37:39] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye executed with errors: - lvs4009 (**FAIL**) - Downtimed on Icinga/Alertmanager... 
[18:37:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS bullseye [18:38:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye [18:38:13] (03CR) 10JHathaway: Add an in place Debian upgrade script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [18:48:49] (03CR) 10Herron: [C: 03+1] sre: mute etcd-mirror pint promql checks [alerts] - 10https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [18:52:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage [18:56:57] (03PS4) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) [18:58:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage [19:12:16] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:12:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4009.ulsfo.wmnet with OS bullseye [19:13:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye completed: - lvs4009 (**PASS**) - Removed from Puppet and PuppetDB if present... 
[19:19:01] !log mforns@deploy2002 Started deploy [analytics/refinery@944a995]: Regular analytics weekly train [analytics/refinery@944a995] [19:19:50] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:24:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:25:21] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) ITS should have removed fnavas-foundation entirely. Do you still see it? or is the only issue that FNavas-WMF is not LDAP? [19:25:33] !log mforns@deploy2002 Finished deploy [analytics/refinery@944a995]: Regular analytics weekly train [analytics/refinery@944a995] (duration: 06m 31s) [19:25:42] !log mforns@deploy2002 Started deploy [analytics/refinery@944a995] (thin): Regular analytics weekly train THIN [analytics/refinery@944a995] [19:25:43] (03PS2) 10Dzahn: simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) [19:25:51] !log mforns@deploy2002 Finished deploy [analytics/refinery@944a995] (thin): Regular analytics weekly train THIN [analytics/refinery@944a995] (duration: 00m 08s) [19:25:59] !log mforns@deploy2002 Started deploy [analytics/refinery@944a995] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@944a995] [19:26:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded 
[19:27:28] !log mforns@deploy2002 Finished deploy [analytics/refinery@944a995] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@944a995] (duration: 01m 29s) [19:27:48] (03CR) 10CI reject: [V: 04-1] simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:29:33] (03PS1) 10BCornwall: hiera: lvs/interfaces: update 5004 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906098 (https://phabricator.wikimedia.org/T321309) [19:30:35] !log Disable Puppet/PyBal on lvs5004 in preparation for reimaging - T321309 [19:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:39] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [19:31:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:32:05] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) @FNavas-foundation What matters here is what login you are using on the wikitech wiki ( https://wikitech.wikimedia.org/wiki/Main_Page). If the user works there t... 
[19:32:22] 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10MarcoAurelio) [19:32:27] 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10MarcoAurelio) [19:34:20] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:34:30] hmm - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1 <-- mailman queue got somewhat backlogged again today [19:35:34] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:35:44] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:35:55] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Now, when checking LDAP we can see there are 2 users: - uid: fnavas (43544) - uid: fnavas-foundation (43670) Both are using the same -ctr@wikimedia email addre... [19:39:05] (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update 5004 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906098 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:41:06] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10MarcoAurelio) Not sure if there's anything actionable here left to do. Lo... 
[19:41:36] (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update 5004 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906098 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:42:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:42:36] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:48:03] sukhe: Just making sure this isn't possibly my fault [19:50:16] brett: no, not related to us [19:50:18] all good [19:52:53] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5004.eqsin.wmnet with OS bullseye [19:53:04] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5004.eqsin.wmnet with OS bullseye [19:54:14] 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10MarcoAurelio) [19:54:56] !log mforns@deploy2002 Started deploy [analytics/refinery@eb4c2b2]: Regular analytics weekly train [analytics/refinery@eb4c2b2] [19:55:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:59] If no one objects, I will depool sessionstore in eqiad in 
the next 30 minutes or so to conduct some experiments (see: T327954) [19:56:59] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230405T2000). Please do the needful. [20:00:05] tsepoThoabala, nray, and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:01] o/ [20:01:07] Hello :) [20:01:22] !log mforns@deploy2002 Finished deploy [analytics/refinery@eb4c2b2]: Regular analytics weekly train [analytics/refinery@eb4c2b2] (duration: 06m 26s) [20:01:32] !log mforns@deploy2002 Started deploy [analytics/refinery@eb4c2b2] (thin): Regular analytics weekly train THIN [analytics/refinery@eb4c2b2] [20:01:41] !log mforns@deploy2002 Finished deploy [analytics/refinery@eb4c2b2] (thin): Regular analytics weekly train THIN [analytics/refinery@eb4c2b2] (duration: 00m 08s) [20:01:59] !log mforns@deploy2002 Started deploy [analytics/refinery@eb4c2b2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eb4c2b2] [20:02:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:36] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:33] !log mforns@deploy2002 Finished deploy [analytics/refinery@eb4c2b2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eb4c2b2] (duration: 01m 34s) 
[20:06:27] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) @elukey so the foreign drives have affected both OS drives; it will need to be reimaged and is not letting me clear it. I did open the box and found a... [20:09:33] (03CR) 10Ottomata: Updates to kafka-dev chart for running in minikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 (owner: 10Ottomata) [20:12:07] (03PS2) 10AOkoth: exim: fix hard-coded vrts hostname [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) [20:13:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:14:04] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) @elukey i was able to clear foreign status but will still need to be reimaged. [20:14:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:15:19] Is anyone around to help deploy? [20:16:41] Uhm... 
no one replied to the ping :( [20:17:18] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@2192f15]: (no justification provided) [20:17:31] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@2192f15]: (no justification provided) (duration: 00m 12s) [20:18:26] (03CR) 10Ottomata: [C: 03+2] Updates to kafka-dev chart for running in minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 (owner: 10Ottomata) [20:18:53] hi - sorry to be late - i can help deploy if there's still a need [20:19:07] \o/ [20:19:35] ok - i'll start with the top of the queue [20:19:37] yes please [20:19:49] (03PS2) 10Clare Ming: Undeploy SimilarEditors from Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala) [20:19:52] thank you cjming :) [20:20:16] np! [20:21:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage [20:21:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala) [20:21:58] (03Merged) 10jenkins-bot: Undeploy SimilarEditors from Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala) [20:22:22] !log cjming@deploy2002 Started scap: Backport for [[gerrit:896936|Undeploy SimilarEditors from Beta (T331718)]] [20:22:28] T331718: Undeploy SimilarEditors from Beta - https://phabricator.wikimedia.org/T331718 [20:22:41] tsepoThoabala: are your changes testable? on any debug server if so [20:23:00] no they are not. [20:23:12] so i'll just sync then [20:23:24] cool thanks. 
[20:24:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage [20:24:36] (03Merged) 10jenkins-bot: Updates to kafka-dev chart for running in minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 (owner: 10Ottomata) [20:26:16] scap is hanging a bit [20:29:07] (03PS1) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) [20:33:20] (03PS1) 10JHathaway: aux: Update jaeger templates to match upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/906104 (https://phabricator.wikimedia.org/T320554) [20:35:22] (03PS2) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) [20:35:24] RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:38:42] (03PS3) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) [20:43:54] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:44:00] 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [20:44:02] !log cjming@deploy2002 tsepothoabala and cjming: Backport for [[gerrit:896936|Undeploy SimilarEditors from Beta (T331718)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:44:06] T331718: Undeploy SimilarEditors from Beta - https://phabricator.wikimedia.org/T331718 [20:44:36] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections 
established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [20:44:53] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5004.eqsin.wmnet with OS bullseye [20:44:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5004.eqsin.wmnet with OS bullseye completed: - lvs5004 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled... [20:45:24] tsepoThoabala: syncing now -- it's been a while since i last deployed -- i don't recall scap taking so long but i guess that's the new normal these days [20:45:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:45:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:45:55] yes, this seems to have gone a bit long, thanks [20:46:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:49:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:51:59] (03CR) 10JHathaway: [C: 03+2] aux: Update jaeger templates to match upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/906104 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [20:57:43] !log Disable Puppet/PyBal on lvs5005 in preparation for reimaging - T321309 [20:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:47] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [20:58:03] !log 
cjming@deploy2002 Finished scap: Backport for [[gerrit:896936|Undeploy SimilarEditors from Beta (T331718)]] (duration: 35m 41s) [20:58:07] T331718: Undeploy SimilarEditors from Beta - https://phabricator.wikimedia.org/T331718 [20:58:12] tsepoThoabala: your changes should be live! [20:58:37] cjming thank you. [20:58:56] hi nray! shall we move onto your patch? [20:59:03] cjming: sounds good! [20:59:05] (03PS3) 10Dzahn: simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) [20:59:25] (03CR) 10Cwhite: [C: 03+2] add haproxy ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902611 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:59:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray) [21:00:19] (03Merged) 10jenkins-bot: Add static mobile United_States page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905769 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray) [21:00:40] !log cjming@deploy2002 Started scap: Backport for [[gerrit:905769|Add static mobile United_States page to facilitate synthetic testing of T331681 (T331681)]] [21:00:44] T331681: Measure performance of cookie-based anonymous client preferences - https://phabricator.wikimedia.org/T331681 [21:01:31] (03PS1) 10BCornwall: hiera: lvs/interfaces: update 5005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906126 (https://phabricator.wikimedia.org/T321309) [21:01:52] !log UTC late backport & config window continuing [21:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:06] !log cjming@deploy2002 cjming and nray: Backport for [[gerrit:905769|Add static mobile United_States page to facilitate synthetic testing of T331681 
(T331681)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:02:25] nray: can you test on a debug server? [21:02:31] cjming: yes [21:02:39] cjming: testing now, thank you! [21:02:45] that went way faster [21:04:00] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [21:04:56] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [21:05:02] PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:05:16] cjming: tested and things look good [21:05:29] nray: great - syncing [21:05:34] (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update 5005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906126 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [21:05:42] (03CR) 10BCornwall: [C: 03+2] hiera: lvs/interfaces: update 5005 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906126 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [21:07:12] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:08:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:29] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [21:09:43] (03CR) 10Dzahn: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/output/888800/40550/signwriting-swis-2022.signwriting.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [21:10:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [21:10:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:10:46] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:905769|Add static mobile United_States page to facilitate synthetic testing of T331681 (T331681)]] (duration: 10m 06s) [21:10:50] T331681: Measure performance of cookie-based anonymous client preferences - https://phabricator.wikimedia.org/T331681 [21:10:58] nray: your changes are live! nice to see you :) [21:11:33] cjming: \o/ thank you! Great to see you too! [21:11:47] Superpes: if you're still around, happy to do your patch [21:12:23] Hi cjming :D Yep I'm here! 
Many thanks :) [21:12:24] (03PS1) 10Andrew Bogott: Added dummy ldap_os_system_pass [labs/private] - 10https://gerrit.wikimedia.org/r/906128 (https://phabricator.wikimedia.org/T330759) [21:12:35] (03PS2) 10Clare Ming: [mgwiki] Replace the wordmark on Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) (owner: 10Superpes15) [21:12:46] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added dummy ldap_os_system_pass [labs/private] - 10https://gerrit.wikimedia.org/r/906128 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [21:13:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) (owner: 10Superpes15) [21:14:45] (03Merged) 10jenkins-bot: [mgwiki] Replace the wordmark on Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905960 (https://phabricator.wikimedia.org/T334022) (owner: 10Superpes15) [21:15:09] !log cjming@deploy2002 Started scap: Backport for [[gerrit:905960|[mgwiki] Replace the wordmark on Vector 2022 (T334022)]] [21:15:13] T334022: Word mark for Malagasy Wikipedia mobile site is in Guarani - https://phabricator.wikimedia.org/T334022 [21:16:22] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5005.eqsin.wmnet with OS bullseye [21:16:28] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5005.eqsin.wmnet with OS bullseye [21:16:34] !log cjming@deploy2002 superpes and cjming: 
Backport for [[gerrit:905960|[mgwiki] Replace the wordmark on Vector 2022 (T334022)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:16:53] Superpes: can you test on a debug server? [21:17:09] Sure! Looking :) [21:17:51] It's fine cjming :) [21:17:59] cool - syncing [21:19:22] (03CR) 10Dzahn: [C: 03+2] "I double-checked on every instance that uses this - noop everywhere - after taavi added Hiera keys for me in those projects. Now this is f" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [21:21:08] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:21:28] Superpes: once it's sync'd I believe I need to purge that file - not sure where to run purgeList from these days [21:23:07] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:905960|[mgwiki] Replace the wordmark on Vector 2022 (T334022)]] (duration: 07m 58s) [21:23:11] T334022: Word mark for Malagasy Wikipedia mobile site is in Guarani - https://phabricator.wikimedia.org/T334022 [21:23:57] cjming: that would be currently mwmaint2002.codfw.wmnet (whatever is mwmaint.discovery.wmnet DNS entry points to) [21:24:18] mutante: thanks! [21:24:21] yw [21:25:56] 10SRE, 10Wikimedia-Mailing-lists: Add new owners to the wikies-l mailing list - https://phabricator.wikimedia.org/T334135 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}} [21:26:20] Superpes: your change should be live - i also purged the file so hopefully you see the new wordmark [21:26:50] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Hi @volans - thanks for the details on S/N #7S5LMH3, 7S5MMH3, 7S5NMH3, 7S5PMH3, and 5BF90C3. The first four were deleted in error, which @RobH just fixed...and... [21:27:56] Oh wonderful! 
I confirm that I see it live :) Many thanks for your time cjming :) [21:28:09] ur welcome! [21:28:33] !log end of UTC late backport window [21:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:39] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [21:31:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [21:31:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:41:47] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [21:41:53] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:45:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [21:51:56] (03PS1) 10Eevans: sessionstore: make native transport (intentionally) unreachable [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) [21:52:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:52:35] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [21:52:56] RECOVERY - Router interfaces 
on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:55:01] 10SRE, 10LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10KFrancis) Hi @MarcoAurelio, please send your email address to kfrancis@wikimedia.org and I'll process this request. Thanks! [22:03:13] (03PS2) 10Eevans: sessionstore: make native transport (intentionally) unreachable [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) [22:03:33] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [22:04:18] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [22:05:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5005.eqsin.wmnet with OS bullseye [22:05:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5005.eqsin.wmnet with OS bullseye completed: - lvs5005 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled... [22:08:26] (03PS3) 10Eevans: sessionstore: make native transport (intentionally) unreachable [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) [22:12:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:12:42] (03CR) 10BBlack: [C: 03+1] "Looks right to me - moves listener from the normal port 9042 to 9043, making it unavailable to clients!" 
[puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [22:14:16] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [22:17:16] (03CR) 10Eevans: [C: 03+2] sessionstore: make native transport (intentionally) unreachable [puppet] - 10https://gerrit.wikimedia.org/r/906131 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [22:20:59] !log restarting Cassandra on sessionstore1001 to apply (intentionally) unreachable native transport — T327954 [22:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:04] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [22:23:01] bblack: oh, this is going to create a service alert (port 9042) [22:23:40] yeah, probably :) [22:23:44] can downtime it! [22:24:22] ha, got it before it paged anyway [22:24:39] * urandom spikes the ball [22:25:02] * brett writhes in pain on the ground [22:33:30] !log rebooting Cassandra on sessionstore1001 — T327954 [22:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:35] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [22:36:32] !log enabling lsw1-e1-eqiad port et-0/0/51 to ssw1-e1-eqiad et-0/0/80 T322937 [22:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:36] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [22:42:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces 
up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:53:54] ^^^ these alerts are related to issue our transport provider is having on path from eqiad to codfw. [22:54:03] emails to noc are about the same thing [22:54:05] currently link is up and stable for ~11min [22:59:36] 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dwisehaupt) [23:02:40] (03CR) 10Cwhite: [C: 03+1] sre: mute etcd-mirror pint promql checks [alerts] - 10https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [23:03:17] (03CR) 10Cwhite: [C: 03+1] C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [23:33:13] (03PS5) 10Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896837 [23:35:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by legoktm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896837 (owner: 10Legoktm) [23:36:30] (03Merged) 10jenkins-bot: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896837 (owner: 10Legoktm) [23:36:56] !log legoktm@deploy2002 Started scap: Backport for [[gerrit:896837|Add to verify Mastodon account on mediawiki.org]] [23:38:22] !log legoktm@deploy2002 legoktm: Backport for [[gerrit:896837|Add to verify Mastodon account on mediawiki.org]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [23:39:05] > [23:44:43] !log legoktm@deploy2002 Finished scap: Backport for [[gerrit:896837|Add to verify Mastodon account on mediawiki.org]] (duration: 07m 47s) [23:46:04] got the verified tick :D [23:46:49] (03PS2) 10Legoktm: Remove 
misleading "disable" of Special:Mostlinkedcategories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) [23:49:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by legoktm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) (owner: 10Legoktm) [23:50:22] (03Merged) 10jenkins-bot: Remove misleading "disable" of Special:Mostlinkedcategories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) (owner: 10Legoktm) [23:50:46] !log legoktm@deploy2002 Started scap: Backport for [[gerrit:804805|Remove misleading "disable" of Special:Mostlinkedcategories (T310456)]] [23:50:50] T310456: Re-enable daily updates of formerly slow enwiki QueryPages - https://phabricator.wikimedia.org/T310456 [23:52:08] !log legoktm@deploy2002 legoktm: Backport for [[gerrit:804805|Remove misleading "disable" of Special:Mostlinkedcategories (T310456)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [23:53:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10wiki_willy) Thanks for trying out the reimage solution @MatthewVernon. It helps us progress things along further with the Dell support request. The latest note from Dell is that... 
[23:53:26] PROBLEM - zuul_merger_service_running on contint2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [23:54:18] PROBLEM - Check systemd state on contint2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_zuul-merger.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:50] !log rebooting Cassandra on sessionstore1001 — T327954 [23:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:54] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [23:58:41] !log legoktm@deploy2002 Finished scap: Backport for [[gerrit:804805|Remove misleading "disable" of Special:Mostlinkedcategories (T310456)]] (duration: 07m 55s) [23:58:45] T310456: Re-enable daily updates of formerly slow enwiki QueryPages - https://phabricator.wikimedia.org/T310456