[00:01:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3054.esams.wmnet with OS bullseye [00:01:54] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3054.esams.wmnet with OS bullseye completed: - cp3054 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [00:02:10] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3054.esams.wmnet [00:10:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [00:12:36] PROBLEM - MariaDB Replica Lag: s3 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 740.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:12:50] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3055.esams.wmnet with reason: host reimage [00:15:04] PROBLEM - MariaDB Replica Lag: s4 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 890.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:15:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3055.esams.wmnet with reason: host reimage [00:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:22:52] PROBLEM - MariaDB Replica Lag: s1 on db1140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1356.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:23:42] PROBLEM - Disk space on an-airflow1001 is CRITICAL: DISK CRITICAL - free space: / 929 MB (2% inode=71%): /tmp 929 MB (2% inode=71%): /var/tmp 929 
MB (2% inode=71%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-airflow1001&var-datasource=eqiad+prometheus/ops [00:37:50] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3055.esams.wmnet with OS bullseye [00:37:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3055.esams.wmnet with OS bullseye completed: - cp3055 (**PASS**) - Removed from Puppet and PuppetDB if present -... [00:38:24] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3055.esams.wmnet [00:41:30] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [01:18:51] (03PS1) 10Dzahn: gitlab/cloud: set sshd listen address for gitlab-prod-1002 [puppet] - 10https://gerrit.wikimedia.org/r/885445 (https://phabricator.wikimedia.org/T318521) [01:19:27] (03CR) 10Dzahn: [C: 03+2] gitlab/cloud: set sshd listen address for gitlab-prod-1002 [puppet] - 10https://gerrit.wikimedia.org/r/885445 (https://phabricator.wikimedia.org/T318521) (owner: 10Dzahn) [01:21:52] (03CR) 10Dzahn: [C: 03+2] "puppet run on gitlab-prod-1002 is now unbroken" [puppet] - 10https://gerrit.wikimedia.org/r/885445 (https://phabricator.wikimedia.org/T318521) (owner: 10Dzahn) [02:05:02] RECOVERY - MariaDB Replica Lag: s1 on db1140 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:04] (ProbeDown) firing: (2) Service 
centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:20:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:08] RECOVERY - MariaDB Replica Lag: s3 on db1102 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:27:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [02:32:30] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10VirginiaPoundstone) [02:39:10] RECOVERY - MariaDB Replica Lag: s4 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:12:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:21:30] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install2004), Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:19:54] PROBLEM - MariaDB Replica 
Lag: s1 on db1140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1041.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:19:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:42:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [06:42:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [06:42:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:43:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P43520 and previous config saved to /var/cache/conftool/dbconfig/20230201-064311-ladsgroup.json [06:44:20] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:30] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:46:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:22] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P43521 and previous config saved to /var/cache/conftool/dbconfig/20230201-064828-ladsgroup.json [06:50:10] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T0700) [07:03:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P43522 and previous config saved to /var/cache/conftool/dbconfig/20230201-070335-ladsgroup.json [07:08:22] RECOVERY - MariaDB Replica Lag: s1 on db1140 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:18:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P43523 and previous config saved to /var/cache/conftool/dbconfig/20230201-071841-ladsgroup.json [07:33:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P43524 and previous config saved to /var/cache/conftool/dbconfig/20230201-073348-ladsgroup.json [07:34:04] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:48:34] (03PS1) 10Muehlenhoff: Enable DHCP on 
install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885623 [07:51:07] (03CR) 10Muehlenhoff: [C: 03+2] Enable DHCP on install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885623 (owner: 10Muehlenhoff) [07:57:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 55821 [08:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T0800). [08:00:05] phedenskog and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:45] I'm here. [08:00:54] you can self-serve? [08:00:55] 10SRE, 10SRE-Access-Requests: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10santhosh) [08:01:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 55821 [08:02:18] Amir1: I can test and verify that things don't break? [08:02:37] you can't deploy? okay, I'll do it for you then [08:03:03] (03PS5) 10Ladsgroup: Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [08:03:09] (03CR) 10Ladsgroup: [C: 03+2] Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [08:03:30] Amir1: thank you! 
[08:03:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [08:03:57] (03Merged) 10jenkins-bot: Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [08:04:17] thank you for the clean up. Much appreciated [08:04:28] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:726854|Remove unused eventlogging_RUMSpeedIndex stream (T286700)]] [08:04:31] T286700: Remove RUM Speed Index from the Navigation Timing extension - https://phabricator.wikimedia.org/T286700 [08:04:40] (03CR) 10Ayounsi: [C: 03+1] Point to install2004 for DHCP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/885326 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:05:33] !log installing libarchive security updates [08:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:17] !log ladsgroup@deploy1002 phedenskog and ladsgroup: Backport for [[gerrit:726854|Remove unused eventlogging_RUMSpeedIndex stream (T286700)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:07:18] 10SRE, 10SRE-Access-Requests: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10santhosh) [08:07:27] phedenskog: it's in mwdebug, is it testable there? [08:07:51] I'm also here :) [08:08:30] Amir1: No, the way to check is that metrics are still coming in for our navtiming schema; the easiest is to check https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&from=now-15m&to=now [08:09:01] okay, I'm pushing it since it's straightforward [08:09:11] Superpes: I'll get to your patches soon [08:09:22] Yep, yep, no rush! 
I'll wait for you :) [08:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:14:10] 10SRE, 10Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10ayounsi) [08:14:21] 10SRE, 10Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10ayounsi) wow I didn't know about this task :) Since then a lot happened, but active/active netbox is still something we're looking at doing. We have https://gerrit.wikimedia.org/r/c/operations/dns/+/808198 bloc... [08:14:43] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:726854|Remove unused eventlogging_RUMSpeedIndex stream (T286700)]] (duration: 10m 15s) [08:14:47] T286700: Remove RUM Speed Index from the Navigation Timing extension - https://phabricator.wikimedia.org/T286700 [08:15:17] phedenskog: it's deployed. Does it look good? [08:15:24] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netbox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10ayounsi) [08:15:27] (03PS3) 10Ladsgroup: Remove former EventLogging streams for navtiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103) (owner: 10Krinkle) [08:16:13] Amir1: Looks good, thank you! [08:16:22] (03CR) 10Ladsgroup: [C: 03+2] Remove former EventLogging streams for navtiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103) (owner: 10Krinkle) [08:16:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103) (owner: 10Krinkle) [08:16:39] awesome. 
Moving forward with the second one [08:16:40] (03CR) 10JMeybohm: [C: 03+1] pontoon: default to not block_abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/885370 (owner: 10Filippo Giunchedi) [08:16:56] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10ayounsi) [08:16:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:17:04] (03PS1) 10Elukey: admin_ng: promote the inference-staging's TLS SANs to inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/885625 (https://phabricator.wikimedia.org/T327302) [08:17:08] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [08:17:18] (03Merged) 10jenkins-bot: Remove former EventLogging streams for navtiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103) (owner: 10Krinkle) [08:17:44] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:879926|Remove former EventLogging streams for navtiming (T281103 T286703 T308621 T323623)]] [08:17:51] T323623: Decomission the FirstInputTiming instrument from webperf pipeline - https://phabricator.wikimedia.org/T323623 [08:17:52] T286703: Navigation Timing cleanup - https://phabricator.wikimedia.org/T286703 [08:17:52] T281103: Update how we measure LayoutShift - https://phabricator.wikimedia.org/T281103 [08:17:52] T308621: Remove inactive code for Element Timing experiment - https://phabricator.wikimedia.org/T308621 [08:19:20] (03CR) 10Stevemunene: [C: 03+2] Add authzIdentity to jaas config [deployment-charts] - 10https://gerrit.wikimedia.org/r/885360 
(https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [08:19:31] !log ladsgroup@deploy1002 ladsgroup and krinkle: Backport for [[gerrit:879926|Remove former EventLogging streams for navtiming (T281103 T286703 T308621 T323623)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:19:52] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 6 hosts [08:19:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts [08:25:03] (03CR) 10Elukey: [C: 03+2] admin_ng: promote the inference-staging's TLS SANs to inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/885625 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [08:25:05] (03Merged) 10jenkins-bot: Add authzIdentity to jaas config [deployment-charts] - 10https://gerrit.wikimedia.org/r/885360 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [08:25:26] Superpes: two of your patches are marked as WIP [08:25:31] is that intentional? [08:25:34] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884934 (https://phabricator.wikimedia.org/T328357) (owner: 10Superpes15) [08:26:04] Yep now one is in review! I should simply test the wordmark on cswikt and after the merge do a rebase and test the wordmark on trwiki! I added the other tasks but maybe they are out of the inclusion policy (although I saw that other users add similar tasks)! 
[08:26:05] (03CR) 10Ladsgroup: [C: 03+2] Add mobile wordmark to cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884934 (https://phabricator.wikimedia.org/T328357) (owner: 10Superpes15) [08:26:51] (03Merged) 10jenkins-bot: Add mobile wordmark to cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884934 (https://phabricator.wikimedia.org/T328357) (owner: 10Superpes15) [08:27:27] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:879926|Remove former EventLogging streams for navtiming (T281103 T286703 T308621 T323623)]] (duration: 09m 42s) [08:27:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:27:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:27:34] T323623: Decomission the FirstInputTiming instrument from webperf pipeline - https://phabricator.wikimedia.org/T323623 [08:27:34] T286703: Navigation Timing cleanup - https://phabricator.wikimedia.org/T286703 [08:27:34] T281103: Update how we measure LayoutShift - https://phabricator.wikimedia.org/T281103 [08:27:35] T308621: Remove inactive code for Element Timing experiment - https://phabricator.wikimedia.org/T308621 [08:27:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:27:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:27:52] phedenskog: it should be deployed now ^_^ [08:27:55] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:884934|Add mobile wordmark to cswiktionary (T328357)]] [08:27:58] (03PS1) 10JMeybohm: Revert "Switch staging.svc.eqiad.wmnet to point to codfw k8s" [dns] - 10https://gerrit.wikimedia.org/r/885746 [08:27:58] T328357: Czech mobile Wiktionary - Czech wordmark - https://phabricator.wikimedia.org/T328357 [08:28:32] (03PS1) 10JMeybohm: Revert "Switch the active staging cluster to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/885747 [08:28:44] (03PS2) 10JMeybohm: Revert "Switch the active staging cluster to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/885747 [08:28:55] (03PS2) 10JMeybohm: Revert "Switch staging.svc.eqiad.wmnet to point to codfw k8s" [dns] - 10https://gerrit.wikimedia.org/r/885746 [08:29:08] Amir1: Looks good, thank you again! :) [08:29:44] !log ladsgroup@deploy1002 superpes and ladsgroup: Backport for [[gerrit:884934|Add mobile wordmark to cswiktionary (T328357)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:30:18] Superpes: live in mwdebug, please check [08:30:27] Doing [08:32:23] (03CR) 10Hashar: phabricator: ensure phd uid/gid can not be changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [08:33:21] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: remove Debian revision suffix from version check [cookbooks] - 10https://gerrit.wikimedia.org/r/885385 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [08:33:49] (03CR) 10JMeybohm: [C: 03+2] Revert "Switch the active staging cluster to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/885747 (owner: 10JMeybohm) [08:34:26] Amir1 Seems good [08:34:40] cool syncing [08:35:00] (03CR) 10JMeybohm: [C: 03+2] Revert "Switch staging.svc.eqiad.wmnet to point to codfw k8s" [dns] - 
10https://gerrit.wikimedia.org/r/885746 (owner: 10JMeybohm) [08:35:18] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: remove Debian revision suffix from version check [cookbooks] - 10https://gerrit.wikimedia.org/r/885385 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [08:37:56] Superpes: the trwiktionary one is also WIP [08:38:09] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885321 (https://phabricator.wikimedia.org/T328499) (owner: 10Superpes15) [08:38:13] it's fine, I just don't want to deploy unfinished patches :D [08:38:23] @Amir1 Lol [08:38:24] (03CR) 10Ladsgroup: [C: 03+2] Add a wordmark to trwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885321 (https://phabricator.wikimedia.org/T328499) (owner: 10Superpes15) [08:39:05] (03Merged) 10jenkins-bot: Add a wordmark to trwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885321 (https://phabricator.wikimedia.org/T328499) (owner: 10Superpes15) [08:40:21] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:884934|Add mobile wordmark to cswiktionary (T328357)]] (duration: 12m 26s) [08:40:24] T328357: Czech mobile Wiktionary - Czech wordmark - https://phabricator.wikimedia.org/T328357 [08:40:40] (03PS2) 10Muehlenhoff: Move next-server settings from install2003->2004 [puppet] - 10https://gerrit.wikimedia.org/r/885336 (https://phabricator.wikimedia.org/T327867) [08:41:01] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:885321|Add a wordmark to trwiktionary (T328499)]] [08:41:06] T328499: Change Turkish Wiktionary logo - https://phabricator.wikimedia.org/T328499 [08:42:51] !log ladsgroup@deploy1002 superpes and ladsgroup: Backport for [[gerrit:885321|Add a wordmark to trwiktionary (T328499)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:43:02] Superpes: mwdebug [08:43:07] Amir1 It's good [08:43:29] 
coolio, syncing [08:43:54] Amir1 Many thanks :D [08:43:57] (03CR) 10Muehlenhoff: [C: 03+2] Move next-server settings from install2003->2004 [puppet] - 10https://gerrit.wikimedia.org/r/885336 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:44:23] thank you for making the patches and doing the work, I'm just running some fancy mostly automated commands [08:44:40] and a good distraction from dealing with betterworks [08:44:50] :D [08:45:53] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=k8s-ingress-staging [08:45:54] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=k8s-ingress-staging [08:46:22] I'm following a boring remote lesson on medical emergencies, so at least I'm doing something less boring... At least, less boring for me! [08:46:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [08:49:06] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:885321|Add a wordmark to trwiktionary (T328499)]] (duration: 08m 05s) [08:49:15] (03PS6) 10Ladsgroup: Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [08:49:20] (03CR) 10Ladsgroup: [C: 03+2] Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [08:49:20] T328499: Change Turkish Wiktionary wordmark logo - https://phabricator.wikimedia.org/T328499 [08:49:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [08:49:57] (03Merged) 10jenkins-bot: Create additional namespaces on shn.wikibooks [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [08:50:22] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:883620|Create additional namespaces on shn.wikibooks (T327850)]] [08:50:52] T327850: Create additional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T327850 [08:52:11] !log ladsgroup@deploy1002 superpes and ladsgroup: Backport for [[gerrit:883620|Create additional namespaces on shn.wikibooks (T327850)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [08:53:23] Superpes: mwdebug [08:54:21] (03PS1) 10QChris: Add .gitreview [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/885733 [08:54:23] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/885733 (owner: 10QChris) [08:54:29] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:54:31] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main [08:54:41] (03CR) 10Muehlenhoff: [C: 03+2] Point to install2004 for DHCP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/885326 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:55:04] Doing [08:59:21] Ok! It's actually good! [08:59:37] awesome [09:01:11] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885734 (https://phabricator.wikimedia.org/T328521) [09:01:39] Amir1: can you please ping me once done, so i can deploy ^^? [09:01:51] sure [09:01:57] ty [09:02:10] Superpes: I'd do the last one in the next window, we have had a lot [09:02:22] Yep lol [09:02:49] No rush :D Thanks for your effort! 
:P [09:05:28] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:883620|Create additional namespaces on shn.wikibooks (T327850)]] (duration: 15m 06s) [09:05:32] T327850: Create additional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T327850 [09:05:42] ^_^ [09:05:48] urbanecm: the floor is yours [09:05:56] ty [09:06:05] (03CR) 10Urbanecm: [C: 03+2] Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885734 (https://phabricator.wikimedia.org/T328521) (owner: 10Urbanecm) [09:06:26] !log urbanecm@deploy1002 backport aborted: (duration: 00m 01s) [09:06:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885734 (https://phabricator.wikimedia.org/T328521) (owner: 10Urbanecm) [09:06:45] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885734 (https://phabricator.wikimedia.org/T328521) (owner: 10Urbanecm) [09:07:08] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:885734|Add new throttle rule (T328521)]] [09:07:10] T328521: Request a throttle lift for a Czech senior citizens writing Wikipedia course –2023-02-02 - https://phabricator.wikimedia.org/T328521 [09:09:28] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: default to not block_abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/885370 (owner: 10Filippo Giunchedi) [09:14:32] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:885734|Add new throttle rule (T328521)]] (duration: 07m 24s) [09:14:36] T328521: Request a throttle lift for a Czech senior citizens writing Wikipedia course –2023-02-02 - https://phabricator.wikimedia.org/T328521 [09:15:30] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:49] !log Clean sign up throttle for IP 
195.113.145.2 (via resetAuthenticationThrottle.php; T328521) [09:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:14] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:31:59] (03CR) 10Clément Goubert: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 (owner: 10Giuseppe Lavagetto) [09:35:35] (03PS1) 10Muehlenhoff: Disable DHCP on install2003 [puppet] - 10https://gerrit.wikimedia.org/r/885737 [09:37:41] (03PS1) 10Ayounsi: Peering news: move verbose logs [puppet] - 10https://gerrit.wikimedia.org/r/885738 [09:41:56] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [09:43:56] (03CR) 10Muehlenhoff: [C: 03+2] Disable DHCP on install2003 [puppet] - 10https://gerrit.wikimedia.org/r/885737 (owner: 10Muehlenhoff) [09:47:54] !log upgrade grafana to 8.5.20 on grafana2001 - T328405 [09:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:57] T328405: Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 [09:48:11] (03PS1) 10Elukey: admin_ng: add missing TLS SAN for the inference endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/885740 (https://phabricator.wikimedia.org/T327302) [09:49:32] (03PS4) 10Clément Goubert: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 (owner: 10Giuseppe Lavagetto) [09:50:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host krb2002.codfw.wmnet with OS bullseye [09:50:21] 10SRE, 10ops-codfw, 10DC-Ops, 
10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host krb2002.codfw.wmnet with OS bullseye [09:53:13] jouncebot: next [09:53:13] In 1 hour(s) and 6 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1100) [09:53:47] I'm upgrading grafana shortly, it'll be briefly unavailable during the restart [09:55:42] ack [09:57:04] !log upgrade grafana to 8.5.20 on grafana1002 - T328405 [09:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:08] T328405: Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 [09:57:24] (we're back) [10:01:56] !log upgrade grafana to 8.5.20 on cloudmetrics* - T328405 [10:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:06] (03PS7) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) [10:04:52] (03CR) 10Ilias Sarantopoulos: feat: add json payload capability (035 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [10:05:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: host reimage [10:08:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: host reimage [10:12:36] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:31] (03CR) 10Elukey: [C: 03+2] admin_ng: add missing TLS SAN for the inference endpoint [deployment-charts] - 
10https://gerrit.wikimedia.org/r/885740 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:16:04] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:13] 10SRE, 10Infrastructure Security, 10observability: Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete -- we went with 8.5.20 (the latest available, and not 8.5.16) since the latter shouldn't be used and it... [10:19:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:20:44] 10SRE, 10SRE-Access-Requests: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10Arrbee) This is an approved request for Santhosh [10:21:35] (03PS1) 10Stevemunene: Add authzIdentity to jaas config chart increment [deployment-charts] - 10https://gerrit.wikimedia.org/r/885786 (https://phabricator.wikimedia.org/T327884) [10:23:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host krb2002.codfw.wmnet with OS bullseye [10:23:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host krb2002.codfw.wmnet with OS bullseye completed: - krb2002 (**PAS... [10:34:28] (03PS1) 10Slyngshede: C:idm::deployment collect static files [puppet] - 10https://gerrit.wikimedia.org/r/885787 [10:41:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[10:41:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:42:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:42:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:42:11] !log start running migrateRevisionCommentTemp in remaining sections (for now except s3) in screens # T275246 [10:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:14] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [10:42:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:42:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:47:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks okay – I’m undecided whether the Wikibase.php block for $wmgWikibaseRestApiDevelopmentEnabled should be inside the $wmgWikibaseRestA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [10:52:16] !log Deploying refinery for ops week [10:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:31] (03PS1) 10JMeybohm: WIP: Allow for differnt staging values per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/885791 [10:54:18] !log stevemunene@deploy1002 Started deploy [analytics/refinery@a8840b0]: Regular analytics weekly train [analytics/refinery@a8840b0] [10:57:00] (03PS3) 10Ladsgroup: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) [10:57:14] jouncebot: nowandnext [10:57:14] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [10:57:14] In 0 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1100) [10:57:18] (03CR) 10Silvan 
Heintze: [C: 03+1] "thanks, LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [10:57:41] (03CR) 10CI reject: [V: 04-1] Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [10:58:47] !log stevemunene@deploy1002 Finished deploy [analytics/refinery@a8840b0]: Regular analytics weekly train [analytics/refinery@a8840b0] (duration: 04m 29s) [10:59:38] !log stevemunene@deploy1002 Started deploy [analytics/refinery@a8840b0] (thin): Regular analytics weekly train THIN [analytics/refinery@a8840b0] [10:59:43] !log stevemunene@deploy1002 Finished deploy [analytics/refinery@a8840b0] (thin): Regular analytics weekly train THIN [analytics/refinery@a8840b0] (duration: 00m 05s) [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1100) [11:00:19] !log stevemunene@deploy1002 Started deploy [analytics/refinery@a8840b0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@a8840b0] [11:01:07] (03PS4) 10Ladsgroup: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) [11:01:23] (03CR) 10Btullis: [C: 03+1] Add authzIdentity to jaas config chart increment [deployment-charts] - 10https://gerrit.wikimedia.org/r/885786 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [11:01:37] !log stevemunene@deploy1002 Finished deploy [analytics/refinery@a8840b0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@a8840b0] (duration: 01m 18s) [11:01:47] (03PS2) 10Slyngshede: C:idm::deployment collect static files [puppet] - 10https://gerrit.wikimedia.org/r/885787 [11:02:03] (03CR) 10CI reject: [V: 04-1] Move CirrusSearch settings from 
IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [11:02:05] (03PS5) 10Btullis: Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) [11:02:45] (03PS5) 10Ladsgroup: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) [11:02:58] (03CR) 10Btullis: [C: 03+2] Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [11:03:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39349/console" [puppet] - 10https://gerrit.wikimedia.org/r/885787 (owner: 10Slyngshede) [11:03:56] (03CR) 10Ladsgroup: [C: 03+2] "let's get this party started" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [11:04:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:41] (03Merged) 10jenkins-bot: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [11:14:13] !log ladsgroup@deploy1002 Synchronized wmf-config/ext-CirrusSearch.php: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php, part I (T308932) (duration: 07m 04s) [11:14:16] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [11:16:02] RECOVERY - Check systemd state on 
an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:10] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@e1ca693] (eqiad): Allow stylesheets through CSP [11:17:02] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@e1ca693] (eqiad): Allow stylesheets through CSP (duration: 00m 51s) [11:17:45] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet [11:20:55] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@e1ca693] (codfw): Allow stylesheets through CSP [11:21:49] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:21:54] !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php, part II (T308932) (duration: 07m 04s) [11:21:57] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [11:22:40] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@e1ca693] (codfw): Allow stylesheets through CSP (duration: 01m 45s) [11:22:47] awight: ^ done [11:23:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/883113 (owner: 10Slyngshede) [11:23:15] nemo-yiannis: All good. 
Here's a geoshape, https://test.wikipedia.org/wiki/User:Adamw/Sandbox/Externaldata [11:24:37] Here are the varnish logs: https://logstash.wikimedia.org/goto/11b049f15423097a61ea1d3ae79153ff [11:24:44] let's keep an eye out for errors [11:24:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:25:20] (03CR) 10EoghanGaffney: [C: 03+2] Add /var/log/mail.{log,info,err,warn} to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/885294 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [11:27:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/885787 (owner: 10Slyngshede) [11:27:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:27:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:27:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2001.codfw.wmnet [11:27:49] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertma... 
[11:28:56] (03PS1) 10Muehlenhoff: Reimage testvm2002 with Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/885795 [11:29:01] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php, part III (T308932) (duration: 06m 43s) [11:29:05] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [11:33:42] awight: theoretically we should see performance improvements with geoshapes handling right ? cc effie [11:34:47] oh lovely [11:35:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/885415 (owner: 10Slyngshede) [11:42:02] (03CR) 10Muehlenhoff: [C: 03+2] Reimage testvm2002 with Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/885795 (owner: 10Muehlenhoff) [11:47:39] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:11] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:57] (03PS20) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [12:12:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39350/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [12:15:54] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for testvm2002.codfw.wmnet: Renew puppet certificate - jmm@cumin2002 [12:16:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for testvm2002.codfw.wmnet: Renew puppet certificate - jmm@cumin2002 [12:18:54] nemo-yiannis: effie: no sorry, not yet. 
We temporarily enabled that feature (T326317) on Monday but reverted when we found that the migration was still a bit rough. [12:18:54] T326317: Deploy geoshape expansion to wikis - https://phabricator.wikimedia.org/T326317 [12:20:50] (03CR) 10Slyngshede: P:IDM Configure OIDC and LDAP. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [12:21:37] While it was briefly deployed, there was a service interruption for unrelated reasons, and finally the rough migration affects numbers by causing broken maps which would result in no secondary service request. [12:21:46] Graphs can be seen here, https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=1675080932897&to=1675091011341 [12:22:10] (03CR) 10Slyngshede: [V: 03+2 C: 03+1] Switch to built in LogoutView. [software/bitu] - 10https://gerrit.wikimedia.org/r/883113 (owner: 10Slyngshede) [12:22:15] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Switch to built in LogoutView. [software/bitu] - 10https://gerrit.wikimedia.org/r/883113 (owner: 10Slyngshede) [12:22:46] (03CR) 10Slyngshede: [C: 03+2] C:apereo_cas fix memberOf to group mapping in OIDC. 
[puppet] - 10https://gerrit.wikimedia.org/r/885415 (owner: 10Slyngshede) [12:24:00] Correction: https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=1675070132000&to=1675091011000 [12:24:55] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:21] (03CR) 10Jbond: [C: 03+1] "noticed a few extra nits" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [12:27:27] (03PS1) 10Matthias Mullie: Squashed diff to catch up to master [extensions/SearchVue] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885798 [12:27:36] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2003.codfw.wmnet [12:27:45] (03CR) 10Matthias Mullie: [C: 03+1] Squashed diff to catch up to master [extensions/SearchVue] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885798 (owner: 10Matthias Mullie) [12:28:47] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:31:41] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:37:45] 10SRE, 10ops-codfw, 10DC-Ops: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (10Clement_Goubert) 2 more mw hosts affected: ` cgoubert@cumin1001:~$ sudo cumin 'mw23[26,29,32].codfw.wmnet' 'ipmi-sel | tail -n... 
[12:37:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:38:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (10Clement_Goubert) [12:38:18] (03PS10) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [12:40:07] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:41:15] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [12:41:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:41:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2003.codfw.wmnet [12:41:30] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2003.codfw.wmnet` - testvm2003.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertma... 
[12:41:31] (03PS11) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [12:42:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:44:34] (03CR) 10Stevemunene: [C: 03+2] Add authzIdentity to jaas config chart increment [deployment-charts] - 10https://gerrit.wikimedia.org/r/885786 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [12:45:11] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [12:45:59] (03PS1) 10Muehlenhoff: Remove installserver role from install2003 [puppet] - 10https://gerrit.wikimedia.org/r/885799 [12:46:20] (03CR) 10CI reject: [V: 04-1] Remove installserver role from install2003 [puppet] - 10https://gerrit.wikimedia.org/r/885799 (owner: 10Muehlenhoff) [12:48:54] (03PS2) 10Muehlenhoff: Remove installserver role from install2003 [puppet] - 10https://gerrit.wikimedia.org/r/885799 (https://phabricator.wikimedia.org/T327867) [12:49:49] (03Merged) 10jenkins-bot: Add authzIdentity to jaas config chart increment [deployment-charts] - 10https://gerrit.wikimedia.org/r/885786 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [12:51:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install2003 [puppet] - 10https://gerrit.wikimedia.org/r/885799 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [12:55:34] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) install2004 is fully in service now, I have tested a reimage on baremetal and VM installation 
successfully. atftpd, dhcpd and nginx have been stop... [12:55:35] (03CR) 10Clément Goubert: "Comment inline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [12:55:57] (03PS21) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [12:56:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [12:56:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:58:45] PROBLEM - TFTP service on install2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [12:59:07] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:00:28] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:00:46] ^ install2003 is monitoring glitch, will vanish once Puppet has run on alert1001 [13:06:44] (03PS12) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:07:12] (03PS22) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:07:34] (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:08:22] (03PS23) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:09:05] (03PS13) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:11:02] (03PS14) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:13:49] (03PS15) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:17:24] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [13:18:46] (03PS1) 10Muehlenhoff: Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) [13:18:48] (03PS1) 10Muehlenhoff: Point DHCP server in ulsfo to install4002 [homer/public] - 10https://gerrit.wikimedia.org/r/885806 (https://phabricator.wikimedia.org/T327867) [13:21:34] !log installing curl security updates on bullseye [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:53] (03CR) 10Ayounsi: [C: 03+1] Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:23:30] (03CR) 10Ayounsi: [C: 03+1] "typo then lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/885806 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:24:13] (03PS2) 10Muehlenhoff: Point DHCP server in ulsfo to install4002 [homer/public] - 10https://gerrit.wikimedia.org/r/885806 (https://phabricator.wikimedia.org/T327867) [13:24:25] (03CR) 10Muehlenhoff: Point DHCP server in ulsfo to install4002 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/885806 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:26:26] (03PS1) 10Volans: requests: allow to skip the session retry logic [software/pywmflib] - 
10https://gerrit.wikimedia.org/r/885808 [13:33:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39351/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:36:55] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5002.wikimedia.org [13:40:20] (03PS2) 10Jdrewniak: Add cswiki to desktop-improvements group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885391 (https://phabricator.wikimedia.org/T328154) [13:40:33] (03PS24) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:41:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39352/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:41:56] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:43:32] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:44:24] (03PS1) 10Muehlenhoff: Remove bast3005/bast5002 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/885810 (https://phabricator.wikimedia.org/T324974) [13:45:05] (03PS25) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:45:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) (owner: 10Majavah) [13:46:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39353/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:46:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) [13:46:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast3005/bast5002 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/885810 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [13:47:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:47:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:47:38] (03CR) 10Slyngshede: [V: 03+1] P:IDM Configure OIDC and LDAP. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:48:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:48:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5002.wikimedia.org [13:48:38] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [13:50:43] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) [13:51:21] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3005.wikimedia.org [13:53:25] (03CR) 10Majavah: [C: 03+2] kubernetes: Use the shared image-config configmap [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) (owner: 10Majavah) [13:54:08] (03Merged) 10jenkins-bot: kubernetes: Use the shared image-config configmap [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) (owner: 10Majavah) [13:55:21] (03CR) 10Filippo Giunchedi: "LGTM overall! See optional nit inline." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [13:55:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:57:35] (03PS1) 10Majavah: d/changelog: Prepare for 0.89 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885811 [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1400). [14:00:05] matthiasmullie and jan_drewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] o/ [14:00:26] o/ [14:01:06] * TheresNoTime unable to deploy today [14:02:16] PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:02:50] matthiasmullie: jan_drewniak: Hi, I can deploy. [14:03:41] mine can skip mwdebug, nothing can be tested [14:03:54] I can also self-serve, if that's more convenient for you, awight [14:04:26] matthiasmullie: either way--I'm logged-in and ready though [14:04:40] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:43] alright, sure, go ahead then, if you don't mind :) [14:04:47] :-D [14:05:24] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy1002 using scap backport" [extensions/SearchVue] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885798 (owner: 10Matthias Mullie) [14:05:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3005.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:06:14] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:24] !log updating perf on Bullseye hosts [14:06:25] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:40] (03PS16) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [14:06:50] RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [14:06:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3005.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:07:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:07:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3005.wikimedia.org [14:07:09] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3005.wikimedia.org` - bast3005.wikimedia.org (**PASS**) - Downtimed host on... [14:07:27] (03Merged) 10jenkins-bot: Squashed diff to catch up to master [extensions/SearchVue] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885798 (owner: 10Matthias Mullie) [14:07:53] !log awight@deploy1002 Started scap: Backport for [[gerrit:885798|Squashed diff to catch up to master]] [14:08:02] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:05] jan_drewniak: Is this config change safe if files land in any order? [14:09:05] awight: yup, I've used the scap backport command with this kind of change with no issue in the past [14:09:25] Okay thanks, I couldn't quite imagine how each of the pieces fits together. 
[14:09:43] !log awight@deploy1002 mlitn and awight: Backport for [[gerrit:885798|Squashed diff to catch up to master]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:10:02] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:05] (continuing without mwdebug testing) [14:10:11] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [14:10:23] awight: please hold [14:10:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM in general, there's a few details that can be improved, but good job!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [14:10:41] taavi: Okay [14:11:04] matthiasmullie's patch removes the config variable in the same patch as its uses which won't work [14:11:30] !log awight@deploy1002 sync-world aborted: Backport for [[gerrit:885798|Squashed diff to catch up to master]] (duration: 03m 36s) [14:11:30] !log awight@deploy1002 backport aborted: (duration: 06m 09s) [14:11:42] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:45] taavi: It has been partially deployed [14:12:45] taavi: this is only enabled on 3 wikipedias, and none are running that branch yet, so I think this should be fine? [14:13:52] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:01] it should be fine if the code isn't used. 
but a similar thing took down all wikis for 15 mins a couple of weeks ago so I'd be extra careful [14:14:31] (03PS1) 10Slyngshede: Add social_auth pipeline for group creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/885813 [14:14:52] I cancelled deployment midway through syncing apaches, with no 5xx increase so it seems we've proven that it's safe the hard way :-) [14:15:05] awight: yeah, you need to either re-deploy it or revert [14:15:42] Exactly. matthiasmullie: Let me know which way you prefer to go--no rush. [14:16:13] (and sorry for interrupting it that late) [14:16:18] (03CR) 10Slyngshede: "Pipeline for adding Django users to the groups returned from CAS, and create the groups if required." [software/bitu] - 10https://gerrit.wikimedia.org/r/885813 (owner: 10Slyngshede) [14:16:38] awight: let's move forward & deploy, then? [14:16:44] taavi: sorry here too, I didn't quite understand your polite note in time [14:16:50] matthiasmullie: sounds good--redeploying now [14:17:04] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:38] !log awight@deploy1002 Started scap: Backport for [[gerrit:885798|Squashed diff to catch up to master]] [14:17:43] thanks, and sorry for the confusion - didn't even realize this could've been an issue within extensions! 
[14:18:06] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:09] +1 this is an unfortunate thing but should go away once we containerize the deployment pipeline [14:18:31] yeah, it is a relatively new failure mode [14:19:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:06] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:14] (03CR) 10Muehlenhoff: P:IDM Configure OIDC and LDAP. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [14:19:29] !log awight@deploy1002 awight and mlitn: Backport for [[gerrit:885798|Squashed diff to catch up to master]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:20:28] (03PS1) 10Ayounsi: codfw-stating: allow own AS path loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/885814 (https://phabricator.wikimedia.org/T328523) [14:20:38] (03PS1) 10Andrew Bogott: dumps::web::fetches::kiwix: splay rsyncs so we don't overload the kiwix server [puppet] - 10https://gerrit.wikimedia.org/r/885815 (https://phabricator.wikimedia.org/T260223) [14:20:59] (03CR) 10CI reject: [V: 04-1] dumps::web::fetches::kiwix: splay rsyncs so we don't overload the kiwix server [puppet] - 10https://gerrit.wikimedia.org/r/885815 (https://phabricator.wikimedia.org/T260223) (owner: 10Andrew Bogott) [14:21:26] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:27] (03CR) 10Volans: "reply inline as requested :)" [software/httpbb] - 10https://gerrit.wikimedia.org/r/885273 (owner: 10Ilias Sarantopoulos) [14:21:56] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:22:32] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:09] (03PS2) 10Andrew Bogott: dumps::web::fetches::kiwix: splay rsyncs so we don't overload the kiwix server [puppet] - 10https://gerrit.wikimedia.org/r/885815 (https://phabricator.wikimedia.org/T260223) [14:24:08] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:46] !log awight@deploy1002 Finished scap: Backport for [[gerrit:885798|Squashed diff to catch up to master]] (duration: 09m 07s) [14:26:59] (03PS1) 10Elukey: ml-services: update rr, outlink and nsfw model servers' docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/885823 (https://phabricator.wikimedia.org/T325528) [14:27:02] jan_drewniak: Deploying now [14:27:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885391 (https://phabricator.wikimedia.org/T328154) (owner: 10Jdrewniak) [14:27:15] awight: sounds good [14:27:42] (03Merged) 10jenkins-bot: Add cswiki to desktop-improvements group. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/885391 (https://phabricator.wikimedia.org/T328154) (owner: 10Jdrewniak) [14:28:05] !log awight@deploy1002 Started scap: Backport for [[gerrit:885391|Add cswiki to desktop-improvements group. (T328154)]] [14:28:08] T328154: Deploy Vector 2022 skin on Czech Wikipedia - https://phabricator.wikimedia.org/T328154 [14:28:29] if there is time can I squeeze in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/884155 to the deployment window? [14:28:39] it is just for Eventgate basically [14:28:51] awight: thanks! [14:28:52] elukey: sure--is it helpful if I deploy? [14:29:01] (03CR) 10Ilias Sarantopoulos: ci: add pre-commit hooks (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/885273 (owner: 10Ilias Sarantopoulos) [14:29:19] (03Abandoned) 10Ilias Sarantopoulos: ci: add pre-commit hooks [software/httpbb] - 10https://gerrit.wikimedia.org/r/885273 (owner: 10Ilias Sarantopoulos) [14:29:56] !log awight@deploy1002 jdrewniak and awight: Backport for [[gerrit:885391|Add cswiki to desktop-improvements group. (T328154)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:30:01] awight: I'd be grateful, I haven't deployed to mw in a while and I could use some help to avoid issues :) [14:30:08] elukey: sure! [14:30:10] <3 [14:30:15] * urbanecm watches cswiki changing :) [14:30:21] (03CR) 10Ayounsi: [C: 03+2] codfw-stating: allow own AS path loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/885814 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [14:30:24] jan_drewniak: cswiki config is ready to test on mwdebug [14:30:33] urbanecm: glad you're around for this! 
[14:30:57] matthiasmullie: looks fun, I'm still hoping my team will work on a vue feature one day soon [14:31:08] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885823 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:31:18] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, minus a typo. It's also staging-codfw, not codfw-staging (for reasons that... I don't remember)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885814 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [14:31:32] awight: yup, looks good to deploy! [14:31:40] ack [14:32:36] (03CR) 10Andrew Bogott: [C: 03+2] dumps::web::fetches::kiwix: splay rsyncs so we don't overload the kiwix server [puppet] - 10https://gerrit.wikimedia.org/r/885815 (https://phabricator.wikimedia.org/T260223) (owner: 10Andrew Bogott) [14:32:42] (03CR) 10Ayounsi: [C: 03+2] codfw-stating: allow own AS path loop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885814 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [14:33:08] (03CR) 10AikoChou: [C: 03+1] ml-services: update rr, outlink and nsfw model servers' docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/885823 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:35:04] elukey: "scap backport" makes the deployment infinitely less painful, just to boost your confidence for the next patches :-) [14:35:24] (03Merged) 10jenkins-bot: codfw-stating: allow own AS path loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/885814 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [14:36:13] awight: ack I'll read more during the next days, thanks again! 
[14:36:22] (03CR) 10Ayounsi: [C: 03+2] codfw-stating: allow own AS path loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/885814 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [14:36:59] it doesn't even need reading elukey, you just run "scap backport $gerritid" [14:37:28] !log awight@deploy1002 Finished scap: Backport for [[gerrit:885391|Add cswiki to desktop-improvements group. (T328154)]] (duration: 09m 22s) [14:37:31] T328154: Deploy Vector 2022 skin on Czech Wikipedia - https://phabricator.wikimedia.org/T328154 [14:37:35] Amir1: nice, but if anything goes wrong I have no idea what to do, this is my main worry [14:37:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [14:37:51] (03PS3) 10Awight: wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [14:37:54] scap backport --revert $gerritid [14:38:25] (and it's interactive, if it's broken in mwdebug, you'd say "no") [14:38:33] don't worry, it's fun :P [14:38:40] Amir1: now don't make me feel too ignorant and full of shame :D [14:38:43] The script does everything btw, including gerrit merge [14:38:56] (03CR) 10TrainBranchBot: "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [14:39:08] nah, I just want you to join the cool kids club [14:39:38] (03Merged) 10jenkins-bot: wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [14:39:46] !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[14:39:47] !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:40:04] !log awight@deploy1002 Started scap: Backport for [[gerrit:884155|wmf-config: add new revision-score streams for EventGate main (T317768)]] [14:40:07] T317768: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 [14:40:36] !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:40:37] !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:40:49] (03PS1) 10Marostegui: mariadb: Disable notifications host in A1/A8 codfw [puppet] - 10https://gerrit.wikimedia.org/r/885825 (https://phabricator.wikimedia.org/T327404) [14:41:20] urbanecm: it's live 🎉 [14:41:40] 🎉 amazing! [14:41:41] It's process-agnostic now, so in theory we won't even have to notice when we switch from git rebase to k8s... [14:41:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2136 db2158 db2157 es2026 db2106 db2146 T327404', diff saved to https://phabricator.wikimedia.org/P43530 and previous config saved to /var/cache/conftool/dbconfig/20230201-144152-root.json [14:41:55] !log awight@deploy1002 elukey and awight: Backport for [[gerrit:884155|wmf-config: add new revision-score streams for EventGate main (T317768)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:41:55] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications host in A1/A8 codfw [puppet] - 10https://gerrit.wikimedia.org/r/885825 (https://phabricator.wikimedia.org/T327404) (owner: 10Marostegui) [14:41:56] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:41:56] 
T327404: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-02-02 - https://phabricator.wikimedia.org/T327404 [14:42:36] elukey: live on mwdebug with no gross breakage, so I'm continuing with deployment [14:42:52] super thanks, there is no easy way to test [14:42:56] so we can go ahead [14:43:10] (at least not that I know of) [14:43:14] +1 I think it could only break something by introducing some huge syntax error [14:43:39] Might be possible to fire an event from the console, but there's no need to block deployment to test [14:44:12] (03PS2) 10Elukey: ml-services: update rr, outlink and nsfw model servers' docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/885823 (https://phabricator.wikimedia.org/T325528) [14:44:49] awight: I asked Andrew, but IIRC the eventgate pods need to be roll-restarted for the config to be picked up (but I could be wrong) [14:44:53] (when new streams are added) [14:46:48] Hmm I thought this config would only affect the producer [14:47:56] !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:48:29] !log awight@deploy1002 Finished scap: Backport for [[gerrit:884155|wmf-config: add new revision-score streams for EventGate main (T317768)]] (duration: 08m 25s) [14:48:30] !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:48:32] T317768: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 [14:49:45] awight: it affects eventgate IIRC since it will accept new streams, otherwise I think it refuses to validate them [14:51:24] (03CR) 10Elukey: [C: 03+2] ml-services: update rr, outlink and nsfw model servers' docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/885823 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:52:20] !log EU deployment window complete [14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:23] o/ [14:55:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=cdn [14:55:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [14:56:05] !depool cp1075.eqiad.wmnet for idrac firmware upgrade testing [14:56:05] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [14:56:08] (03Merged) 10jenkins-bot: ml-services: update rr, outlink and nsfw model servers' docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/885823 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:56:12] log cp1075.eqiad.wmnet for idrac firmware upgrade testing [14:56:13] !log cp1075.eqiad.wmnet for idrac firmware upgrade testing [14:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:21] Wednesday is going great [14:56:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi) As a datapoint, I pushed this change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/885814 and it showed up in the Bird config without a drop in the BGP sessi... 
[15:07:14] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.89 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885811 (owner: 10Majavah) [15:08:55] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.89 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885811 (owner: 10Majavah) [15:12:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:55] (03PS17) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [15:19:06] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:19:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:32] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [15:27:03] (03CR) 10Hashar: [C: 03+1] "I have send it to the puppet compiler but it ends up doing a full diff." 
[puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:29:50] (03PS1) 10DCausse: flink-app: add preliminary H/A support [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 [15:30:36] (03CR) 10CI reject: [V: 04-1] flink-app: add preliminary H/A support [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [15:31:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS bullseye [15:31:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye [15:31:51] (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [software/bitu] - 10https://gerrit.wikimedia.org/r/885813 (owner: 10Slyngshede) [15:32:55] (03CR) 10Hashar: [C: 03+1] "This change will be to switch deployment of the release Jenkins to be made with scap." [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:33:35] (03PS19) 10Hashar: jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:34:39] (03PS7) 10Hashar: jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:34:48] (03PS6) 10Hashar: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:35:17] (03CR) 10Hashar: [C: 03+1] "Rebased to clear a file "conflict" in Gerrit." 
[puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:38:35] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:38:45] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) With the above patch, plus the following test router config: `lang=diff [edit policy-options] + policy-statement kubestage_test_out { + term stage... [15:38:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [15:42:15] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [15:51:38] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5019.eqsin.wmnet with OS bullseye [15:51:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye executed with errors: - cp5019 (**FAIL**) - Downtimed on Icinga/Alertmanager 
-... [15:51:49] (03CR) 10Ottomata: "I'd like to avoid adding new .Values.app. variables if we don't need to. Almost all of the settings you have now can be provided in " [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [15:53:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:52] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-toolforge-tests: refresh some definition [puppet] - 10https://gerrit.wikimedia.org/r/885834 [15:53:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS bullseye [15:54:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye [15:55:10] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 (owner: 10Giuseppe Lavagetto) [15:56:21] (03CR) 10Ottomata: flink-app: add preliminary H/A support (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [15:57:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-toolforge-tests: refresh some definition [puppet] - 10https://gerrit.wikimedia.org/r/885834 (owner: 10Arturo Borrero Gonzalez) [15:58:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:08] (03Merged) 10jenkins-bot: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 (owner: 10Giuseppe Lavagetto) [16:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:13:49] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:14:28] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:14:46] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:15:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:30] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) @Volans and I were discussing this on IRC today, some more observations with cp5019, that failed the first attempt but worked on the second.... 
[16:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:23:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:25:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS bullseye [16:25:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye [16:25:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:25:54] !log reloaded apache on mailman [16:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49566 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:26:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [16:27:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.925 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [16:30:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/885808 (owner: 10Volans) [16:31:09] (03CR) 10Volans: [C: 03+2] requests: allow to skip the session retry logic [software/pywmflib] - 10https://gerrit.wikimedia.org/r/885808 (owner: 10Volans) [16:31:35] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5030.eqsin.wmnet with OS bullseye [16:31:43] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye executed with errors: - cp5030 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[16:33:04] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3056.esams.wmnet with OS bullseye [16:33:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3056.esams.wmnet with OS bullseye [16:33:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS bullseye [16:34:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye [16:35:33] (03Merged) 10jenkins-bot: requests: allow to skip the session retry logic [software/pywmflib] - 10https://gerrit.wikimedia.org/r/885808 (owner: 10Volans) [16:38:16] (03PS1) 10Marostegui: Revert "mariadb: Disable notifications host in A1/A8 codfw" [puppet] - 10https://gerrit.wikimedia.org/r/885774 [16:38:47] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Disable notifications host in A1/A8 codfw" [puppet] - 10https://gerrit.wikimedia.org/r/885774 (owner: 10Marostegui) [16:39:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43531 and previous config saved to /var/cache/conftool/dbconfig/20230201-163921-root.json [16:39:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43532 and previous config saved to /var/cache/conftool/dbconfig/20230201-163941-root.json [16:39:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43533 and previous config saved to /var/cache/conftool/dbconfig/20230201-163947-root.json [16:39:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 1%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P43534 and previous config saved to /var/cache/conftool/dbconfig/20230201-163955-root.json [16:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43535 and previous config saved to /var/cache/conftool/dbconfig/20230201-164002-root.json [16:40:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43536 and previous config saved to /var/cache/conftool/dbconfig/20230201-164007-root.json [16:42:34] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5030.eqsin.wmnet with OS bullseye [16:42:40] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye executed with errors: - cp5030 (**FAIL**) - Removed from Puppet and PuppetDB if p... [16:42:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS bullseye [16:42:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye [16:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:03] (03CR) 10Clément Goubert: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/885838 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [16:49:04] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:30] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:48] 10SRE, 10SRE-Access-Requests: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10herron) [16:53:02] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:01] 10SRE, 10SRE-Access-Requests: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10herron) @Ottomata @odimitrijevic could you please review/approve this request for groupadd to analytics-privatedata-users? 
[16:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43538 and previous config saved to /var/cache/conftool/dbconfig/20230201-165426-root.json [16:54:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3057.esams.wmnet with OS bullseye [16:54:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3057.esams.wmnet with OS bullseye [16:54:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43539 and previous config saved to /var/cache/conftool/dbconfig/20230201-165446-root.json [16:54:48] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3056.esams.wmnet with reason: host reimage [16:54:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43540 and previous config saved to /var/cache/conftool/dbconfig/20230201-165452-root.json [16:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43541 and previous config saved to /var/cache/conftool/dbconfig/20230201-165500-root.json [16:55:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43542 and previous config saved to /var/cache/conftool/dbconfig/20230201-165506-root.json [16:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43543 and previous config saved to /var/cache/conftool/dbconfig/20230201-165512-root.json [16:55:39] (03CR) 10Jbond: redfish: add upload/update methods (031 comment) 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [16:55:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:56:20] (03PS2) 10JHathaway: Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) [16:57:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3056.esams.wmnet with reason: host reimage [16:57:31] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5030.eqsin.wmnet with OS bullseye [16:57:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye executed with errors: - cp5030 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[16:57:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS bullseye [16:58:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye [16:59:34] (03CR) 10JHathaway: "thanks for the review, rev2 pushed" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [17:00:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:04:13] (03PS1) 10Nray: Enable client preferences for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885841 (https://phabricator.wikimedia.org/T327979) [17:05:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:06:06] (03CR) 10Ottomata: "Okay, to deal with the secret, it looks like using k8s Secret is gonna be hard, cuz we'd have to render it into a separate config file, or" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [17:06:15] (03PS1) 10Herron: admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) [17:09:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43544 and previous config saved to /var/cache/conftool/dbconfig/20230201-170931-root.json [17:09:52] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43545 and previous config saved to /var/cache/conftool/dbconfig/20230201-170951-root.json [17:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43546 and previous config saved to /var/cache/conftool/dbconfig/20230201-170957-root.json [17:10:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43547 and previous config saved to /var/cache/conftool/dbconfig/20230201-171005-root.json [17:10:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43548 and previous config saved to /var/cache/conftool/dbconfig/20230201-171011-root.json [17:10:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43549 and previous config saved to /var/cache/conftool/dbconfig/20230201-171016-root.json [17:10:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:12:05] (03PS5) 10Clément Goubert: sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [17:12:19] (03PS6) 10Clément Goubert: sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [17:15:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3057.esams.wmnet with reason: host reimage [17:17:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host cp5019.eqsin.wmnet with OS bullseye [17:17:15] (03PS1) 10BCornwall: idp: Set cloud TLS/SSL compatiblility to strong [puppet] - 10https://gerrit.wikimedia.org/r/885844 (https://phabricator.wikimedia.org/T238518) [17:17:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye completed: - cp5019 (**WARN**) - Removed from Puppet and PuppetDB if present -... [17:17:28] (03CR) 10Ottomata: flink-app: add preliminary H/A support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [17:18:19] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix slowlog rsyslog message [deployment-charts] - 10https://gerrit.wikimedia.org/r/885838 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [17:19:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3057.esams.wmnet with reason: host reimage [17:22:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3056.esams.wmnet with OS bullseye [17:22:30] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3056.esams.wmnet with OS bullseye completed: - cp3056 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[17:22:37] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3056.esams.wmnet [17:23:30] (03Merged) 10jenkins-bot: mediawiki: Fix slowlog rsyslog message [deployment-charts] - 10https://gerrit.wikimedia.org/r/885838 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [17:23:41] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:23:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3058.esams.wmnet with OS bullseye [17:23:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3058.esams.wmnet with OS bullseye [17:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43550 and previous config saved to /var/cache/conftool/dbconfig/20230201-172436-root.json [17:24:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43551 and previous config saved to /var/cache/conftool/dbconfig/20230201-172456-root.json [17:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43552 and previous config saved to /var/cache/conftool/dbconfig/20230201-172502-root.json [17:25:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43553 and previous config saved to /var/cache/conftool/dbconfig/20230201-172510-root.json [17:25:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43554 and previous config saved to /var/cache/conftool/dbconfig/20230201-172516-root.json [17:25:22] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43555 and previous config saved to /var/cache/conftool/dbconfig/20230201-172521-root.json [17:31:46] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [17:37:47] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) [17:39:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [17:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43557 and previous config saved to /var/cache/conftool/dbconfig/20230201-173941-root.json [17:40:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43558 and previous config saved to /var/cache/conftool/dbconfig/20230201-174001-root.json [17:40:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43559 and previous config saved to /var/cache/conftool/dbconfig/20230201-174007-root.json [17:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43560 and previous config saved to /var/cache/conftool/dbconfig/20230201-174015-root.json [17:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling', diff 
saved to https://phabricator.wikimedia.org/P43561 and previous config saved to /var/cache/conftool/dbconfig/20230201-174021-root.json [17:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43562 and previous config saved to /var/cache/conftool/dbconfig/20230201-174026-root.json [17:40:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3057.esams.wmnet with OS bullseye [17:40:39] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3057.esams.wmnet with OS bullseye completed: - cp3057 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [17:40:57] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) a:03Trizek-WMF [17:40:58] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3057.esams.wmnet [17:41:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:41:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3059.esams.wmnet with OS bullseye [17:41:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3059.esams.wmnet with OS bullseye [17:45:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3058.esams.wmnet with reason: host reimage [17:48:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3058.esams.wmnet with reason: host reimage [17:52:34] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team 
(Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (10hashar) [17:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43563 and previous config saved to /var/cache/conftool/dbconfig/20230201-175446-root.json [17:55:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43564 and previous config saved to /var/cache/conftool/dbconfig/20230201-175506-root.json [17:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43565 and previous config saved to /var/cache/conftool/dbconfig/20230201-175511-root.json [17:55:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43566 and previous config saved to /var/cache/conftool/dbconfig/20230201-175519-root.json [17:55:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43567 and previous config saved to /var/cache/conftool/dbconfig/20230201-175526-root.json [17:55:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43568 and previous config saved to /var/cache/conftool/dbconfig/20230201-175531-root.json [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1800) [18:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:30] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 
Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [18:01:50] (03CR) 10Cwhite: [C: 03+1] "Looks like an improvement, but doesn't solve the problem that the opensearch user cannot invoke systemd-tmpfiles." [puppet] - 10https://gerrit.wikimedia.org/r/885373 (owner: 10Filippo Giunchedi) [18:02:41] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [18:02:55] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) @Clement_Goubert Has anything major changed in your process since the last time (noticeable things that... [18:03:00] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/885371 (owner: 10Filippo Giunchedi) [18:03:47] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3059.esams.wmnet with reason: host reimage [18:04:34] (03CR) 10Cwhite: [C: 03+1] "Agreed, it should also be applied to the ES units." 
[puppet] - 10https://gerrit.wikimedia.org/r/885372 (owner: 10Filippo Giunchedi) [18:04:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:42] (03PS18) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [18:06:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3059.esams.wmnet with reason: host reimage [18:10:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43569 and previous config saved to /var/cache/conftool/dbconfig/20230201-181011-root.json [18:10:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43570 and previous config saved to /var/cache/conftool/dbconfig/20230201-181016-root.json [18:10:22] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [18:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43571 and previous config saved to /var/cache/conftool/dbconfig/20230201-181024-root.json [18:10:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43572 and previous config saved to /var/cache/conftool/dbconfig/20230201-181031-root.json [18:10:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43573 and previous config saved to /var/cache/conftool/dbconfig/20230201-181036-root.json [18:10:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host cp5030.eqsin.wmnet with OS bullseye [18:11:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS bullseye completed: - cp5030 (**PASS**) - Removed from Puppet and PuppetDB if present -... [18:11:35] PROBLEM - Host cp1075 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3058.esams.wmnet with OS bullseye [18:12:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3058.esams.wmnet with OS bullseye completed: - cp3058 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [18:13:14] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3058.esams.wmnet [18:13:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:13:51] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3060.esams.wmnet with OS bullseye [18:13:58] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3060.esams.wmnet with OS bullseye [18:14:38] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) p:05Triage→03High [18:19:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet,service=cdn [18:19:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet,service=ats-be [18:19:30] !log sukhe@puppetmaster1001 
conftool action : set/pooled=yes; selector: name=cp5030.eqsin.wmnet,service=cdn [18:19:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5030.eqsin.wmnet,service=ats-be [18:20:07] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:20:11] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:20:22] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [18:20:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp1075.eqiad.wmnet with reason: downtimed for idrac firmware testing [18:20:45] RECOVERY - Host cp1075 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:20:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp1075.eqiad.wmnet with reason: downtimed for idrac firmware testing [18:21:22] ^ cp1075 was already depooled but additionally downtimed now [18:22:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS bullseye [18:22:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye [18:29:11] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5021.eqsin.wmnet with OS bullseye [18:29:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye executed with errors: - cp5021 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[18:29:23] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster2003.codfw.wmnet [18:29:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS bullseye [18:29:31] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye [18:30:18] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) As you gave 3 dates in the task description, can you confirm precisely **when** the wikis will be in a... [18:31:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3059.esams.wmnet with OS bullseye [18:31:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3059.esams.wmnet with OS bullseye completed: - cp3059 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[18:31:30] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3059.esams.wmnet [18:31:51] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:32:06] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3061.esams.wmnet with OS bullseye [18:32:13] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3061.esams.wmnet with OS bullseye [18:35:14] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3060.esams.wmnet with reason: host reimage [18:37:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS bullseye [18:37:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5031.eqsin.wmnet with OS bullseye [18:38:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3060.esams.wmnet with reason: host reimage [18:39:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetmaster2003.codfw.wmnet [18:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:40:17] (03CR) 10Dzahn: [C: 03+2] phabricator: ensure phd uid/gid can not be changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [18:42:05] (03PS1) 10Sbailey: Enable Linter write namespace, tag and template for all wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/885852 (https://phabricator.wikimedia.org/T299612) [18:44:14] (03CR) 10Cwhite: "One last item, then LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [18:44:38] (03CR) 10Sbailey: "Final release config that enables linter write of namespace, tag and template fields to all all wikis, group 0, and group 1 have already b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885852 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [18:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:46:47] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5031.eqsin.wmnet with OS bullseye [18:46:53] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5031.eqsin.wmnet with OS bullseye executed with errors: - cp5031 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[18:47:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS bullseye [18:47:14] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5031.eqsin.wmnet with OS bullseye [18:52:14] (03PS1) 10EoghanGaffney: Rotate aphlict logs either daily, or when they reach 1G [puppet] - 10https://gerrit.wikimedia.org/r/885858 (https://phabricator.wikimedia.org/T325246) [18:52:52] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3061.esams.wmnet with reason: host reimage [18:54:46] (03PS4) 10Southparkfan: rsyslog: allow subject name validation [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) [18:55:26] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5031.eqsin.wmnet with OS bullseye [18:55:31] (03CR) 10Southparkfan: rsyslog: allow subject name validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [18:55:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5031.eqsin.wmnet with OS bullseye executed with errors: - cp5031 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[18:55:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS bullseye [18:55:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5031.eqsin.wmnet with OS bullseye [18:56:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3061.esams.wmnet with reason: host reimage [19:00:05] dancy and brennen: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1900). Please do the needful. [19:00:05] dancy and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T1900). [19:01:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [19:02:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3060.esams.wmnet with OS bullseye [19:02:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3060.esams.wmnet with OS bullseye completed: - cp3060 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[19:02:39] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3060.esams.wmnet [19:03:14] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:03:23] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3062.esams.wmnet with OS bullseye [19:03:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3062.esams.wmnet with OS bullseye [19:04:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [19:05:53] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10jbond) p:05Triage→03Medium [19:06:55] (03CR) 10Dzahn: [C: 03+1] "looks good to me. thank you. here is the result of compiling this in the puppet compiler on host aphlict1001. 
note this is a dedicated VM " [puppet] - 10https://gerrit.wikimedia.org/r/885858 (https://phabricator.wikimedia.org/T325246) (owner: 10EoghanGaffney) [19:07:15] I'll roll the train in about 5 minutes [19:11:47] (03PS1) 10Majavah: kubernetes: Fix restart for some tools with default resources [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885859 (https://phabricator.wikimedia.org/T328589) [19:11:50] (03PS1) 10Majavah: d/changelog: Prepare for 0.90 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885860 [19:12:08] (03CR) 10Majavah: [C: 03+2] kubernetes: Fix restart for some tools with default resources [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885859 (https://phabricator.wikimedia.org/T328589) (owner: 10Majavah) [19:12:14] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.90 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885860 (owner: 10Majavah) [19:13:00] (03Merged) 10jenkins-bot: kubernetes: Fix restart for some tools with default resources [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885859 (https://phabricator.wikimedia.org/T328589) (owner: 10Majavah) [19:13:28] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.90 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885860 (owner: 10Majavah) [19:14:25] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10jbond) @ssingh i have finished with cp1075, i have upgraded it to the most recent network, bios and idrac version. in relation to other servers that you may have issues with i have noticed that any machine with... 
[19:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3061.esams.wmnet with OS bullseye [19:17:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3061.esams.wmnet with OS bullseye completed: - cp3061 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [19:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:02] o/ [19:20:07] Ready to press the button [19:21:16] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885862 (https://phabricator.wikimedia.org/T325584) [19:21:18] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885862 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:21:55] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885862 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:24:10] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3062.esams.wmnet with reason: host reimage [19:24:25] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3061.esams.wmnet [19:24:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:24:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS bullseye [19:25:05] 10SRE, 10Traffic: Upgrade Traffic hosts to 
bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3063.esams.wmnet with OS bullseye [19:25:09] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) [19:26:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3062.esams.wmnet with reason: host reimage [19:27:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage [19:27:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) >>! In T321309#8579719, @jbond wrote: > @ssingh i have finished with cp1075, i have upgraded it to the most recent network, bios and idrac version. in relation to other servers that you may have issues w... [19:27:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:28:08] (03PS1) 10BCornwall: ssl_ciphersuite: Clean up old support checks [puppet] - 10https://gerrit.wikimedia.org/r/885865 [19:29:33] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.21 refs T325584 [19:29:36] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [19:30:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage [19:30:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:17] (03Abandoned) 10BCornwall: ssl_ciphersuite: Clean up old support checks [puppet] - 
10https://gerrit.wikimedia.org/r/885865 (owner: 10BCornwall) [19:32:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:32:56] New errors.. Rolling back. [19:33:00] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) [19:33:07] (03PS1) 10Majavah: kubernetes: fix null handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885866 (https://phabricator.wikimedia.org/T328589) [19:33:12] !log dancy@deploy1002 sync-file aborted: group1 wikis to 1.40.0-wmf.21 refs T325584 (duration: 03m 38s) [19:33:12] !log dancy@deploy1002 deploy-promote aborted: (duration: 11m 58s) [19:33:13] (03PS1) 10Majavah: d/changelog: Prepare for 0.91 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885867 [19:33:17] (03CR) 10Majavah: [C: 03+2] kubernetes: fix null handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885866 (https://phabricator.wikimedia.org/T328589) (owner: 10Majavah) [19:33:22] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.91 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885867 (owner: 10Majavah) [19:33:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:49] (03CR) 10CI reject: [V: 04-1] kubernetes: fix null handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885866 (https://phabricator.wikimedia.org/T328589) (owner: 10Majavah) [19:33:58] (03CR) 10CI reject: [V: 04-1] d/changelog: Prepare for 0.91 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885867 (owner: 10Majavah) [19:34:31] (03PS1) 10TrainBranchBot: group1 
wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885868 (https://phabricator.wikimedia.org/T325584) [19:34:33] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885868 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:35:18] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885868 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:35:20] (03PS2) 10Majavah: kubernetes: fix null handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885866 (https://phabricator.wikimedia.org/T328589) [19:35:26] (03PS2) 10Majavah: d/changelog: Prepare for 0.91 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885867 [19:35:32] (03CR) 10Majavah: [C: 03+2] kubernetes: fix null handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885866 (https://phabricator.wikimedia.org/T328589) (owner: 10Majavah) [19:35:42] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.91 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885867 (owner: 10Majavah) [19:36:13] (03Merged) 10jenkins-bot: kubernetes: fix null handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885866 (https://phabricator.wikimedia.org/T328589) (owner: 10Majavah) [19:36:46] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.91 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/885867 (owner: 10Majavah) [19:37:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5021.eqsin.wmnet with OS bullseye [19:37:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye completed: - cp5021 (**PASS**) - Removed from Puppet 
and PuppetDB if present -... [19:38:17] (03Abandoned) 10BCornwall: ssl_ciphersuite: remove pre-jessie compat [puppet] - 10https://gerrit.wikimedia.org/r/449746 (owner: 10BBlack) [19:39:54] (03PS3) 10BCornwall: wmflib::ssl_ciphersuites: drop suppport for anything less then jessie [puppet] - 10https://gerrit.wikimedia.org/r/640467 (owner: 10Jbond) [19:41:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=cdn [19:41:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=ats-be [19:42:24] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:42:46] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.20 refs T325584 [19:42:49] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [19:44:25] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:44:40] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:45:38] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3063.esams.wmnet with reason: host reimage [19:48:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3063.esams.wmnet with reason: host reimage [19:48:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3062.esams.wmnet with OS bullseye [19:48:53] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3062.esams.wmnet 
with OS bullseye completed: - cp3062 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [19:49:18] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet [19:49:22] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.20 refs T325584 (duration: 06m 36s) [19:49:25] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:49:25] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [19:49:52] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3064.esams.wmnet with OS bullseye [19:49:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:49:58] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3064.esams.wmnet with OS bullseye [19:53:17] !log The train is blocked on T328601 [19:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:20] T328601: TypeError: Argument 2 passed to Parser::parse() must implement interface MediaWiki\Page\PageReference, null given, called in /srv/mediawiki/php-1.40.0-wmf.21/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbedd - https://phabricator.wikimedia.org/T328601 [20:00:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5031.eqsin.wmnet with OS bullseye [20:00:20] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5031.eqsin.wmnet with OS bullseye completed: - cp5031 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[20:03:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5031.eqsin.wmnet,service=cdn [20:03:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5031.eqsin.wmnet,service=ats-be [20:03:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [20:03:57] (03CR) 10JHathaway: [C: 03+1] "looks good to me as well" [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [20:06:19] (03Abandoned) 10Nray: Enable client preferences for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885841 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [20:07:05] (03PS1) 10Jbond: redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 [20:08:35] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3064.esams.wmnet with reason: host reimage [20:09:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3063.esams.wmnet with OS bullseye [20:09:53] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3063.esams.wmnet with OS bullseye completed: - cp3063 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [20:10:32] (03CR) 10CI reject: [V: 04-1] redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 (owner: 10Jbond) [20:11:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3064.esams.wmnet with reason: host reimage [20:13:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10jbond) >>! In T321309#8579793, @ssingh wrote: >>>! 
In T321309#8579719, @jbond wrote: >> @ssingh i have finished with cp1075, i have upgraded it to the most recent network, bios and idrac version. in relation to... [20:21:37] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3063.esams.wmnet [20:21:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:22:05] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3065.esams.wmnet with OS bullseye [20:22:11] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3065.esams.wmnet with OS bullseye [20:33:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3064.esams.wmnet with OS bullseye [20:34:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3064.esams.wmnet with OS bullseye completed: - cp3064 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[20:38:19] !log eevans@puppetmaster1001 conftool action : get/pooled; selector: dnsdisc=$SERVICE,name=$DC [20:39:08] (03Restored) 10Nray: Enable client preferences for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885841 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [20:40:02] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3064.esams.wmnet [20:40:24] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:42:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3065.esams.wmnet with reason: host reimage [20:43:40] !log depooling sessionstore —codfw— in preparation for Cassandra restarts — T327675 [20:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:43] T327675: Replace expiring Cassandra SSL certificates (sessionstore cluster) - https://phabricator.wikimedia.org/T327675 [20:44:02] !log eevans@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=codfw [20:45:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3065.esams.wmnet with reason: host reimage [20:48:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:16] (03PS1) 10Ottomata: 
Finalize mediawiki/page/change schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230201T2100). [21:00:05] Dreamy_Jazz, arlolra, sbailey, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] \o [21:00:36] Shannon present [21:02:10] I can deploy. [21:02:18] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore200*: Applying new TLS certificates — T327675 - eevans@cumin1001 [21:02:22] T327675: Replace expiring Cassandra SSL certificates (sessionstore cluster) - https://phabricator.wikimedia.org/T327675 [21:02:31] Dreamy_Jazz: we'll start with yours [21:02:39] My one will need modification as group 1 is not yet on wmf.21 [21:02:46] Let me do that quickly. 
[21:02:51] here [21:03:01] !log start UTC late backport deployment window [21:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:14] (03PS2) 10Dreamy Jazz: Disable write old for CheckUserLog reason on group 0 and group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) [21:03:47] RECOVERY - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is OK: SSL OK - Certificate sessionstore2001-a valid until 2025-01-31 20:52:59 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:04:08] (03PS3) 10Dreamy Jazz: Disable write old for CheckUserLog reason on group 0 and group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) [21:04:25] (03PS4) 10Dreamy Jazz: Disable write old for CheckUserLog reason on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) [21:05:02] Ready to proceed with my one if someone with checkuser rights on group 0 can test for me [21:05:04] (03CR) 10BCornwall: [C: 03+1] "Looks like this would affect PuppetDB, but these seem like reasonable changes to me:" [puppet] - 10https://gerrit.wikimedia.org/r/640467 (owner: 10Jbond) [21:05:57] I do not have checkuser rights. [21:06:07] Urbanecm: + TheresNoTime: [21:07:12] They both have checkuser rights on group 0 wikis through having steward rights [21:07:26] Feel free to jump over me in the queue if there isn't a quick reply [21:07:45] ack, looking.. [21:08:48] Any recommended wiki to create a test page on with lint errors in group 2? 
[21:09:05] Test steps are: [21:09:05] * Have a user with checkuser rights make a testing check on a group 0 wiki using a non-empty reason [21:09:05] * Have someone with database access check that the row generated has cul_reason as the empty string [21:09:29] *cul_reason in the cu_log table [21:10:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3065.esams.wmnet with OS bullseye [21:10:18] TheresNoTime: I'm comfortable deploying and doing the database query if you can do the checkuser [21:10:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3065.esams.wmnet with OS bullseye completed: - cp3065 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [21:11:17] kindrobot: I'd suggest moving onto another config patch for a moment - just reviewing where that sits policy-wise [21:11:50] (tl;dr I can via my steward rights, the question is if I'm allowed to) [21:12:28] Ah OK. [21:12:49] CScott answered my question, I can create a page on enwiki, did not think I had perms [21:12:50] If it's any help, I can give you consent to check my user. [21:12:58] But I'll go to the next one for now. [21:13:40] Is arlolra present for their patch? [21:13:47] yes [21:13:55] chatting with him on slack [21:13:56] yup [21:14:00] :-) [21:14:06] If mine can't be tested, then I'll instead bundle group 0 and group 1 into disabling write old everywhere on Thursday evening and test using my enwiki CU rights. There should be little risk in not first going group 0 and 1, and it's already on testwiki. Thanks for looking into it TNT. [21:14:41] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp3065.esams.wmnet [21:14:56] arlolra: could you solve the merge conflict on your patch? 
[21:14:57] RECOVERY - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is OK: SSL OK - Certificate sessionstore2003-a valid until 2025-01-31 20:53:04 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:15:07] RECOVERY - cassandra-a SSL 10.192.32.101:7001 on sessionstore2002 is OK: SSL OK - Certificate sessionstore2002-a valid until 2025-01-31 20:53:01 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:15:20] kindrobot: sure, does it not rebase cleanly? one sec [21:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:49] (03PS4) 10Stef Dunlap: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [21:16:03] (03PS2) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) [21:16:11] (03PS5) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) [21:16:16] Oh, it did, never mind. A UI change threw me off. 
;) [21:16:42] np, done anyways [21:16:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [21:16:52] (03PS3) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) [21:17:28] kindrobot: do you know if the explicit metawiki value is going to override the group1 setting in that patch [21:17:50] (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [21:17:51] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:17:58] (03CR) 10Cwhite: [C: 03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/876248/39360/" [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [21:18:16] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:865214|Disable wgParserEnableLegacyMediaDOM on group1 wikis (T314318)]] [21:18:20] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [21:19:40] I don't know off the top of my head. It depends how the flag being read is implemented. 
[21:19:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore200*: Applying new TLS certificates — T327675 - eevans@cumin1001 [21:19:51] T327675: Replace expiring Cassandra SSL certificates (sessionstore cluster) - https://phabricator.wikimedia.org/T327675 [21:20:08] !log kindrobot@deploy1002 arlolra and kindrobot: Backport for [[gerrit:865214|Disable wgParserEnableLegacyMediaDOM on group1 wikis (T314318)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:20:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:49] arlolra: can you confirm? [21:21:18] testing [21:24:54] Dreamy_Jazz: as testwiki is a group0, would that be okay? [21:24:56] !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release [21:25:12] TheresNoTime: testwiki already has the change deployed [21:25:17] However test2wiki doesn't [21:25:35] kindrobot: ok, hewiki, which is group1 now has the change, and metawiki looks like it does not. so, seems good to proceed [21:25:40] Or other testwikis [21:25:56] Dreamy_Jazz: I can do test2wiki :) [21:26:00] !log eevans@puppetmaster1001 conftool action : get/pooled=true; selector: dnsdisc=sessionstore,name=codfw [21:26:04] (which is group1 iirc) [21:26:08] !log eevans@puppetmaster1001 conftool action : get/pooled=true; selector: dnsdisc=sessionstore,name=codfw [21:26:18] OK, thank you for checking both arlolra, syncing now. [21:26:25] !log eevans@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw [21:27:11] So Dreamy_Jazz you'll be ready after this sync then? 
[21:27:22] Dreamy_Jazz: if doing the check on test2wiki (group1) is okay for your tests, I can do that [21:27:31] Hmm. I thought it was group0 [21:27:48] group1 isn't on wmf.21 which is needed for this to work [21:28:46] test.wikidata.org should work? [21:28:50] Dreamy_Jazz: testwikidatawiki? (group0) [21:28:56] ah yeah, great minds etc. [21:29:00] Yeah. That should work. [21:29:21] Okay, can do that then [21:32:13] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:865214|Disable wgParserEnableLegacyMediaDOM on group1 wikis (T314318)]] (duration: 13m 56s) [21:32:17] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [21:32:54] You should be live arlolra [21:33:04] Ready Dreamy_Jazz ? [21:33:12] kindrobot: many thanks [21:33:17] Yup as long as TNT is ready [21:33:21] ready [21:35:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:35:55] (03PS5) 10Stef Dunlap: Disable write old for CheckUserLog reason on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:36:03] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:36:48] (03Merged) 10jenkins-bot: Disable write old for CheckUserLog reason on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:37:12] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:885358|Disable write old for CheckUserLog reason on group 0 (T233004)]] [21:37:15] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:39:08] !log 
kindrobot@deploy1002 dreamyjazz and kindrobot: Backport for [[gerrit:885358|Disable write old for CheckUserLog reason on group 0 (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:39:39] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore100*: Applying new TLS certificates — T327675 - eevans@cumin1001 [21:39:40] Dreamy_Jazz, TheresNoTime ready to confirm. [21:39:42] T327675: Replace expiring Cassandra SSL certificates (sessionstore cluster) - https://phabricator.wikimedia.org/T327675 [21:39:43] Okay, would you like me to switch to a mwdebug and make a self-check per the test steps you listed above? [21:40:03] Sure [21:40:28] Dreamy_Jazz: could you give me the exact SQL query to run? [21:41:15] RECOVERY - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is OK: SSL OK - Certificate sessionstore1001-a valid until 2025-01-31 21:35:51 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:41:17] test check done [21:41:41] Working on a SQL query [21:42:05] TheresNoTime: did you learn anything new about yourself? ;) [21:42:31] It seems I have not been socking on test.wikidata :D [21:43:07] Good for you. :) [21:43:29] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5022.eqsin.wmnet with OS bullseye [21:43:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bullseye [21:44:21] SELECT cul_reason FROM `cu_log` JOIN `actor` `cu_log_actor` ON ((actor_id = cul_actor)) WHERE actor_name = 'TheresNoTime' ORDER BY cul_timestamp DESC LIMIT 1 [21:44:42] On testwikidatatest, right? [21:44:45] Yes [21:44:59] It should work [21:45:21] cul_reason is an empty string [21:45:32] Good.
That is as expected. Test complete. [21:45:51] Great. Syncing. :) [21:45:51] (as write old is turned off in this change) [21:46:55] RECOVERY - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is OK: SSL OK - Certificate sessionstore1002-a valid until 2025-01-31 21:35:54 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:52:05] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:885358|Disable write old for CheckUserLog reason on group 0 (T233004)]] (duration: 14m 53s) [21:52:09] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:52:31] Dreamy_Jazz: you should be live :) [21:52:37] Thanks! [21:53:02] sbailey: you ready? [21:53:09] yes [21:53:36] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Security Release [21:53:45] RECOVERY - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is OK: SSL OK - Certificate sessionstore1003-a valid until 2025-01-31 21:35:56 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:54:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885852 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:54:50] (03PS2) 10Stef Dunlap: Enable Linter write namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885852 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:54:59] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885852 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:55:00] kindrobot: is it okay to overrun this window? [21:55:17] Yeah, I'm fine if you're ok. [21:55:26] Great! thanks! 
:) [21:55:42] (03Merged) 10jenkins-bot: Enable Linter write namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885852 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:56:08] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:885852|Enable Linter write namespace, tag and template for all wikis (T299612)]] [21:56:11] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:57:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore100*: Applying new TLS certificates — T327675 - eevans@cumin1001 [21:57:48] T327675: Replace expiring Cassandra SSL certificates (sessionstore cluster) - https://phabricator.wikimedia.org/T327675 [21:57:56] !log kindrobot@deploy1002 kindrobot and sbailey: Backport for [[gerrit:885852|Enable Linter write namespace, tag and template for all wikis (T299612)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:58:18] sbailey: can you confirm? [21:58:36] testing [21:59:41] Linter errors are run as a job, so I need enwiki to reparse the page and generate linter errors [22:00:07] Is this change propagated to enwiki, or does that require a sync? [22:01:49] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf for xihua - https://phabricator.wikimedia.org/T328607 (10BTullis) [22:02:00] It's on enwiki if you use the toolbar [22:02:36] Oh, wait, I see. I'm not sure if the job runners are synced yet. [22:03:00] yes, using that, not seeing the changes yet in enwiki through Quarry sql query. [22:03:00] Yes that needs to be synced [22:03:46] TheresNoTime: do you know if there's any way to test a job pre-sync? [22:05:43] kindrobot: I'm not sure, sorry - can this job be manually run from a mwdebug server?
[22:05:50] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf for xihua - https://phabricator.wikimedia.org/T328607 (10BTullis) Note that this request is follow-up of {T325857} where it was discovered that @HXi-WMF required LDAP access in order to carry out with with Jupyter. I will make the required change to the LDA... [22:06:53] I don't think it can be run from mwdebug, it is triggered by parsoid being run, which then queues the job to run. It may require a full sync for this to be testable :-( [22:07:42] ah this is a config patch to (dis|en)able the job for a group? [22:08:04] But it has been tested on group 0 and 1, and there is no drift on any database it runs against, so should be safe to enable for all at this point. [22:08:23] I guess sync and watch the logs closely! :) [22:08:24] OK, we'll sync it and test it, and revert it if need-be. [22:08:31] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf for xihua - https://phabricator.wikimedia.org/T328607 (10BTullis) I have added Hua to the `wmf` group. ` btullis@seaborgium:~$ ldapsearch -x member=uid=xihua,ou=people,dc=wikimedia,dc=org dn # extended LDIF # # LDAPv3 # base (default)... [22:08:56] Syncing now... [22:09:12] yes the issue was last time 2 databases had not been updated with three new columns, but otherwise worked fine on the rest before it was reverted to allow the databases to be updated [22:10:27] those database accesses caused a failure, so we need to see if there are any issues (such as some straggling database not being updated, having a fit when we try to write to fields that aren't there) [22:11:16] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf for xihua - https://phabricator.wikimedia.org/T328607 (10BTullis) 05Open→03Resolved a:03BTullis Added to the WMF-NDA group in phabricator. {F36578092,width=70%} [22:12:35] Test on enwiki is passing, new errors being reported.
So if no databases are failing on write to linter we are all good, probably should be kept an eye on though for a few minutes. [22:13:38] No new errors, we're about 75% synced. [22:14:22] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:885852|Enable Linter write namespace, tag and template for all wikis (T299612)]] (duration: 18m 14s) [22:14:26] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [22:15:38] OK, we're live sbailey . :) Next time a heads up on the testing logistics would be appreciated. [22:15:41] (03PS1) 10Krinkle: noc: Improve wiki.php diff by using wikidiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885897 [22:16:23] Jdlrobson: ready? [22:16:24] Ok, will be more specific as to testing needs, still learning [22:16:30] kindrobot: yep! [22:16:50] No worries, sbailey, and thank you. :D [22:16:57] (and I'm aware the train hasn't rolled out yet to group 1 wikis and that's okay) [22:17:58] (03PS2) 10Stef Dunlap: Enable client preferences for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885841 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [22:19:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885841 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [22:19:42] dancy: am a bit confused by the train today. It seems that this morning wmf20 was synced to group 1 rather than wmf21 https://phabricator.wikimedia.org/T325584#8579950 ? [22:20:11] Ah I see it got rolled back (https://phabricator.wikimedia.org/T328601) [22:20:16] that makes sense.
[22:20:31] (03Merged) 10jenkins-bot: Enable client preferences for group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885841 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [22:20:55] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:885841|Enable client preferences for group1 (T327979)]] [22:21:02] T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979 [22:21:34] Jdlrobson: ack. Sorry for the confusion [22:22:07] dancy: i'm just failing to read phab logs today. [22:22:14] Is this likely to delay group 2 until next week? [22:22:37] I'm just wondering since English Wikipedia were promised a feature on Thursday relating to the new Vector 2022 rollout [22:22:47] !log kindrobot@deploy1002 nray and kindrobot: Backport for [[gerrit:885841|Enable client preferences for group1 (T327979)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:22:51] So if there's a risk of that happening I need to let my PM know [22:23:07] I don't think so. There is already a proposed fix in Gerrit [22:23:56] Jdlrobson: there’s a risk of that every week. The patch is +2’d though and going through CI so I’m assuming dancy can roll forward either tonight or tomorrow depending on their working hours. [22:24:24] Jdlrobson: can you confirm this backport? [22:24:27] * RhinosF1 wonders if zabe was watching in here [22:24:29] kindrobot: looking now [22:25:07] I am, I can deploy the fix once kindrobot is done deploying [22:25:22] LGTM please sync kindrobot [22:25:43] Great, thanks. Syncing... [22:25:46] dancy: RhinosF1 that sounds good. Glad to hear we have a fix already [22:26:35] zabe: sounds good [22:27:45] dancy: would train roll again tonight or morning? 
I forget whether we’re on EU or US week [22:29:57] (03PS1) 10Zabe: Stop writing to cuc_comment_id in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885898 (https://phabricator.wikimedia.org/T233004) [22:30:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:33] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:885841|Enable client preferences for group1 (T327979)]] (duration: 10m 37s) [22:31:37] T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979 [22:31:52] ...and we're live Jdlrobson :) [22:32:12] !log close UTC late backport window [22:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:29] Thanks everyone. :) [22:33:22] zabe: I'm done [22:33:28] thanks :) [22:33:34] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_comment_id in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885898 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:34:20] (03PS1) 10Zabe: CachingKartographerEmbeddingHandler: Fall back to Special:BlankPage title [extensions/Wikibase] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885781 (https://phabricator.wikimedia.org/T328601) [22:34:23] (03Merged) 10jenkins-bot: Stop writing to cuc_comment_id in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885898 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:34:26] (03CR) 10Zabe: [C: 03+2] CachingKartographerEmbeddingHandler: Fall back to Special:BlankPage title [extensions/Wikibase] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885781 (https://phabricator.wikimedia.org/T328601) (owner: 10Zabe) [22:35:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:34] RhinosF1: We're on US week but I can roll forward this afternoon if there are no conflicts with other activities [22:36:03] !log zabe@deploy1002 Started scap: Backport for [[gerrit:885898|Stop writing to cuc_comment_id in group0 wikis (T233004)]] [22:36:04] dancy: no deploys after zabe unblocks you for ~8 hours [22:36:06] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:36:16] So zabe ping dancy when done [22:37:53] thanks kindrobot [22:38:00] !log zabe@deploy1002 zabe: Backport for [[gerrit:885898|Stop writing to cuc_comment_id in group0 wikis (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:38:05] Zabe: Is there a reason why cuc_comment_id is not being written to now? [22:38:22] Actually the change seems to be stopping writing to cuc_comment, [22:38:27] Was confused by the commit title [22:38:44] yeah, that commit message is bullshit [22:38:47] sorry [22:39:11] Okay. Just wanted to make sure there wasn't something that meant cul_comment_id needed to go back too. [22:39:19] *cul_reason_id [22:40:27] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5022.eqsin.wmnet with OS bullseye [22:40:35] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bullseye executed with errors: - cp5022 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[22:47:12] !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release [22:49:06] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885898|Stop writing to cuc_comment_id in group0 wikis (T233004)]] (duration: 13m 03s) [22:49:09] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:52:51] (03Merged) 10jenkins-bot: CachingKartographerEmbeddingHandler: Fall back to Special:BlankPage title [extensions/Wikibase] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885781 (https://phabricator.wikimedia.org/T328601) (owner: 10Zabe) [22:53:30] !log zabe@deploy1002 Started scap: Backport for [[gerrit:885781|CachingKartographerEmbeddingHandler: Fall back to Special:BlankPage title (T328601)]] [22:53:34] T328601: TypeError: Argument 2 passed to Parser::parse() must implement interface MediaWiki\Page\PageReference, null given, called in /srv/mediawiki/php-1.40.0-wmf.21/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbedd - https://phabricator.wikimedia.org/T328601 [22:54:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5022.eqsin.wmnet with OS bullseye [22:54:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bullseye [22:55:25] !log zabe@deploy1002 zabe: Backport for [[gerrit:885781|CachingKartographerEmbeddingHandler: Fall back to Special:BlankPage title (T328601)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [23:01:16] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885781|CachingKartographerEmbeddingHandler: Fall back to Special:BlankPage title (T328601)]] (duration: 07m 45s) [23:01:19] T328601: TypeError: Argument 2 passed to Parser::parse() must 
implement interface MediaWiki\Page\PageReference, null given, called in /srv/mediawiki/php-1.40.0-wmf.21/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbedd - https://phabricator.wikimedia.org/T328601 [23:01:30] dancy, train is no longer blocked [23:01:37] Yay! [23:01:42] Rolling forward [23:02:04] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885905 (https://phabricator.wikimedia.org/T325584) [23:02:06] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885905 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [23:03:13] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885905 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [23:10:37] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.21 refs T325584 [23:10:40] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [23:11:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [23:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [23:17:34] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.21 refs T325584 (duration: 06m 57s) [23:17:37] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [23:19:36] !log dzahn@cumin1001 END (FAIL) - Cookbook 
sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: security release [23:20:34] Everything looks good. Thanks Zabe! [23:20:40] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:10] yw [23:25:22] PROBLEM - Host db2181 #page is DOWN: PING CRITICAL - Packet loss = 100% [23:25:24] (03PS1) 10Zabe: Stop writing to cuc_user and cuc_user_text in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885908 (https://phabricator.wikimedia.org/T233004) [23:27:32] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5022.eqsin.wmnet with reason: host reimage [23:29:11] urandom: I'm hands off for now, but here if you need a hand [23:29:37] I guess I should wait with deploying until that host has been depooled? [23:29:55] zabe: probably a good idea [23:30:40] okay the page escalated -- acked it and I'm depooling [23:31:00] hi :) [23:31:07] need any help? [23:31:13] nah, should be all good [23:31:19] ack! [23:31:19] hello [23:31:23] ok :) [23:31:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5022.eqsin.wmnet with reason: host reimage [23:31:28] that works :P [23:31:38] thanks rzl [23:31:41] !log rzl@cumin2002 dbctl commit (dc=all): 'Depool db2181', diff saved to https://phabricator.wikimedia.org/P43574 and previous config saved to /var/cache/conftool/dbconfig/20230201-233140-rzl.json [23:31:56] o/ [23:32:19] urandom: hello! 
I just got done depooling db2181 [23:32:34] (s8 replica) [23:33:03] zabe: give it another few minutes just in case it's more complicated than it looks, then you should be good to go [23:33:29] ok [23:35:31] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_user and cuc_user_text in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885908 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:36:23] (03Merged) 10jenkins-bot: Stop writing to cuc_user and cuc_user_text in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885908 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:37:37] !log zabe@deploy1002 Started scap: Backport for [[gerrit:885908|Stop writing to cuc_user and cuc_user_text in group1 wikis (T233004)]] [23:37:40] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [23:39:07] 10SRE, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10RLazarus) p:05Triage→03High [23:39:28] !log zabe@deploy1002 zabe: Backport for [[gerrit:885908|Stop writing to cuc_user and cuc_user_text in group1 wikis (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [23:45:45] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885908|Stop writing to cuc_user and cuc_user_text in group1 wikis (T233004)]] (duration: 08m 07s) [23:45:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004