[00:02:58] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 22:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671 [00:03:16] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 22:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671 [00:03:18] T351671: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 [00:16:55] (JobUnavailable) resolved: Reduced availability for job jmx_query_service_streaming_updater in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:17:19] (03PS1) 10BCornwall: wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) [00:25:23] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:27:52] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:27:53] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:34:24] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [00:34:38] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [00:38:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/984229 [00:39:01] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/984229 (owner: 10TrainBranchBot) [01:00:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/984229 (owner: 10TrainBranchBot) [01:56:11] RECOVERY - WDQS SPARQL on wdqs1021 is OK: HTTP OK: HTTP/1.1 200 OK 
- 690 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:56:11] RECOVERY - WDQS SPARQL on wdqs1020 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:07:44] (03PS2) 10RLazarus: admin_ng: Split the sidecar-job-controller role into two [deployment-charts] - 10https://gerrit.wikimedia.org/r/983963 (https://phabricator.wikimedia.org/T348284) [02:08:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [02:08:33] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [02:20:18] (03CR) 10RLazarus: [C: 03+2] admin_ng: Split the sidecar-job-controller role into two (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983963 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [02:23:20] (03Merged) 10jenkins-bot: admin_ng: Split the sidecar-job-controller role into two [deployment-charts] - 10https://gerrit.wikimedia.org/r/983963 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [02:36:55] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:58] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [02:39:27] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [02:41:36] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [02:43:02] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [02:43:24] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [02:44:02] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[02:45:33] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [02:47:03] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [03:08:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:11:56] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T353681 (10Papaul) 05Open→03Resolved a:03Papaul [05:07:17] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [05:08:00] (03CR) 10CI reject: [V: 04-1] InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [05:20:56] 10SRE, 10Bitu, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10bd808) >>! In T335478#9417258, @bking wrote: > Time and again we'd see an account with zero logins in the last few years get compromised, and all of a sudden they'd be maxi... [05:24:01] (03CR) 10Anzx: [C: 04-1] InitialiseSettings.php: Allow thanking bots (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster) [06:01:32] (03PS1) 10Marostegui: installserver: Do not format db1243 [puppet] - 10https://gerrit.wikimedia.org/r/984327 [06:04:33] (03CR) 10Marostegui: [C: 03+2] installserver: Do not format db1243 [puppet] - 10https://gerrit.wikimedia.org/r/984327 (owner: 10Marostegui) [06:31:12] !log T351671 Pooled `wdqs10[17-21]*`; data xfers completed and test queries are passing on `wdqs1018`. 
Will decom related hosts tomorrow (2023-12-20) [06:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:17] T351671: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 [06:50:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:58:59] (03PS1) 10DDesouza: research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/984329 (https://phabricator.wikimedia.org/T352583) [06:59:04] (03PS1) 10Kosta Harlan: Check for false from ThumbnailImage::getStoragePath [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984324 (https://phabricator.wikimedia.org/T353758) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T0700) [07:10:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:15:20] Amir1 urbanecm I have a patch to wmf.10 for the upcoming deployment window in 45 minutes that requires no verification (it's for a maintenance script that is only run manually at the moment). The problem is that I need to be afk during the deploy window. Can I sync it now, or, could one of you make sure it is synced during the window, please? 
[07:33:40] (PS5) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[07:46:47] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:48:10] (PS6) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[07:52:23] (PS7) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[08:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T0800).
[08:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:49] hi, per my message above -- I need to go afk. if someone can sync this patch (no verification needed, for maintenance script) I would appreciate it.
[08:08:13] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:28] (SystemdUnitFailed) firing: user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:37] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:03] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:14] (SystemdUnitFailed) firing: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:24:28] (SystemdUnitFailed) firing: (2) user@499.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:21] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:36:57] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:28] (SystemdUnitFailed) firing: (3) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:41:25] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:59] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:31] !log fabfur@cumin1001 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [08:47:59] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:55:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:56:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:56:07] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:56:11] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:56:14] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:39] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:59:28] (SystemdUnitFailed) firing: (2) systemd-timedated.service 
Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:41] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:01:13] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:01:14] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:53] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 203, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:02:03] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 286, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:02:15] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:07:59] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:08:15] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:28] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:13] !log fabfur@cumin1001 START - Cookbook 
sre.hosts.remove-downtime for doh2001.wikimedia.org [09:10:13] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh2001.wikimedia.org [09:11:14] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:37] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:15:27] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:19] (03PS8) 10Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 [09:16:49] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:54] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite::production: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984248 (owner: 10Muehlenhoff) [09:19:28] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:22:35] PROBLEM - cassandra-a service on restbase2029 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:22:37] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:22:53] 
PROBLEM - cassandra-a SSL 10.192.16.240:7000 on restbase2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:22:57] PROBLEM - Check systemd state on restbase2029 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:23:45] PROBLEM - cassandra-a CQL 10.192.16.240:9042 on restbase2029 is CRITICAL: connect to address 10.192.16.240 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:24:15] (03PS1) 10JMeybohm: Bump memory for calico-node on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/984482 [09:25:35] RECOVERY - cassandra-a service on restbase2029 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:25:57] RECOVERY - Check systemd state on restbase2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:07] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:43] RECOVERY - cassandra-a CQL 10.192.16.240:9042 on restbase2029 is OK: TCP OK - 0.030 second response time on 10.192.16.240 port 9042 https://phabricator.wikimedia.org/T93886 [09:27:21] RECOVERY - cassandra-a SSL 10.192.16.240:7000 on restbase2029 is OK: SSL OK - Certificate restbase2029-a valid until 2025-12-05 16:11:10 +0000 (expires in 716 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:27:55] (03CR) 10Volans: Debian packaging configuration (032 comments) [software/debmonitor-client] (debian) - 
10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [09:29:28] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:19] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:39:05] (03PS6) 10JMeybohm: Alert for containers with memory issues [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) [09:39:07] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:39:14] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for doh5002.wikimedia.org [09:39:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh5002.wikimedia.org [09:39:15] (03CR) 10JMeybohm: Alert for containers with memory issues (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [09:43:13] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [09:49:39] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) 05Resolved→03Open krbkdc.logs have increased by a lot (used to be ~ 2.5G uncompressed per days, now at 25G since just 0:00 UTC), reopening [09:49:41] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:50:09] (PS1) Muehlenhoff: Switch KDC log rotation to hourly [puppet] - https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906)
[09:51:45] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906) (owner: Muehlenhoff)
[09:55:12] (CR) Filippo Giunchedi: [C: -1] "I like it! Tested it in Pontoon and found a bunch of things to change (inline)" [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: Herron)
[10:02:33] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:03:33] SRE, Data-Engineering, Infrastructure-Foundations, Patch-For-Review: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (MoritzMuehlenhoff) >>! In T337906#9417847, @MoritzMuehlenhoff wrote: > krbkdc.logs have increased by a lot (used to be ~ 2.5G uncompressed per days...
[10:03:49] (PS7) JMeybohm: Alert for containers with memory issues [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256)
[10:08:14] (CR) JMeybohm: [V: +1 C: +2] kubernetes::master: Drop absent resource [puppet] - https://gerrit.wikimedia.org/r/983360 (owner: JMeybohm)
[10:09:08] (CR) Volans: "question inline, LGTM otherwise" [puppet] - https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906) (owner: Muehlenhoff)
[10:10:48] (PS9) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[10:13:52] (PS10) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[10:14:51] (PS2) Muehlenhoff: Switch KDC log rotation to hourly [puppet] - https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906)
[10:15:36] (CR) Muehlenhoff: Switch KDC log rotation to hourly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906) (owner: Muehlenhoff)
[10:18:04] (PS11) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[10:22:17] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on mw2448.codfw.wmnet with reason: hw failure
[10:22:23] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on mw2448.codfw.wmnet with reason: hw failure
[10:22:30] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=caa5cb6a-6361-4151-9946-cb47394c0eec) set by cgoubert@cumin2002 for 14 days, 0:00:00 on 1 host(s) and their services with reason: hw fai...
[10:24:12] (03CR) 10Filippo Giunchedi: [C: 04-1] pyrra: reload pyrra-filesystem and thanos-rule on cfg change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [10:29:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/984329 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [10:39:09] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [10:44:00] (03CR) 10Filippo Giunchedi: [C: 03+2] oauth2_proxy: skip provider button [puppet] - 10https://gerrit.wikimedia.org/r/984146 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [10:46:50] (03CR) 10Clément Goubert: [C: 03+2] Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [10:47:04] (03PS8) 10JMeybohm: Alert for containers with memory issues [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) [10:51:57] (ProbeDown) firing: (2) Service titan2002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:53:02] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10Data-Platform-SRE (23/24 Q3 Milestone 1), and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel) a:05bking→03None [10:53:16] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10Data-Platform-SRE (23/24 Q3 Milestone 1), and 3 others: Update WDQS Runbook following update lag incident - 
https://phabricator.wikimedia.org/T336577 (10Gehel) [10:53:20] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10ayounsi) @Eevans reviving this years old thread now that Cassandra has been upgraded to 4.x since a few months. Would it be possible to look into not using extra IPs, at least on new/future/re-i... [10:56:51] (03PS1) 10Aqu: [Analytics] Activate metrics for all Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/984510 (https://phabricator.wikimedia.org/T349532) [10:57:29] PROBLEM - cassandra-a service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:57:41] PROBLEM - Check systemd state on restbase2028 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:05] (03PS1) 10Alexandros Kosiaris: jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:58:19] PROBLEM - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.237 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:58:39] PROBLEM - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1100) [11:00:38] (03PS2) 10Hnowlan: thumbor: pin image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/967466 (https://phabricator.wikimedia.org/T348856) [11:01:57] (ProbeDown) firing: (4) Service titan1001:443 has failed probes 
(http_thanos_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:08] erm wat - cassandra-a on restbase2028 got OOMkilled [11:02:15] that's unusual. Happened yesterday too [11:05:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] Alert for containers with memory issues (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [11:06:57] (ProbeDown) firing: (6) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:04] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10JayCano) As @Dreamy_Jazz's manager, I approve this request [11:09:08] (03PS2) 10Aqu: [Analytics] Activate metrics for all Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/984510 (https://phabricator.wikimedia.org/T349532) [11:11:57] (ProbeDown) firing: (8) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:09] (03CR) 10Filippo Giunchedi: "Thank you for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [11:15:21] RECOVERY - cassandra-a service on restbase2028 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:15:33] RECOVERY - Check systemd state on restbase2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:38] the titan probe failures should recover soon [11:16:27] RECOVERY - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-a valid until 2025-12-03 21:32:59 +0000 (expires in 714 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:17:37] RECOVERY - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is OK: TCP OK - 0.030 second response time on 10.192.16.237 port 9042 https://phabricator.wikimedia.org/T93886 [11:30:19] !log T353703 Manual run: /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/mediamoderation.dblist extensions/MediaModeration/maintenance/updateMetrics.php --verbose [11:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:33] T353703: Implement daily runs of updateMetrics on WMF wikis - https://phabricator.wikimedia.org/T353703 [11:34:55] (03PS1) 10AikoChou: ml-services: add a batcher for RRLA with smaller triggering value [deployment-charts] - 10https://gerrit.wikimedia.org/r/984512 (https://phabricator.wikimedia.org/T348536) [12:00:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [12:01:15] (03CR) 10Muehlenhoff: [C: 03+2] Switch KDC log rotation to hourly [puppet] - 10https://gerrit.wikimedia.org/r/984485 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [12:01:45] (03PS1) 10Filippo Giunchedi: oauth2_proxy: update probe definition [puppet] - 
10https://gerrit.wikimedia.org/r/984515 (https://phabricator.wikimedia.org/T331512) [12:02:20] 10SRE-swift-storage: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (10TheDJ) [12:03:50] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/976/console" [puppet] - 10https://gerrit.wikimedia.org/r/984515 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [12:04:44] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] oauth2_proxy: update probe definition [puppet] - 10https://gerrit.wikimedia.org/r/984515 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [12:05:03] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add a batcher for RRLA with smaller triggering value [deployment-charts] - 10https://gerrit.wikimedia.org/r/984512 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [12:06:59] (03CR) 10AikoChou: [C: 03+2] "Thanks! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984512 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [12:07:52] (03Merged) 10jenkins-bot: ml-services: add a batcher for RRLA with smaller triggering value [deployment-charts] - 10https://gerrit.wikimedia.org/r/984512 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [12:08:54] 10SRE-swift-storage, 10Data-Persistence: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (10TheDJ) [12:10:50] (03PS1) 10Muehlenhoff: swift: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984516 [12:11:18] (03CR) 10CI reject: [V: 04-1] swift: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984516 (owner: 10Muehlenhoff) [12:12:38] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:13:16] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/977/con" [puppet] - 10https://gerrit.wikimedia.org/r/984510 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [12:13:45] (03PS2) 10Muehlenhoff: swift: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984516 [12:15:43] (03CR) 10Btullis: [V: 03+1 C: 03+1] "Looks good to me. We will also need a change to the prometheus config to start scraping the metrics, but I can make that patch." [puppet] - 10https://gerrit.wikimedia.org/r/984510 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [12:16:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984516 (owner: 10Muehlenhoff) [12:19:32] (03PS1) 10Btullis: Start scraping airflow metrics from all instances [puppet] - 10https://gerrit.wikimedia.org/r/984520 (https://phabricator.wikimedia.org/T343232) [12:20:59] (03CR) 10Btullis: [V: 03+1 C: 03+1] "I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/984520 which will cause prometheus to start scraping these metrics." 
[puppet] - 10https://gerrit.wikimedia.org/r/984510 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [12:23:16] (03PS1) 10Muehlenhoff: Update pws-trusted-users template file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/984522 (https://phabricator.wikimedia.org/T333212) [12:23:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/978/con" [puppet] - 10https://gerrit.wikimedia.org/r/984520 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [12:26:28] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update pws-trusted-users template file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/984522 (https://phabricator.wikimedia.org/T333212) (owner: 10Muehlenhoff) [12:32:46] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/984252 (owner: 10Muehlenhoff) [12:33:37] (03CR) 10Volans: "Looks ok but I see that the debian-glue job is failing with:" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [12:48:35] (ProbeDown) firing: (8) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:57] (ProbeDown) resolved: (8) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:04:49] (03PS1) 10Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/984559 [13:05:32] 10SRE-swift-storage, 10Data-Persistence, 10media-backups: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (10jcrespo) The first backup for enwiki media is from 2021, and 
back then it already failed to be downloaded from swift, so sadly, it is not on the media backups. Here is... [13:05:47] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/984559 (owner: 10Muehlenhoff) [13:09:23] (03PS1) 10Bartosz Dziewoński: Replace $wgCommandLineMode checks with MW_ENTRY_POINT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984601 (https://phabricator.wikimedia.org/T353751) [13:10:29] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10TheDJ) >>! In T211661#8377883, @Ladsgroup wrote: > The best part: We don't even pre-generate thumbnails for these... [13:10:37] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) @MoritzMuehlenhoff - Thanks for sorting out the more frequent log rotation as a workaround. At the moment, it seems that more than half of the log entries being... 
[13:10:43] (03PS9) 10JMeybohm: Alert for containers with memory issues [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) [13:11:09] (03CR) 10JMeybohm: Alert for containers with memory issues (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [13:21:29] (03CR) 10Slyngshede: Debian packaging configuration (031 comment) [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [13:25:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Check for false from ThumbnailImage::getStoragePath [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984324 (https://phabricator.wikimedia.org/T353758) (owner: 10Kosta Harlan) [13:29:28] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (10ayounsi) Big and needed change, thanks ! Looking at the doc at https://wikitech.wikimedia.o... [13:31:08] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:31:09] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 01s) [13:31:48] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) So with the current hourly compression we're always safely in the realm where compression of the chunks completes and which should prevent the server... 
[13:32:05] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:32:07] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 01s) [13:33:56] 10SRE-swift-storage, 10Data-Persistence, 10media-backups: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (10MatthewVernon) 05Open→03Declined The object doesn't appear in the container listing either (so it's not a "ghost" as we have seen occasionally) (I checked with `swi... [13:34:00] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:34:12] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 11s) [13:34:54] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:34:59] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 05s) [13:35:38] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_product@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:35:47] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_product@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 09s) [13:35:59] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10MoritzMuehlenhoff) [13:36:08] 10SRE, 10Data-Engineering, 
10Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:36:22] !log aqu@deploy2002 Started deploy [airflow-dags/platform_eng@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:36:35] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) 05Open→03Resolved Closing the task since the immediate issue is resolved, I created https://phabricator.wikimedia.org/T353802 as a followup [13:36:48] !log aqu@deploy2002 Finished deploy [airflow-dags/platform_eng@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 25s) [13:37:17] !log aqu@deploy2002 Started deploy [airflow-dags/research@90f280e]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@e2ed6162] [13:37:23] !log aqu@deploy2002 Finished deploy [airflow-dags/research@90f280e]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@e2ed6162] (duration: 00m 06s) [13:37:54] !log aqu@deploy2002 Started deploy [airflow-dags/search@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [13:38:25] !log aqu@deploy2002 Finished deploy [airflow-dags/search@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 30s) [13:38:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "I wonder if this will help with an issue I noticed a few months ago where exceptions from RunSingleJob.php seemingly didn’t get normalized" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [13:38:50] !log aqu@deploy2002 Started deploy [airflow-dags/wmde@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics 
[airflow-dags@d5ac513] [13:38:55] !log aqu@deploy2002 Finished deploy [airflow-dags/wmde@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac513] (duration: 00m 05s) [13:40:23] (03CR) 10Aqu: [C: 03+1] Start scraping airflow metrics from all instances [puppet] - 10https://gerrit.wikimedia.org/r/984520 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [13:41:33] (03CR) 10Btullis: [V: 03+1 C: 03+2] Start scraping airflow metrics from all instances [puppet] - 10https://gerrit.wikimedia.org/r/984520 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [13:43:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks fine to me, but is there a reason to limit this to wiktionary and mw.o, other than that those wikis were mentioned on the task so fa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [13:43:51] (03CR) 10Btullis: [V: 03+1 C: 03+2] [Analytics] Activate metrics for all Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/984510 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:49:17] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on 
an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:50:27] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:50:27] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:50:51] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:54:23] (03CR) 10Pols12: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1400). 
[14:00:05] Pols12, Dreamy_Jazz, WMDE-Mell, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:26] hi [14:00:32] \o [14:00:41] I can deploy :) [14:00:44] :) [14:00:58] I'm standing in for Mell for the moment [14:00:58] Pols12: hi! are you ready for the deployment? [14:01:00] 10SRE-swift-storage, 10Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon So I tried to reproduce this just now with the PDF... [14:01:04] WMDE-Fisch: ok [14:01:18] Hi! Yes, I am. [14:01:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [14:01:40] ok! [14:01:44] (03PS3) 10Lucas Werkmeister (WMDE): Make wiktionary and mw.org provide og:site_name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [14:01:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [14:02:15] Pols12: are you familiar with how to test changes on mwdebug? 
[14:02:17] Hello, I am here now :) [14:02:24] hi ^^ [14:02:29] (03PS1) 10Btullis: Add the prometheus_statsd_exporter to all remaining airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/984607 (https://phabricator.wikimedia.org/T349532) [14:02:41] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: force coredns to resolve to A records in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/984250 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [14:03:12] (03Merged) 10jenkins-bot: Make wiktionary and mw.org provide og:site_name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [14:03:19] It is actually my first config patch. I have understood I have to turn on WikimediaDebug for related wikis, and check the meta tag is well added and nothing else [14:03:31] alright, sounds good [14:03:41] (https://wikitech.wikimedia.org/wiki/WikimediaDebug for more info, in case you didn’t see that yet) [14:04:01] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:981636|Make wiktionary and mw.org provide og:site_name (T348203)]] [14:04:09] T348203: Google displays “Wikipedia” as site title for some Wiktionary and MediaWiki.org pages - https://phabricator.wikimedia.org/T348203 [14:04:23] (don’t test anything yet, btw, I just wanted to check ahead of time ^^) [14:04:25] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/984607 (https://phabricator.wikimedia.org/T349532) (owner: 10Btullis) [14:04:45] !log installing cups updates from bookworm point release [14:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:50] (03PS4) 10WMDE-Fisch: Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 
(https://phabricator.wikimedia.org/T351708) [14:05:13] (03PS5) 10WMDE-Fisch: Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708) (owner: 10Awight) [14:05:19] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add the prometheus_statsd_exporter to all remaining airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/984607 (https://phabricator.wikimedia.org/T349532) (owner: 10Btullis) [14:05:21] (03CR) 10Aqu: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/984607 (https://phabricator.wikimedia.org/T349532) (owner: 10Btullis) [14:06:18] !log lucaswerkmeister-wmde@deploy2002 pols12 and lucaswerkmeister-wmde: Backport for [[gerrit:981636|Make wiktionary and mw.org provide og:site_name (T348203)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:30] Pols12: alright, now you can test with WikimediaDebug [14:06:43] (you can pick any server from the dropdown, the change should be on all of them) [14:06:55] OK thank you [14:08:08] (03PS7) 10Bartosz Dziewoński: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) [14:08:12] (03CR) 10Bartosz Dziewoński: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [14:10:22] Pols12: how does it look so far? [14:11:38] I didn’t notice any issue on Wiktionary. 
I’m testing on mediawiki.org [14:11:47] ok [14:11:50] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [14:12:13] I guess the main thing we can test so far is that the tag appears… whether it has the desired effect on google, only time can tell [14:12:37] !log installing debootstrap bugfix updates from Bookworm point release [14:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:30] I see a translated site_name (Викисловарь) on ru.wiktionary.org, so that looks good [14:13:53] Yes, I don’t see any issue either. [14:14:01] alright, I’ll go ahead and sync it then [14:14:15] !log lucaswerkmeister-wmde@deploy2002 pols12 and lucaswerkmeister-wmde: Continuing with sync [14:14:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Check for false from ThumbnailImage::getStoragePath [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984324 (https://phabricator.wikimedia.org/T353758) (owner: 10Kosta Harlan) [14:14:35] and starting the gate-and-submit for the backport already [14:14:36] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [14:16:04] !log installing distro-info-data updates from Bookworm point release [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:21] (03Merged) 10jenkins-bot: Check for false from ThumbnailImage::getStoragePath [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984324 (https://phabricator.wikimedia.org/T353758) (owner: 10Kosta Harlan) [14:19:11] This backport cannot be tested by me as it requires running a maintenance script (which I don't have access to do). This will be verified when running the script again. 
The script is currently only run manually, so should be safe to not test and test when running the script later. [14:19:24] Happy to answer any questions around the patch though. [14:19:56] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:981636|Make wiktionary and mw.org provide og:site_name (T348203)]] (duration: 15m 54s) [14:20:00] T348203: Google displays “Wikipedia” as site title for some Wiktionary and MediaWiki.org pages - https://phabricator.wikimedia.org/T348203 [14:20:15] ok [14:20:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10Jhancock.wm) [14:21:00] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:984324|Check for false from ThumbnailImage::getStoragePath (T353758)]] [14:21:05] T353758: MediaModerationPhotoDNAServiceProvider provides false to FileBackend::getFileContents causing a PHP Notice - https://phabricator.wikimedia.org/T353758 [14:21:11] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [14:21:58] WMDE-Mell: is it okay to deploy both your changes together? [14:22:03] Dreamy_Jazz: It's the script from https://phabricator.wikimedia.org/T351400 right ? I can run it for you if you need me to [14:22:04] since they should both be no-ops IIUC [14:22:18] !log lucaswerkmeister-wmde@deploy2002 kharlan and lucaswerkmeister-wmde: Backport for [[gerrit:984324|Check for false from ThumbnailImage::getStoragePath (T353758)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:23] oops, lots of red output from scap… [14:22:37] “Build of K8s images failed (non-K8s deployment will continue normally)” [14:22:42] “Disabling K8s deployments” errrrrr [14:22:51] that seems less than ideal if we’re serving 25% of traffic from them by now? 
[14:23:03] help me claime you’re my only hope ;) [14:23:09] Yeah looking [14:23:18] Lucas_WMDE: WMDE-Mell Also just vanished from our meeting. But yes to your question. [14:23:28] * Lucas_WMDE checks if k8s is actually disabled [14:23:48] I just got a response from mw-web.eqiad.main-85d98c5859-zzwcd [14:23:56] 500 from the registry [14:23:57] and another from mw-web.eqiad.main-85d98c5859-4wnk5 [14:24:08] so it doesn’t look like we completely stopped serving traffic from it, phew [14:24:10] I think it just skips and doesn't deploy to k8s [14:24:19] it doesn't disable mw-on-k8s [14:24:24] that sounds more reasonable [14:24:33] I guess the word “deployments” is ambiguous in that message ^^ [14:24:46] The question is why are we getting a 500 on image push from the registry... [14:25:04] if the error only happened once, then skipping the deployment to k8s should be fine, it’ll just catch up with the next deployment I assume [14:25:07] I'll relaunch a k8s only scap with image build to see if it's transient [14:25:08] but yeah, what you said [14:25:10] ok [14:25:28] WMDE-Fisch: ok thanks [14:25:30] claime: Once/If the k8s issue is solved, the maintenance script can be run on testwiki only currently. The expected behaviour is to not see any PHP Notices. [14:26:00] These would appear in logstash but don't appear in the output of the script in the terminal. [14:26:05] Lucas_WMDE: Ah it's locked for the backport for now [14:26:11] should I resume the backport then? [14:26:13] Yeah [14:26:15] !log lucaswerkmeister-wmde@deploy2002 kharlan and lucaswerkmeister-wmde: Continuing with sync [14:26:20] (there was nothing to test anyways) [14:26:29] I'll fix it in postTM [14:26:33] ok [14:26:44] or should I just do the next config change after that and see if that sorts out k8s too? [14:26:46] And WMDE-Mell is also back [14:26:57] But the script doesn't necessarily need testing, as it will be run manually again in the next day or so. 
[14:27:03] Lucas_WMDE: you can do that yeah, it's functionally equivalent [14:27:08] alright, will do [14:28:57] Dreamy_Jazz: ok, your call [14:29:01] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analyt [14:29:01] ems/Airflow [14:29:01] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wik [14:29:02] ics/Systems/Airflow [14:29:04] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/ [14:29:05] Airflow [14:29:05] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest 
deploy https://wikitech.wikimed [14:29:05] iki/Analytics/Systems/Airflow [14:29:06] I think no test is needed, but thanks for the offer. [14:29:07] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/Analytics/Syst [14:29:07] low [14:29:09] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed Btullis T353806 - This is a problem with the monitoring system, related to the latest deploy https://wikitech.wikimedia.org/wiki/An [14:29:09] Systems/Airflow [14:30:13] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (10cmooney) @ayounsi thanks for the feedback! >>! In T346428#9418490, @ayounsi wrote: > Lookin... [14:30:39] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:984324|Check for false from ThumbnailImage::getStoragePath (T353758)]] (duration: 09m 38s) [14:30:45] T353758: MediaModerationPhotoDNAServiceProvider provides false to FileBackend::getFileContents causing a PHP Notice - https://phabricator.wikimedia.org/T353758 [14:30:57] Thanks for the deploy. 
[14:31:23] the scap exited nonzero, I’m guessing that’s just due to the k8s issue
[14:31:25] * Lucas_WMDE scrolls up
[14:31:38] yeah I don’t see any other issues in the output
[14:31:57] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[14:31:59] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708) (owner: Awight)
[14:32:45] (Merged) jenkins-bot: Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[14:32:48] (Merged) jenkins-bot: Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] - https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708) (owner: Awight)
[14:33:11] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:976650|Remove BetaFeature code related to ReferencePreviews (T351708)]], [[gerrit:978035|Remove wgPopupsReferencePreviews now that it defaults to true (T351708)]]
[14:33:17] T351708: Cleanup beta feature code in the ReferencePreviews and related - https://phabricator.wikimedia.org/T351708
[14:35:02] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and awight and wmde-fisch: Backport for [[gerrit:976650|Remove BetaFeature code related to ReferencePreviews (T351708)]], [[gerrit:978035|Remove wgPopupsReferencePreviews now that it defaults to true (T351708)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:35:04] claime: looks like it recovered
[14:35:14] WMDE-Mell: please test :)
[14:35:20] Thanks, will do
[14:35:23] (that the feature is still available, I guess ^^)
[14:36:55] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:08] claime: opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/202 as a suggestion to improve the scap message that confused me :)
[14:37:31] Lucas_WMDE , all fine
[14:37:35] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and awight and wmde-fisch: Continuing with sync
[14:37:38] ok, thanks!
[14:38:51] Lucas_WMDE: yep, checking the logs, the image that failed to push was rebuilt and pushed correctly this time around
[14:38:56] yay
[14:38:59] (PS1) Muehlenhoff: os-reports: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984613
[14:39:22] MatmaRex: IIUC, the DiscussionTools config change only affects beta enwiki, is that right?
[14:39:33] (because all other wikis already have the permalinks backend enabled in IS.php afaict)
[14:39:47] Lucas_WMDE: yes
[14:39:50] alright
[14:42:58] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984613 (owner: Muehlenhoff)
[14:43:27] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:976650|Remove BetaFeature code related to ReferencePreviews (T351708)]], [[gerrit:978035|Remove wgPopupsReferencePreviews now that it defaults to true (T351708)]] (duration: 10m 16s)
[14:43:32] T351708: Cleanup beta feature code in the ReferencePreviews and related - https://phabricator.wikimedia.org/T351708
[14:43:35] (PS2) Lucas Werkmeister (WMDE): DiscussionTools: Enable permalinks backend on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/983440 (owner: Esanders)
[14:43:41] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/983440 (owner: Esanders)
[14:44:24] (Merged) jenkins-bot: DiscussionTools: Enable permalinks backend on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/983440 (owner: Esanders)
[14:45:06] oh nice, I forgot scap backport automatically skips beta changes ^^
[14:45:08] onwards then
[14:45:17] (PS8) Lucas Werkmeister (WMDE): RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) (owner: Bartosz Dziewoński)
[14:45:41] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) (owner: Bartosz Dziewoński)
[14:45:58] (scap warned about the “undeployed” Depends-On change because it’s puppet)
[14:46:23] (Merged) jenkins-bot: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) (owner: Bartosz Dziewoński)
[14:46:48] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:982416|RunSingleJob.php: Fix use of MWExceptionHandler before it's defined (T352265)]]
[14:46:55] T352265: Make changeprop-jobqueue error handling/httpbb tests better behaved: Uncaught Error: Class 'MWExceptionHandler' not found in /srv/mediawiki/rpc/RunSingleJob.php:42 - https://phabricator.wikimedia.org/T352265
[14:47:15] Lucas_WMDE, thanks for the deployment, everything looks fine
[14:47:25] +1 :-)
[14:47:51] np :)
[14:48:29] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:982416|RunSingleJob.php: Fix use of MWExceptionHandler before it's defined (T352265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:49:03] MatmaRex: can you test this change?
[14:49:24] (I’m not sure how tbh… I guess we’d need to curl /rpc/RunSingleJob.php internally?)
[14:49:42] (PS1) Btullis: Add PYTHONPATH to the airflow-scheduler command [puppet] - https://gerrit.wikimedia.org/r/984614 (https://phabricator.wikimedia.org/T353806)
[14:49:44] Lucas_WMDE: not really. but i can watch the logs later to see if that error disappears
[14:50:07] (PS2) Btullis: Add PYTHONPATH to the airflow-scheduler command [puppet] - https://gerrit.wikimedia.org/r/984614 (https://phabricator.wikimedia.org/T353806)
[14:50:12] (CR) Alexandros Kosiaris: [C: +1] "Ouch!" [deployment-charts] - https://gerrit.wikimedia.org/r/984250 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[14:51:47] alright, let’s try it
[14:51:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Continuing with sync
[14:51:55] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/980/con" [puppet] - https://gerrit.wikimedia.org/r/984614 (https://phabricator.wikimedia.org/T353806) (owner: Btullis)
[14:52:32] (PS1) Muehlenhoff: rsync::quickdatacopy: Add support for creating nftables-compatible firewall services [puppet] - https://gerrit.wikimedia.org/r/984615
[14:53:02] (CR) CI reject: [V: -1] rsync::quickdatacopy: Add support for creating nftables-compatible firewall services [puppet] - https://gerrit.wikimedia.org/r/984615 (owner: Muehlenhoff)
[14:54:12] jouncebot: next
[14:54:12] In 0 hour(s) and 5 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1500)
[14:54:32] James_F or others: do you have anything to deploy for wikifunctions?
[14:54:43] I’d still like to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/984601/ but it would probably run into your window
[14:55:01] (PS2) Muehlenhoff: rsync::quickdatacopy: Add support for creating nftables-compatible firewall services [puppet] - https://gerrit.wikimedia.org/r/984615
[14:56:54] (CR) Btullis: [V: +1 C: +2] Add PYTHONPATH to the airflow-scheduler command [puppet] - https://gerrit.wikimedia.org/r/984614 (https://phabricator.wikimedia.org/T353806) (owner: Btullis)
[14:56:55] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:982416|RunSingleJob.php: Fix use of MWExceptionHandler before it's defined (T352265)]] (duration: 10m 30s)
[14:57:23] T352265: Make changeprop-jobqueue error handling/httpbb tests better behaved: Uncaught Error: Class 'MWExceptionHandler' not found in /srv/mediawiki/rpc/RunSingleJob.php:42 - https://phabricator.wikimedia.org/T352265
[14:58:01] MatmaRex: I’ll wait a few minutes and deploy the $wgCommandLineMode change if it doesn’t look like anything wikifunctionsy is happening
[14:58:05] (CR) CI reject: [V: -1] rsync::quickdatacopy: Add support for creating nftables-compatible firewall services [puppet] - https://gerrit.wikimedia.org/r/984615 (owner: Muehlenhoff)
[14:58:47] (PS3) Muehlenhoff: rsync::quickdatacopy: Add support for creating nftables-compatible firewall [puppet] - https://gerrit.wikimedia.org/r/984615
[14:58:50] !log bking@cumin2002 disable/mask wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories on wdqs102[24] T352878
[14:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:55] T352878: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878
[14:59:01] Lucas_WMDE: thanks
[14:59:33] RECOVERY - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/platform_eng AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1500)
[15:00:45] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:03] RECOVERY - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:03] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1024.eqiad.wmnet
[15:01:03] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:51] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:01:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1024.eqiad.wmnet
[15:02:12] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1024.eqiad.wmnet
[15:02:41] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1024.eqiad.wmnet
[15:03:03] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984615 (owner: Muehlenhoff)
[15:04:52] (PS1) Muehlenhoff: kerberos::kdc: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984617
[15:04:55] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[15:04:56] MatmaRex: btw, for `curl -i -X POST -H 'Content-Length: 0' https://jobrunner.discovery.wmnet/rpc/RunSingleJob.php` I get 422 back now
[15:04:59] (on mwdebug2002)
[15:05:00] (PS4) Slyngshede: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799
[15:05:02] so that sounds good to me
[15:05:05] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[15:05:05] (CR) Slyngshede: Changes to Python infrastucture to help building Debian package. (4 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede)
[15:05:09] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1023.eqiad.wmnet
[15:05:25] Lucas_WMDE: oh, neat
[15:05:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1023.eqiad.wmnet
[15:05:40] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1022.eqiad.wmnet
[15:05:48] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts wdqs1022.eqiad.wmnet
[15:05:51] SRE, Infrastructure-Foundations, cloud-services-team, netbox, Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (ayounsi) To follow up only on the Cassandra usecase, my proposal here is to actually remove...
[15:05:52] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1022.eqiad.wmnet
[15:05:59] (PS2) Lucas Werkmeister (WMDE): Replace $wgCommandLineMode checks with MW_ENTRY_POINT [mediawiki-config] - https://gerrit.wikimedia.org/r/984601 (https://phabricator.wikimedia.org/T353751) (owner: Bartosz Dziewoński)
[15:06:03] ^ deploying this now since it looks like the window is otherwise free at the moment
[15:06:07] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/984601 (https://phabricator.wikimedia.org/T353751) (owner: Bartosz Dziewoński)
[15:06:16] (CR) TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/984601 (https://phabricator.wikimedia.org/T353751) (owner: Bartosz Dziewoński)
[15:06:42] (CR) DDesouza: [C: +1] research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/984329 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[15:07:32] (CR) Lucas Werkmeister (WMDE): "I think this is unblocked now, the config change was deployed and seems to work as this change expects:" [puppet] - https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) (owner: Bartosz Dziewoński)
[15:07:57] (PS1) Muehlenhoff: gitlab: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984618
[15:08:12] (PS2) Muehlenhoff: gitlab: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984618
[15:08:14] (Merged) jenkins-bot: Replace $wgCommandLineMode checks with MW_ENTRY_POINT [mediawiki-config] - https://gerrit.wikimedia.org/r/984601 (https://phabricator.wikimedia.org/T353751) (owner: Bartosz Dziewoński)
[15:08:35] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:984601|Replace $wgCommandLineMode checks with MW_ENTRY_POINT (T353751)]]
[15:08:49] T353751: Replace use of $wgCommandLineMode in operations/mediawiki-config - https://phabricator.wikimedia.org/T353751
[15:09:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1024.eqiad.wmnet
[15:09:04] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts wdqs1024.eqiad.wmnet
[15:09:21] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:25] MatmaRex: Confirming 422 behaviour on mw-jobrunners as well
[15:09:45] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984618 (owner: Muehlenhoff)
[15:10:03] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:984601|Replace $wgCommandLineMode checks with MW_ENTRY_POINT (T353751)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:10:25] so this is another one that might be tricky to test
[15:10:36] yeah
[15:10:39] i guess you can run some harmless maintenance script like lag.php, and verify that it doesn't crash
[15:10:44] I quickly checked that shell.php isn’t completely broken at least
[15:10:47] oh yeah, good point
[15:10:51] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:08] yup, lag looks good
[15:11:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Continuing with sync
[15:11:12] i'm not 100% sure that it actually reaches that code path though. that config code is quite weird
[15:12:28] actually, looking at the code more closely…
[15:12:32] why is it there at all?
[15:12:38] we’re inside if ( !$wmgUseCSPReportOnly && !$wmgUseCSP )
[15:12:53] and the later blocks are if ( $wmgUseCSPReportOnly ) and if ( $wmgUseCSP )
[15:13:22] so I don’t see what inside that function would even happen if it didn’t return early o_O
[15:14:27] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts wdqs1022.eqiad.wmnet
[15:15:12] I guess the early return had an effect before https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/579890/2/wmf-config/CommonSettings.php
[15:16:57] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:984601|Replace $wgCommandLineMode checks with MW_ENTRY_POINT (T353751)]] (duration: 08m 22s)
[15:17:12] T353751: Replace use of $wgCommandLineMode in operations/mediawiki-config - https://phabricator.wikimedia.org/T353751
[15:17:56] heh
[15:18:44] anyway
[15:18:44] * Lucas_WMDE done
[15:18:50] !log UTC afternoon backport+config window done
[15:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2074.codfw.wmnet with OS bullseye
[15:22:42] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2074.codfw.wmnet with OS bullseye executed with er...
[15:25:08] (CR) Volans: [C: +1] "LGTM" [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391 (owner: Slyngshede)
[15:26:56] Lucas_WMDE: Sorry, yes, go for it.
[15:27:16] James_F: all done already, but thanks ^^
[15:27:52] Yeah, I guessed. :-)
[15:28:36] (CR) JMeybohm: [C: +1] "nice find" [deployment-charts] - https://gerrit.wikimedia.org/r/984250 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[15:31:33] (PS1) Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384)
[15:32:02] (CR) CI reject: [V: -1] mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: Arnaudb)
[15:32:25] filed T353816 for the config oddity discussed above, in case anyone’s interested :)
[15:32:25] T353816: Fix or remove $wmgUseCSPReportOnlyHasSession (CSP only for logged-in users) - https://phabricator.wikimedia.org/T353816
[15:33:10] (PS2) Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384)
[15:33:39] (CR) CI reject: [V: -1] mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: Arnaudb)
[15:34:48] Lucas_WMDE: ah, i was just about to do the same thing. thanks
[15:34:58] Lucas_WMDE: i also found https://phabricator.wikimedia.org/T255562 and https://phabricator.wikimedia.org/T291867 complaining that it's broken
[15:35:02] (PS3) Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384)
[15:35:30] ah, I see
[15:35:30] (CR) CI reject: [V: -1] mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: Arnaudb)
[15:35:35] my bad for not searching
[15:36:08] lemme cross-link those
[15:37:44] (PS1) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[15:38:39] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[15:39:39] (PS4) Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384)
[15:40:35] SRE-swift-storage, Data-Persistence, media-backups: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (Nardog) But //how// did it go missing in the first place?
[15:40:40] (CR) CI reject: [V: -1] wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[15:41:49] (PS2) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[15:42:39] SRE, Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (MoritzMuehlenhoff)
[15:43:00] (CR) Bartosz Dziewoński: DiscussionTools: Enable permalinks backend on labs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/983440 (owner: Esanders)
[15:43:23] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (ABran-WMF)
[15:43:47] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ABran-WMF)
[15:43:51] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ABran-WMF)
[15:43:57] (PS6) Bartosz Dziewoński: Update expected RunSingleJob.php status code [puppet] - https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265)
[15:44:34] (CR) Bartosz Dziewoński: Update expected RunSingleJob.php status code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) (owner: Bartosz Dziewoński)
[15:44:47] (PS3) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[15:44:59] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[15:45:16] (CR) CI reject: [V: -1] wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[15:46:40] (PS4) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[15:46:54] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[15:48:28] (CR) Ahmon Dancy: "Thanks for deployment Dzahn!" [puppet] - https://gerrit.wikimedia.org/r/984259 (https://phabricator.wikimedia.org/T307738) (owner: Ahmon Dancy)
[15:51:48] (PS5) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[15:52:04] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[15:53:10] (CR) Alexandros Kosiaris: [C: -1] k8s topology labels: add row to rack transition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: Ayounsi)
[15:58:37] (CR) Volans: "I did a quick pass on the python side, left some comments." [puppet] - https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: Arnaudb)
[16:01:06] SRE, Infrastructure-Foundations, cloud-services-team, netbox, Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (cmooney) >>! In T346428#9418800, @ayounsi wrote: > To follow up only on the Cassandra usecas...
[16:03:04] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2080.codfw.wmnet with OS bullseye
[16:03:11] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be2080.codfw.wmnet with OS bullseye
[16:03:30] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:03:45] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:04:51] SRE-swift-storage, Data-Persistence, media-backups: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (MatthewVernon) I don't know, and I suspect it is impossible to know, a number of years after the fact.
[16:06:40] (CR) EoghanGaffney: [C: +1] gitlab: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984618 (owner: Muehlenhoff)
[16:06:42] (PS4) Sergio Gimeno: Temporary users: set notifyBeforeExpirationDays to ten days [mediawiki-config] - https://gerrit.wikimedia.org/r/983755 (https://phabricator.wikimedia.org/T344694)
[16:29:38] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (Alexkrakovsky) The error after uploading is gone. But it now appears later when you type in all the details...
[16:38:31] (CR) Andrew Bogott: [C: +1] [toolsdb] remove leftover template [puppet] - https://gerrit.wikimedia.org/r/984218 (owner: FNegri)
[16:38:50] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (MatthewVernon) That's a different issue, then; and I don't know anything about the internals of the upload w...
[16:40:37] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (Alexkrakovsky) >>! In T353498#9419172, @MatthewVernon wrote: > That's a different issue, then; and I don't k...
[16:46:00] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (MatthewVernon) But evidently plenty of people can upload (we have metrics for this - [[ https://grafana.wiki...
[16:48:10] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) i have been back and forth with dell no answer yet of what is causing this I still believe it is stil...
[16:50:46] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (Alexkrakovsky) >>! In T353498#9419204, @MatthewVernon wrote: > But evidently plenty of people can upload (we...
[16:56:24] (PS3) Andrew Bogott: wikireplicas: add 'section' to meta_p.wiki [puppet] - https://gerrit.wikimedia.org/r/788697 (owner: Majavah)
[16:56:26] (PS1) Andrew Bogott: wikireplicas maintain-meta_p: don't store cursor in schema class [puppet] - https://gerrit.wikimedia.org/r/984626
[17:04:09] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (PantheraLeo1359531) The same case as Alexkrakovsky described also applies for the upload/publishing of TIF f...
[17:05:20] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[17:06:16] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (Alexkrakovsky) It's a fact. Uploader doesn't work. We have hundreds of freshly scanned archival documents we...
[17:18:37] (CR) Andrew Bogott: "I believe this can be abandoned thanks to https://gerrit.wikimedia.org/r/c/operations/puppet/+/961783" [puppet] - https://gerrit.wikimedia.org/r/801380 (owner: Majavah)
[17:19:28] (PS11) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[17:19:36] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[17:21:39] (CR) Andrew Bogott: wikireplicas: add 'section' to meta_p.wiki (1 comment) [puppet] - https://gerrit.wikimedia.org/r/788697 (owner: Majavah)
[17:21:45] (PS1) Ottomata: wgEventStreams - Add eventlogging_MediaWikiPingback stream [mediawiki-config] - https://gerrit.wikimedia.org/r/984627 (https://phabricator.wikimedia.org/T323828)
[17:22:00] (PS11) Bking: wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878)
[17:25:06] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1022.eqiad.wmnet
[17:25:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1022.eqiad.wmnet
[17:26:31] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1022.eqiad.wmnet
[17:29:44] (CR) Dzahn: [C: +2] research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/984329 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[17:30:07] (CR) Dzahn: [C: +2] "https://gitlab.wikimedia.org/repos/sre/miscweb/research-landing-page/-/commits/master/" [deployment-charts] - https://gerrit.wikimedia.org/r/984329 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[17:37:50] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (PantheraLeo1359531) Yes. I have for example some orthophotos of Saxony I'd like to upload and a self-made HD...
[17:39:42] SRE, SRE-tools, Infrastructure-Foundations: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (cmooney) p:Triage→Low
[17:53:09] (CR) Ryan Kemper: "Just some nits" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking)
[17:53:17] SRE-swift-storage, Data-Persistence, media-backups: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (Ladsgroup) yeah, it can be anything, from hardware failure to network, to bugs in swift or mw, disk issues, etc. etc. We can't know without a time machine (for more rec...
[17:56:25] SRE-swift-storage, Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (Smaxims) It's also my case. Some files can be uploaded without any problems, but others cannot.
[17:56:54] (03PS12) 10Bking: wdqs: Work around systemd unit failures [puppet] - 10https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) [17:56:58] (03CR) 10Bking: wdqs: Work around systemd unit failures (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: 10Bking) [17:57:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: 10Bking) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1800) [18:05:05] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [18:05:06] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [18:05:30] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [18:05:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [18:06:05] PROBLEM - Host lsw1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:06:35] topranks: ^^^ oh ohhh [18:06:37] expectd? [18:06:39] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [18:06:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [18:07:06] volans: emm..... 
it makes sense :( [18:07:24] I didn't expect it'd cause that - I'm just tinkering with something but all is ok - switch is up [18:07:31] ok [18:07:35] all yours then :D [18:07:38] thx [18:08:38] thanks for the heads up :) [18:08:43] np :) [18:09:49] PROBLEM - Host lsw1-a8-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:55] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10herron) [18:09:59] 10SRE-swift-storage, 10Commons, 10UploadWizard: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (10MatthewVernon) [18:12:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (10cmooney) FWIW here is a log of running the cookbook against a switch where the interface is not set up: ` cmooney@cumin1001:~$ sudo cookbook... [18:13:19] (03PS1) 10Herron: admin: add dreamyjazz to deployment [puppet] - 10https://gerrit.wikimedia.org/r/984630 (https://phabricator.wikimedia.org/T353735) [18:13:55] RECOVERY - Host lsw1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 75.41 ms [18:15:19] RECOVERY - Host lsw1-a8-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.73 ms [18:15:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10herron) p:05Triage→03Medium Thanks for the detailed request! @thcipriani could you please reivew/approve this groupadd to `deployment`? Thanks in advance! [18:25:08] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (10cmooney) > it seems this limitation does not apply to 22.2 which we are using in codfw. An update on this. 
It seems that we do have this bug in 22.2, but we don't... [18:26:30] (PS5) Herron: pyrra: reload pyrra-filesystem and thanos-rule on cfg change [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) [18:27:00] (CR) CI reject: [V: -1] pyrra: reload pyrra-filesystem and thanos-rule on cfg change [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: Herron) [18:27:35] (PS6) Herron: pyrra: reload pyrra-filesystem and thanos-rule on cfg change [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) [18:29:02] (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/982/con" [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: Herron) [18:29:10] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (herron) [18:35:19] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (cmooney) [18:36:05] PROBLEM - Check systemd state on an-airflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@research.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:04] !log krinkle@deploy2002 Started deploy [integration/docroot@355ddbb]: (no justification provided) [18:38:11] !log krinkle@deploy2002 Finished deploy [integration/docroot@355ddbb]: (no justification provided) (duration: 00m 07s) [18:38:44] (PS1) RLazarus: k8s-controller-sidecars: Bump to 1.0.2-1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/984632 [18:39:00] (CR) Herron: [V: +1] pyrra: reload 
pyrra-filesystem and thanos-rule on cfg change (7 comments) [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: Herron) [18:46:38] (PS2) RLazarus: k8s-controller-sidecars: Bump to 1.0.2-1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/984632 [18:53:59] (PuppetFailure) firing: Puppet has failed on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:57:39] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:58:02] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:58:22] (CR) RLazarus: [V: +2 C: +2] k8s-controller-sidecars: Bump to 1.0.2-1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/984632 (owner: RLazarus) [18:58:41] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:59:05] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:59:07] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:59:45] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:00:05] dancy and brennen: gettimeofday() says it's time for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1900) [19:00:05] dancy and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T1900). 
[19:04:33] 🍔:🟢 [19:04:33] 🥤:🟢 [19:04:33] 🚅:🟢 [19:05:17] (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/984633 (https://phabricator.wikimedia.org/T350086) [19:05:19] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.42.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/984633 (https://phabricator.wikimedia.org/T350086) (owner: TrainBranchBot) [19:06:04] (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/984633 (https://phabricator.wikimedia.org/T350086) (owner: TrainBranchBot) [19:07:08] 🚀 [19:07:24] (PS2) Dwisehaupt: Add dyna and discovery records for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995) [19:08:29] (CR) CI reject: [V: -1] Add dyna and discovery records for community-crm [dns] - https://gerrit.wikimedia.org/r/983950 (https://phabricator.wikimedia.org/T302995) (owner: Dwisehaupt) [19:10:23] Rolling back due to errors. 
[19:10:41] (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/984634 (https://phabricator.wikimedia.org/T350086) [19:10:43] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.42.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/984634 (https://phabricator.wikimedia.org/T350086) (owner: TrainBranchBot) [19:11:27] (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/984634 (https://phabricator.wikimedia.org/T350086) (owner: TrainBranchBot) [19:12:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:15:05] en.wikisource.org just went down, returning "Cannot declare class ParserOutput, because the name is already in use" [19:15:43] xover: I'm rolling the train back right now. It should be done and (hopefully) back to normal in a few minutes. [19:16:04] Thanks. 
[19:16:15] I'll file a phabricator task shortl [19:16:18] *shortly [19:16:26] (PS2) Dwisehaupt: Add CDN configuration for new community-crm [puppet] - https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T302995) [19:17:16] (MediaWikiHighErrorRate) firing: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:17:18] saw a spike in 5XX requests rate on all DCs, most probably related [19:18:11] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.10 refs T350086 [19:18:16] T350086: 1.42.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T350086 [19:18:19] I'm still getting a 500 from wikidata.org [19:18:28] Cannot declare class ParserOutput, because the name is already in use [19:19:06] yeah.. I'm still seeing errors logged.. [19:19:10] from at least mw1350.eqiad.wmnet, mw1397.eqiad.wmnet [19:19:14] this smells like an opcache bug [19:19:41] hmm, i saw someone complain about this error on slack the other day [19:19:48] yeah, this is oddly familiar [19:20:01] https://wikimedia.slack.com/archives/C01R06P8D1B/p1702933341679149 [19:20:13] bblack: cwhite: ping as you seem to be on call [19:20:17] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking) [19:21:32] errors are coming from wmf.9 too [19:22:17] (MediaWikiHighErrorRate) firing: (10) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:22:26] is there a backtrace for this? [19:22:28] I think bblack is on vacation on these days [19:22:31] https://phabricator.wikimedia.org/T353835 [19:22:31] is it an (un)serialization problem? 
[19:22:38] is OOO on his calendar [19:22:41] (PS3) Eevans: restbase: set production role and add config for restbase2033 [puppet] - https://gerrit.wikimedia.org/r/981607 (https://phabricator.wikimedia.org/T352468) [19:22:43] (PS3) Eevans: restbase: set production role and add config for restbase2034 [puppet] - https://gerrit.wikimedia.org/r/981608 (https://phabricator.wikimedia.org/T352468) [19:22:45] (PS3) Eevans: restbase: set production role and add config for restbase2035 [puppet] - https://gerrit.wikimedia.org/r/981609 (https://phabricator.wikimedia.org/T352468) [19:23:11] I suspect the fix is a rolling php-fpm restart or something similar [19:23:59] we're not seeing recoveries as the rollback is progressing? [19:24:09] no recovery. [19:24:28] (CR) Eevans: [C: +2] restbase: set production role and add config for restbase2033 [puppet] - https://gerrit.wikimedia.org/r/981607 (https://phabricator.wikimedia.org/T352468) (owner: Eevans) [19:24:38] !log dancy@deploy2002 Starting php-fpm-restarts [19:25:12] Commons affected too. [19:25:13] cwhite: fabfur: who's IC? we have an user-visible outage of at least some level [19:25:59] also I'm trying to figure out if it's just some appservers affected, or all of them [19:26:23] the stack trace here sure looks like an (un)serialization problem: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2023.12.20?id=KqCliIwBaR9sTI-qWnyz [19:26:30] But thumbs are still loading on enWP so it's not really affecting the multimedia stack / thumbor / etc. 
[19:27:06] !log dancy@deploy2002 Finished php-fpm-restarts [19:27:23] I'm going to start a doc [19:27:36] i would guess that wmf.9 serializes instances of ParserOutput that wmf.10 can't unserialize, and wmf.10 serializes instances of MediaWiki\Parser\ParserOutput that wmf.9 can't unserialize [19:27:43] (CR) Bking: [C: +2] wdqs: Work around systemd unit failures [puppet] - https://gerrit.wikimedia.org/r/984620 (https://phabricator.wikimedia.org/T352878) (owner: Bking) [19:27:44] just guessing though, might be wrong [19:27:57] if this is what I think it is, it's going to get way worse [19:27:59] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host wdqs1022.eqiad.wmnet [19:28:06] Oh, weird, Commons:Main_Page is affected; but not seemingly file descd pages or village pump. [19:28:08] ParserOutputs shouldn't be serialized cross-wiki though [19:28:20] there's probably an easy way to hack around this, if you find someone who understands this [19:28:21] the decode is happening from PC entries? [19:28:23] has a revert started yet? [19:28:25] does parseroutput have a cache version we can easily bump? 
[19:28:35] probably from the content transform team [19:28:36] legoktm: the train was already rolled back [19:28:36] taavi: that will bring down everything [19:28:53] yeah, now the entries have been polluted [19:29:06] https://www.mediawiki.org/wiki/MediaWiki_1.42/wmf.10 has several parseroutput related changes [19:29:21] very likely the namespace work I think [19:29:23] purges will fix the affected pages [19:29:24] Incident doc https://docs.google.com/document/d/1or2rzRcvBiPQPA-EACWZktUUwIJpi3NRJ9X-FzU6p6A/edit [19:29:25] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:29:35] yeah about to say, just purge any page [19:29:39] * taavi purges wikidata main page [19:29:49] this is not the first time this happened, let me see what we did then [19:29:55] I see that we are debugging parser stuff - https://wikitech.wikimedia.org/wiki/Main_Page also shows that ParserOutput name is already in use [19:30:02] cwhite: can i have edit access? taavi@wm.o [19:30:20] the last time, it was a renaming of some properties of the class, or something like that [19:30:25] but now we renamed the whole class [19:30:33] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:43] Why the class is being serialized [19:30:46] Purge brought Main_Page of enWS back. [19:30:57] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host restbase2033.codfw.wmnet with OS bullseye [19:31:26] taavi: done [19:32:07] wikitech main page back up [19:32:21] why scap canary checks didn't catch this? [19:32:52] can we do a try { } catch { purge(); } hack around the unserializing code? 
[19:33:16] (purge Commons:Main_Page -> all ok) [19:33:27] if you purge every single page while deploying wmf.10, that probably won't be good [19:33:36] ^^ any transition has to be gradual [19:33:48] here now. [19:33:51] (the purge of our thankyou page - https://thankyou.wikipedia.org/wiki/Thank_You/en?country=US - fixed it as well, thanks) [19:34:09] Trying to revert the patch [19:34:10] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984491 [19:34:11] yeah, but I mean in wmf.9 to recover the pages that are currently broken [19:34:20] (no donation impact other than the thankyou page not loading for donors after their donations were already complete, so not the worst, but bad optics for a bit) [19:34:29] Amir1: are you sure it's that one and not any of the several other parseroutput touching patches in this train? [19:35:02] not saying 100% sure but that's very likely [19:35:36] legoktm: the class has an alias so that shouldn't be an issue but I'm probably missing something obvious [19:35:58] the line it is complaining about is the class ParserOutput... line [19:36:05] 5xx responses still elevated [19:36:19] that phab task error sounds like a name space issue ... I saw reports of those errors in dev checkouts (on slack previously). [19:36:42] subbu: here's a backtrace that look fishy: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2023.12.20?id=KqCliIwBaR9sTI-qWnyz [19:36:45] here for the fun [19:36:51] it goes through #4 /srv/mediawiki/php-1.42.0-wmf.9/includes/json/JsonCodec.php(73): class_exists() [19:37:05] this is clearly the problem [19:37:15] anything after that in the backtrace is some nonsense garbage [19:37:35] php is keeping a cache which isn't being flushed by the rollback?  is that what's going on? 
[19:38:07] i think we aren't sure what's going on [19:38:08] why class exists tries to load the class [19:38:22] it probably should be a different check [19:38:27] is that specific to foundationwiki OR other wikis as well? [19:38:28] Isn't that what class_exist() does if it's not already loaded? [19:38:33] the parser cache contains the name of the class it is supposed to deserialize [19:38:39] my best guess was: i would guess that wmf.9 serializes instances of ParserOutput that wmf.10 can't unserialize, and wmf.10 serializes instances of MediaWiki\Parser\ParserOutput that wmf.9 can't unserialize [19:38:44] that cache info was corrupted when the class was temporarily namespace [19:38:45] d [19:38:47] but i don't understand why it can't unserialize them [19:39:01] https://stackoverflow.com/questions/3812851/there-is-a-way-to-use-class-exists-and-autoload-without-crash-the-script#14269760 [19:39:07] maybe it needs a false? [19:39:11] because the stored json is literally { _type: "some namespaced class name for parserOutput" } [19:39:13] can someone get a diff for a problematic and an unproblematic parser cache entry? [19:39:23] I can tell you what it the diff will show :) [19:39:37] there is a _type field which names the class to deserialize [19:39:45] taavi: it's two ways now, the namespaced class now won't be recognized by the old rolled back code [19:39:49] right [19:40:12] I suggest just adding false to class_exists [19:40:15] https://www.php.net/manual/en/function.class-exists.php [19:40:17] Amir1: in that case I would expect a "No such class MediaWiki\..." error, not an "unable to redefine" error [19:40:20] Isn't there a hook that can act as an adaptor for parser cache values ... where we can massage the loaded josn obejct to change the class name? [19:40:26] you have to flush the entries created during the roll-forward.  
arguably the parser cache loading code should be more robust and just drop the cache entry on the ground rather than crashing the whole site [19:40:38] subbu: the class has an alias, that's not the problem [19:40:50] it doesn't have an alias now that you've rolled back, though [19:41:03] yeah, I think we should actually roll forward after the fix [19:41:11] otherwise it's going to be even messier [19:41:16] let me try something [19:41:28] do we have a reproducible case? [19:41:47] (the User: page on enWS where I first saw the problem is still broken, so I'm guessing all pages hit while wmf.10 was out is affected) [19:41:48] Amir1: https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Privacy_policy [19:42:04] cool [19:42:09] please don't purge this page [19:42:40] amir: i think 'false' to class_exists will work in the case of ParserOutput, it will probably cause loads to fail in weird corner cases, though [19:42:41] MatmaRex, I asked earlier, but is this only on foundationwiki then? thankfully this is group 1, not group 2. [19:42:55] i don't know [19:43:00] subbu: I think the hook you're talking about is https://www.mediawiki.org/wiki/Manual:Hooks/RejectParserCacheValue [19:43:09] like wikidata puts its own class types in extension data, and i'm not 100% certain they will be already loaded when JsonCodec tries to deserialize the parser output [19:43:28] the train was rolled forwards to group1, in theory that means that any group1 page that was loaded could be affected aiui [19:43:29] taavi: check it out in mwdebug2001 [19:43:33] adding false fixes the issue [19:43:48] Amir1 see above, i think you'll still fail but in corner cases [19:44:13] subbu: I have seen it everywhere [19:44:19] ack. [19:44:40] we can't really reject PC, it'll bring down everything [19:45:00] why can we not reject entries from the specific affected timestamps on the specific affected wikis? 
[19:45:01] just reject the PC entries created during the rollforward [19:45:04] yeah [19:45:04] you odn't have to reject it .. just modify the PC object and return true. [19:45:12] cscott: weird corner cases is better than status quo tbh [19:45:20] we've done that in the past [19:45:39] eg https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/608971 [19:45:45] i think we should revert the parser output namespacing and reject the cached entries during the time period affected [19:45:54] then we can discuss a more proper solution in 2024 [19:46:07] if the problem is that the class name in __type property in ParserOutput object is wrong (because of the namespace), change the __type field and return true? [19:46:11] the revert is not clean [19:46:55] subbu: that could work [19:46:55] does ParserOutput not have a big warning message that says "please do not change this class in any way without carefully considering parser cache issues"? [19:47:25] PC shouldn't serialize a php object in the first place [19:47:32] SRE-Sprint-Week-Sustainability-March2023, Beta-Cluster-Infrastructure, DBA, MediaWiki-libs-Rdbms, Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (Pppery) [19:48:19] Amir1, I would recommend that approach of using the PC hook. I haven't written such an adaptor before .. but Daniel or Timo might have .. but one of us could probably try that approach to see if that works. [19:48:35] i like subbu's hook better than adding `false` to class_exists, but that leaves us with rollback issues, which is Not Great. [19:48:50] what is the rollback issue? [19:48:51] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage [19:49:01] that is, we need to have a hook in mw-config that matches the "current" version on wiki. 
[19:49:19] so you still can't rollback without changing the hook, or rollforward without changing the hook [19:50:05] as long as there are no other reasons to rollback the current train, we're fine, but that's a big caveat. [19:50:20] so far we haven't seen any issues regarding that [19:50:37] hm, just realized.. does using RejectParserCacheValue require being able to load the entry from the cache in the first place? [19:50:42] i still think we're better just getting rid of the namespacing until 2024, then at least the hook can just normalize to the old parseroutput name instead of trying to juggle two different names at once [19:50:47] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984635? [19:50:59] taavi, yes. [19:51:25] the class has an alias why it's not working? [19:51:39] (i think) [19:51:48] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage [19:51:49] the class_exists false trick is IMO still way too likely to cause weird edge cases to be deployed just before the holidays [19:52:13] Amir1 i suspect the original issue was transient, but the rollback created permanent issues. [19:53:12] cscott: the error is not class is not found, it's class already exists, if all of our errors were "class not found" yes but that's not the case [19:53:51] Amir1: that looks like it will cause new "class not found" errors for other classes that actually depended on autoloading there [19:53:51] I think you can mitigate by just catching the new error in ParserCache:restoreFromJson() [19:53:52] taavi: idea how to tackle the edge cases? e.g. once called and it works, call it again in the condition? [19:54:33] yeah, let's do that [19:55:11] my hook idea won't work ... as taavi suspected, this fails even more the hook runner can invoke it. [19:55:12] someone earlier suggested just checking for `$class == 'ParserOutput'` and `$class == 'MediaWiki\\Parser\\ParserOutput'` in that code. 
that sounds more likely to work [19:55:27] ^ should work. [19:55:30] yeah [19:55:40] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984636 B [19:55:41] making the patch [19:55:42] ^ [19:55:56] belt and suspenders -- we shouldn't crash when we can't unserialize [19:58:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service,httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:59:33] can you just try-catch a class declaration error? [19:59:53] I couldn't find the exact class name being thrown in logstash [19:59:57] maybe you can dig it out of there [19:59:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:00:14] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984637 is the explicit class name check [20:00:50] so https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984636 to make this not crash, and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984637 (on both old and new versions) to make us robust. [20:01:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:29] actually both patches could/should be cherry-picked to the old branch too, maybe i should squash them. 
[20:01:49] I'm not sure if it's an actual exception, iirc php has different concepts for error and exception [20:01:50] cscott: this is good imo, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984637 just one note [20:02:11] by the way. there are also errors like: "Cannot declare class SiteList, because the name is already in use". probably caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/983206 [20:02:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:02:16] (MediaWikiHighErrorRate) firing: (10) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:02:40] MatmaRex: do you have a sample link for that? [20:02:42] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=appservers-ro.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHi [20:02:42] MatmaRex: transient?  or ongoing? 
[20:02:44] MatmaRex: that is very likely cache pollution [20:02:51] it'll go away [20:02:53] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:02:59] (a logstash link that is) [20:03:20] https://logstash.wikimedia.org/goto/c012db54770432932942ce9b55b33476 [20:03:55] yeah, cache pollution, it'll go away [20:04:41] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:45] cwhite: Amir1: anything I can do to help? [20:05:07] I hope to get this merged and deployed ASAP https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984637/1/includes/json/JsonCodec.php [20:05:33] Also random note, it really even shouldn't try to load json on a wikitext, these errors all are on json [20:05:38] *wikitext pages [20:07:20] Amir1, I didn't understand your review comment on Scott's patch: why throw an exception there .. that will reject all parser cache values effectively. [20:07:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:08:00] subbu: Let me explain, PC::get() [20:08:05] has is_string() [20:08:13] and if the value is string, it tries to load the json [20:08:18] right. [20:08:21] for reasons unknown to me [20:08:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:08:54] then if it's null, it should be fine? 
[20:09:07] like "attempt to load json from wikitext" [20:09:08] sorry i missed something due to a flakey client, could you repeat? [20:09:26] (ParserCache serializes ParserOutput as Json since a couple of releases ago, it's much better than the old PHP serialization) [20:09:49] ah, I think I get it now [20:10:00] Amir1, the json is stored as a string in the db ... so, it has to be serialized back to json object. [20:10:26] and i think the json codec is using that info to construct the parser output (and nested objects). [20:10:33] yeah, I understand that part, my confusion was that I thought it's trying to see if it's a json content type [20:10:36] ok. [20:11:07] no, it's just loading the previous-parsed page from the parser cache [20:11:22] let me just double check if it works in mwdebug [20:11:55] Amir1, ok reg checking .. cscott: so i'm happy +2 your first patch in the chain. Looking at the second patch now assuming Amir is happy with removing his comment there. [20:12:23] not remove but that we sholdn't be throwing an objection there. [20:13:24] (PS1) Ladsgroup: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) [20:13:42] Okay if I deploy this? subbu cscott [20:13:54] works for me, hopefully works for the wikis [20:14:08] +2ed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984636 which should at least protect from crashers .. but, it can still cause problems since it will now force a reparse for failed fetched. So, we need the other patch. [20:14:20] cscott: tested in mwdebug [20:14:43] Amir1, okay. [20:14:49] deploying [20:14:53] amir that might effectively purge your test page, just fyi [20:14:55] question. with these patches, can we avoid reverting the namespacing patch? [20:15:09] i think so since it handles both names. 
[20:15:10] (CR) Ladsgroup: [C: +2] Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) (owner: Ladsgroup) [20:15:13] MatmaRex i think so [20:15:31] I wrote it using ParserCache::class so it should be safe to backport to the old train as well [20:15:41] yeah, specially given the mess reverts would create [20:15:42] (i'm asking because it seems good to avoid it, in case some code in wmf.10 relies on it already) [20:15:45] that way hopefully future rollbacks won't be an issue [20:17:04] Amir1, are you first putting this on mwdebug for additional testing? Or are you satisfied? [20:17:28] https://integration.wikimedia.org/ci/job/mediawiki-core-php74-phan-docker/28546/console [20:17:35] I did [20:17:38] but it's broken :D [20:17:42] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=appservers-ro.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrors [20:19:10] (PS2) Ladsgroup: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) [20:19:17] (by the way, the error rate has dropped. probably the affected pages got mostly purged: https://logstash.wikimedia.org/goto/39a168271f1e434ea6e86d5b5a61c953 ) [20:19:22] cscott: happy with my change to make phan happy? 
[20:19:44] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984492/1..2/includes/json/JsonCodec.php [20:19:57] Amir1: shouldn't that be `use ParserOutput` on wmf.9 [20:20:03] working on it [20:20:10] MatmaRex yeah i was too clever by half [20:20:15] and `use MediaWiki\Parser\ParserOutput;` on wmf.10 [20:20:21] yeah [20:20:54] (PS3) Ladsgroup: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) [20:21:15] oh, i might have just overwrite that. [20:21:17] (CR) Ladsgroup: [C: +2] Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) (owner: Ladsgroup) [20:21:32] Pushing wmf.9 [20:22:16] (MediaWikiHighErrorRate) firing: (8) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:22:51] (PS1) Ladsgroup: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984493 (https://phabricator.wikimedia.org/T353835) [20:23:17] (PS2) Ladsgroup: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984493 (https://phabricator.wikimedia.org/T353835) [20:23:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:37] fixed patches on master [20:24:04] I'm deploying wmf.9 [20:24:07] and then wmf.10 [20:24:15] shall we roll the train forward afterwards? [20:24:33] OK with me! [20:24:39] do you want to take https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984636 as well? 
[20:26:22] I'm a bit scared of that change in general, if there is an issue with serialization in general, it can basically trigger a full invalidation of parser cache. Correct? [20:26:29] (03CR) 10C. Scott Ananian: [C: 03+1] Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984493 (https://phabricator.wikimedia.org/T353835) (owner: 10Ladsgroup) [20:26:49] Amir1 yes, but it will also spam logs [20:27:17] so either the site is down completely with 500s or the site slows to a crawl w/ no parser cache (but still front end cache working presumably) [20:27:20] i think the latter is better [20:27:50] I think that error gets filtered out by default filters, not super sure [20:28:14] maybe we should tweak it to use a log message that's more UBN [20:28:25] I'd be okay if you split to two [20:28:26] i don't know the details of the log filters and what it takes to get SRE paged [20:28:33] first, InvalidArgument, etc. [20:28:40] and then another general catch [20:28:57] with a different error message [20:29:14] would that work? [20:29:19] cscott: we can consider that for the new year probably ... since the jsoncodec one handles the current namespace issue. [20:30:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) (owner: 10Ladsgroup) [20:30:02] Amir1 that was my original thought, then phan complained about duplicate catches, let me see. [20:30:33] cscott, because you had identical catch blocks .. changing the message should suffice to make it happy. 
[20:30:33] suppress phan, he is not our overlord, we are his xD [20:30:49] !log eevans@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2033.codfw.wmnet with OS bullseye [20:31:30] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T353838 (10Damilare) [20:31:31] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host restbase2033.codfw.wmnet with OS bullseye [20:32:08] subbu if so, phan is smarter than i thought [20:32:32] new PS uploaded for that one, i'm sure we could bikeshed on the exact message [20:33:38] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T353838 (10Damilare) [20:35:36] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Damilare Adedoyin - https://phabricator.wikimedia.org/T353838 (10Pppery) [20:36:22] thanks ^^ [20:37:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2074.codfw.wmnet with OS bullseye [20:38:07] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2074.codfw.wmnet with OS bullseye [20:39:17] in 20 mins, I have to jump on an interview. but looks like cscott and Amir1 have this under control now. [20:39:33] yeah, [20:39:44] I'm around, technically sick but sigh, whatever [20:40:13] (03Merged) 10jenkins-bot: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/984492 (https://phabricator.wikimedia.org/T353835) (owner: 10Ladsgroup) [20:40:13] Where are we at now? Deployment step? [20:40:20] yup, just got merged [20:40:24] about to be rolled [20:40:30] thanks for jumping in and fixing despite being sick! [20:40:36] ack, thank you! 
[20:40:39] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:984492|Protect against ParserOutput re-namespacing (T353835)]] [20:40:46] T353835: PHP Fatal Error from line 51 of /srv/mediawiki/php-1.42.0-wmf.9/includes/parser/ParserOutput.php: Cannot declare class ParserOutput, because the name is already in use - https://phabricator.wikimedia.org/T353835 [20:41:04] subbu: sorry for breaking things :D [20:41:15] we have 900 more classes to go sooooo [20:41:24] you must be running low on t-shirts. [20:41:47] actually I think this is the first since a year ago or so [20:41:57] (when I accidentally depooled all of codfw) [20:42:01] :-) [20:42:11] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:984492|Protect against ParserOutput re-namespacing (T353835)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:16] (MediaWikiHighErrorRate) firing: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:42:37] well, we know what not to do again when namespacing. [20:43:21] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [20:43:31] Amir1: I suspect that one problem w/ the canary check is that it doesn't check k8s pods. [20:44:09] i think the namespacing was actually fine (except for transient issues) the big issue was that it wasn't rollback safe because the new names got into the parser cache [20:44:14] And also the error rate just wasn't high enough during the 20 seconds that the check was happening. [20:45:11] better group1 than group2 though. [20:45:19] Indeed. [20:45:32] Thanks to all of you for jumping in so quickly! I was stressed out. [20:45:43] probably we should add a semi-permanent lookaside to jsoncodec, because i can see this happening for other classes which happen to end up in parser cache.  
there aren't /that/ many, but it's "anything which an extension can pass to ::setExtensionData" so we should probably have a big warning at ::setExtensionData about that and some means to [20:45:44] provide forward- and back- compatibility hooks for extensions. [20:46:11] a project for 2024 [20:47:25] !log eevans@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2033.codfw.wmnet with OS bullseye [20:47:28] there are also php objects being serialized into caches like sitelink that is also riding with this train [20:47:33] but with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/984636 hopefully all that would happen is that pages with that certain extension on them would end up temporarily uncached, rather than returning 500s. [20:47:38] but that should be also problem of forward compat [20:47:53] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host restbase2033.codfw.wmnet with OS bullseye [20:48:00] errors have basically gone to zero [20:48:12] MatmaRex, thanks for alerting us on slack. 
[20:48:44] subbu thanks for alerting me on google meet :) [20:48:53] lol [20:48:58] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:984492|Protect against ParserOutput re-namespacing (T353835)]] (duration: 08m 19s) [20:49:03] T353835: PHP Fatal Error from line 51 of /srv/mediawiki/php-1.42.0-wmf.9/includes/parser/ParserOutput.php: Cannot declare class ParserOutput, because the name is already in use - https://phabricator.wikimedia.org/T353835 [20:49:21] (03CR) 10Ladsgroup: [C: 03+2] Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984493 (https://phabricator.wikimedia.org/T353835) (owner: 10Ladsgroup) [20:49:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984493 (https://phabricator.wikimedia.org/T353835) (owner: 10Ladsgroup) [20:50:03] i'm still confused why this situation would fail in this particular way… it can't declare class ParserOutput, because it already exists, but what is the ParserOutput that already exists? [20:50:20] MatmaRex: the alias? [20:51:05] can't say for sure either [20:51:07] php already loaded and cached the old name for the cache locally.  we scap up new code but don't fully restart php and so it still has the old definition of the class in a cache.  that's my handwavy answer at least. [20:51:43] hmm [20:51:58] so it's phantom code? not on disk? [20:52:16] (03PS1) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [20:52:16] (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:52:23] yay! 
[20:52:39] MatmaRex well, it's the code that *used* to be on disk, does that count? [20:52:46] (03CR) 10CI reject: [V: 04-1] keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [20:52:46] heh [20:53:02] I think we rebooted php-fpm so the cache shouldn't be an issue [20:53:06] I saw Ahmon do it [20:53:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2074.codfw.wmnet with reason: host reimage [20:53:29] (my first thought was to reboot php fpm everywhere) [20:53:39] Amir1: but that was after you rolled back?  once you rolled back you had the problem of the new name saved in the cache, that's a different bug [20:53:40] php-fpm restart happens at the end of all deployments (including scap sync-wikiversions). [20:53:43] I think the transient error was before the reboots. [20:53:48] subbu right [20:54:41] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:54:51] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:54:58] i should have said "new name saved in the parser cache" since there are two different caches here.  the transient exception is about the old ParserOutput being in the php *class cache*, and the exception after rollback is about the new name being in the mediawiki *parser cache*. [20:55:04] alright .. tuning out of here now to prep for the interview. will catch up later. 
[20:56:33] (03PS2) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [20:56:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2074.codfw.wmnet with reason: host reimage [20:57:27] " Error 1213: Deadlock found when trying to get lock; try restarting transaction" errors have increased since that last deployment. [20:58:29] cscott: Is this part of the "slow to a crawl" that you expected? [20:59:17] no, the patch that would throw out cache contents wasn't merged nor backported yet, so it's not my fault :) [20:59:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:59:26] haha ok.. hmm. [20:59:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:00:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T2100). [21:00:04] No Gerrit patches in the queue for this window AFAICS. [21:00:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:00:16] dancy: i think that's just unrelated bad timing [21:00:24] ok good. 
[21:00:43] I'm maxed out on drama! [21:00:55] someone edited Module:String on eswiktionary [21:01:12] so it's currently parsing approximately every single page on the wiki [21:01:19] which probably shouldn't cause exceptions, but… [21:01:35] !log aqu@deploy2002 Started deploy [airflow-dags/research@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] [21:01:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:01:58] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [21:02:04] !log aqu@deploy2002 Finished deploy [airflow-dags/research@d5ac513]: Make sure airflow-dags is up-to-date before activating metrics [airflow-dags@d5ac5131] (duration: 00m 28s) [21:02:08] Can someone refresh me on whether or not we're ready to try rolling forward to group1 again? [21:02:24] (03PS3) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:02:26] dancy: I'm pushing wmf.10 [21:02:26] I was just about to ask the same question. 
[21:02:32] jenkins is slooooowwwww [21:02:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:02:52] (like the backport of the fix in wmf.10) [21:03:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:03:20] (not the actual train) [21:04:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:04:45] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage [21:06:46] (03Merged) 10jenkins-bot: Protect against ParserOutput re-namespacing [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984493 (https://phabricator.wikimedia.org/T353835) (owner: 10Ladsgroup) [21:07:04] (03PS4) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:07:08] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:984493|Protect against ParserOutput re-namespacing (T353835)]] [21:07:14] T353835: PHP Fatal Error from line 51 of /srv/mediawiki/php-1.42.0-wmf.9/includes/parser/ParserOutput.php: Cannot declare class ParserOutput, because the name is already in use - https://phabricator.wikimedia.org/T353835 [21:07:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:08:08] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2033.codfw.wmnet with 
reason: host reimage [21:08:41] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:984493|Protect against ParserOutput re-namespacing (T353835)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:49] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [21:14:55] (03PS5) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:15:06] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:15:22] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:984493|Protect against ParserOutput re-namespacing (T353835)]] (duration: 08m 13s) [21:15:26] T353835: PHP Fatal Error from line 51 of /srv/mediawiki/php-1.42.0-wmf.9/includes/parser/ParserOutput.php: Cannot declare class ParserOutput, because the name is already in use - https://phabricator.wikimedia.org/T353835 [21:15:57] dancy: deployed. Feel free to move the train forward [21:16:05] sorry for the mess [21:16:07] alright.. fingers crossed! 
[21:16:26] thanks all for braving the mess <3 [21:16:26] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984641 (https://phabricator.wikimedia.org/T350086) [21:16:28] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984641 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [21:17:10] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984641 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [21:19:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:19:54] (03PS6) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:20:06] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:21:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:24:48] (03PS7) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:24:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:24:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2074.codfw.wmnet with OS bullseye [21:24:53] !log 
dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.10 refs T350086 [21:24:58] T350086: 1.42.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T350086 [21:24:59] (03PS1) 10Eevans: Revert "restbase: set production role and add config for restbase2033" [puppet] - 10https://gerrit.wikimedia.org/r/984494 [21:25:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2074.codfw.wmnet with OS bullseye completed: - ms-... [21:25:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:26:06] !log eevans@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2033.codfw.wmnet with OS bullseye [21:26:29] (03CR) 10Eevans: [C: 03+2] Revert "restbase: set production role and add config for restbase2033" [puppet] - 10https://gerrit.wikimedia.org/r/984494 (owner: 10Eevans) [21:27:33] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:26] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host restbase2033.codfw.wmnet with OS bullseye [21:30:51] !log dancy@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.10 refs T350086 (duration: 05m 57s) [21:31:00] T350086: 1.42.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T350086 [21:32:59] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:33:20] (03PS1) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 
10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:33:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [21:33:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2076.codfw.wmnet with OS bullseye [21:33:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [21:33:31] cwhite: wanna close the incident in status page? [21:33:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2077.codfw.wmnet with OS bullseye [21:33:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2076.codfw.wmnet with OS bullseye [21:33:41] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2077.codfw.wmnet with OS bullseye [21:33:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bullseye [21:33:50] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2078.codfw.wmnet with OS bullseye [21:33:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2079.codfw.wmnet with OS bullseye [21:33:56] 10SRE, 10SRE-swift-storage, 
10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2079.codfw.wmnet with OS bullseye [21:34:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2080.codfw.wmnet with OS bullseye [21:34:08] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2080.codfw.wmnet with OS bullseye [21:36:42] on it [21:37:28] (03CR) 10CI reject: [V: 04-1] Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney) [21:37:59] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:38:19] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:39:02] (03PS2) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:39:12] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:39:30] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:39:35] (03PS8) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:39:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:40:22] 
(03PS3) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:40:28] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:40:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:41:06] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:44:23] (03CR) 10CI reject: [V: 04-1] Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney) [21:45:17] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lsw1-a8-codfw,lsw1-a8-codfw IPv6 with reason: testing commit confirm check in cookbook [21:45:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lsw1-a8-codfw,lsw1-a8-codfw IPv6 with reason: testing commit confirm check in cookbook [21:45:47] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage [21:45:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f8c695e1-b7e4-4ad2-a1f2-118a3a1653c9) set by cmooney@... 
[21:46:13] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:46:55] (03PS9) 10Andrew Bogott: keystone haproxy: increase server timeout for admin service to 10m [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) [21:47:38] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:47:54] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:48:03] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:48:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2076.codfw.wmnet with reason: host reimage [21:48:35] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [21:48:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2077.codfw.wmnet with reason: host reimage [21:48:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [21:48:47] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage [21:49:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2079.codfw.wmnet with reason: host reimage [21:49:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2080.codfw.wmnet with reason: host reimage [21:52:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2076.codfw.wmnet with reason: host reimage [21:52:56] (03PS4) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks 
[cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:53:04] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:53:09] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:54:03] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: 10Andrew Bogott) [21:54:16] (03PS5) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:54:27] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:54:33] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:54:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2077.codfw.wmnet with reason: host reimage [21:56:23] (03PS6) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:56:39] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:56:44] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:57:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2079.codfw.wmnet with reason: host reimage [21:58:59] (03PS7) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:59:04] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host 
sretest2003 [21:59:08] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [21:59:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2080.codfw.wmnet with reason: host reimage [21:59:47] (03PS8) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [21:59:52] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [21:59:56] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231220T2200) [22:02:22] (03CR) 10Ryan Kemper: [C: 03+2] search: new hosts need puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/982927 (owner: 10Ryan Kemper) [22:02:30] (03PS9) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:02:41] 10SRE-swift-storage, 10Data-Persistence, 10media-backups: Missing original File:Ignatyevo.jpg - https://phabricator.wikimedia.org/T353797 (10ClydeFranklin) A (lower quality) thumbnail has been found and the file has been overwritten, so this is resolved. 
[22:02:41] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:50] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:03:06] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [22:03:33] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2033.codfw.wmnet with OS bullseye [22:04:11] RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 32.31 ms [22:05:29] (03PS10) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:05:32] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:06:13] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [22:06:17] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [22:07:34] (03PS1) 10Ryan Kemper: wdqs: decom wdqs100[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/984644 (https://phabricator.wikimedia.org/T351671) [22:07:43] PROBLEM - Check systemd state on ms-be2078 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:52] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/984645 (https://phabricator.wikimedia.org/T351074) [22:08:08] (03PS11) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:08:31] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:08:47] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [22:08:59] RECOVERY - Check systemd state on ms-be2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:35] (03PS12) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:09:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:09:58] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:10:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [22:10:20] (03PS2) 10Ryan Kemper: wdqs: decom wdqs100[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/984644 (https://phabricator.wikimedia.org/T351671) [22:10:35] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984644 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [22:11:41] (03CR) 10Bking: [C: 03+1] wdqs: decom wdqs100[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/984644 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) 
[22:12:12] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: decom wdqs100[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/984644 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [22:12:41] (03PS13) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:12:46] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:13:01] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [22:13:17] PROBLEM - Host ms-be2080 is DOWN: PING CRITICAL - Packet loss = 100% [22:14:19] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [22:14:49] RECOVERY - Host ms-be2080 is UP: PING OK - Packet loss = 0%, RTA = 34.23 ms [22:15:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:16:28] !log ryankemper@cumin1002 START - Cookbook sre.hosts.decommission for hosts wdqs[1006-1008].eqiad.wmnet [22:16:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:16:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2076.codfw.wmnet with OS bullseye [22:16:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:16:40] (03CR) 10CI reject: [V: 04-1] Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney) [22:16:42] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 
10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2076.codfw.wmnet with OS bullseye completed: - ms-... [22:17:05] (03PS14) 10Cathal Mooney: Parse results of commit operations in Network Junos cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:17:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:17:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bullseye [22:17:19] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2078.codfw.wmnet with OS bullseye completed: - ms-... 
[22:17:24] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:17:29] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2003 [22:17:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:17:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2077.codfw.wmnet with OS bullseye [22:18:03] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2077.codfw.wmnet with OS bullseye completed: - ms-... [22:18:15] PROBLEM - Check systemd state on ms-be2080 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:33] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [22:18:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [22:19:19] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [22:20:56] !log ryankemper@cumin1002 START - Cookbook sre.dns.netbox [22:20:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2075'] [22:21:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:21:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2075'] [22:22:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2075.mgmt.codfw.wmnet with reboot policy FORCED [22:22:47] RECOVERY - Check systemd state on ms-be2080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:23:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:23:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2079.codfw.wmnet with OS bullseye [22:23:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jhancock@cumin2002 for host ms-be2079.codfw.wmnet with OS bullseye completed: - ms-... [22:23:49] (03PS15) 10Cathal Mooney: Add basic validation to Junos config command execution flow [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) [22:24:07] dancy, Amir1 curious ... after the rollback from wmf.10 to wmf.9 did you restart php-fpm? [22:24:24] Yes. That happens automatically. [22:24:28] okay. [22:24:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:24:42] !log ryankemper@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[1006-1008].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1002" [22:24:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2080.codfw.wmnet with OS bullseye [22:24:50] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2080.codfw.wmnet with OS bullseye completed: - ms-... 
[22:25:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2075.mgmt.codfw.wmnet with reboot policy FORCED [22:25:35] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[1006-1008].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1002" [22:25:35] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:25:36] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wdqs[1006-1008].eqiad.wmnet [22:40:59] (03PS1) 10Eevans: restbase: set production role and add config for restbase2033 [puppet] - 10https://gerrit.wikimedia.org/r/984647 (https://phabricator.wikimedia.org/T352468) [22:44:08] (03PS1) 10Ryan Kemper: wdqs: graph split hosts don't need categories [puppet] - 10https://gerrit.wikimedia.org/r/984648 [22:44:30] (03PS2) 10Ryan Kemper: wdqs: graph split hosts don't need categories [puppet] - 10https://gerrit.wikimedia.org/r/984648 (https://phabricator.wikimedia.org/T352878) [22:44:32] (03PS13) 10Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) [22:44:34] (03PS11) 10Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) [22:45:04] (03PS3) 10Ryan Kemper: wdqs: graph split hosts don't need categories [puppet] - 10https://gerrit.wikimedia.org/r/984648 (https://phabricator.wikimedia.org/T352878) [22:45:13] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984648 (https://phabricator.wikimedia.org/T352878) (owner: 10Ryan Kemper) [22:47:59] (03PS1) 10Bartosz Dziewoński: CommentFormatter: Do not add wrapper if the heading has attributes 
[extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/984495 (https://phabricator.wikimedia.org/T353489) [22:48:07] (03PS1) 10Bartosz Dziewoński: CommentFormatter: Do not add wrapper if the heading has attributes [extensions/DiscussionTools] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984496 (https://phabricator.wikimedia.org/T353489) [22:48:45] (Device rebooted) firing: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [22:53:46] (Device rebooted) resolved: Device ps1-d4-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [22:53:59] (PuppetFailure) firing: Puppet has failed on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:58:04] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on wdqs[1020-1024].eqiad.wmnet with reason: T352878 [22:58:09] T352878: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 [22:58:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on wdqs[1020-1024].eqiad.wmnet with reason: T352878 [22:59:27] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wdqs[1020-1021].eqiad.wmnet [22:59:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs[1020-1021].eqiad.wmnet [23:00:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [23:00:58] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [23:19:40] !log ryankemper@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1006 [23:19:42] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wdqs1006 [23:20:27] (03CR) 10RLazarus: [C: 03+2] Update expected RunSingleJob.php status code [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [23:23:26] ^ Oh, it takes a netmon host as the argument, not the host I was trying to decom [23:24:09] s/netmon/netbox [23:24:29] !log ryankemper@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host netbox1002 [23:24:29] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host netbox1002 [23:25:36] Never mind, it does want the actual host. I think it failed for wdqs1006 et al because their IPMI doesn't seem to be reachable [23:44:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [23:44:23] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with er... [23:46:37] 10SRE, 10ops-codfw, 10serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (10Jhancock.wm) I've put in a dispatch request with Dell to replace the CPU. I have power cycled the server and replaced the CMOS battery, which fixed one of the errors. Will update when the part arrives. 
[23:47:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [23:47:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [23:58:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10Jhancock.wm)