[00:01:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:02:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:04:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:06:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:38:49] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/971430 [00:38:55] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/971430 (owner: TrainBranchBot) [00:42:39] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/971430 (owner: TrainBranchBot) [01:16:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:28] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [02:38:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable 
- https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:51:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:45:29] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [06:42:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:42:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:43:28] (PS1) KartikMistry: Update cxserver to 2023-11-06-060744-production [deployment-charts] - https://gerrit.wikimedia.org/r/971633 (https://phabricator.wikimedia.org/T333969) [06:43:43] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:46:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:48:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:48:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:46] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:45:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:46:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:47:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:47:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:49:34] (CR) Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/971533 (https://phabricator.wikimedia.org/T347157) (owner: Urbanecm) [07:51:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:55:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:55:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:56:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:59:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:59:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:00:05] Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T0800). [08:00:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 2.816 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:35] * urbanecm steals the window [08:00:49] (CR) Urbanecm: [V: +2 C: +2] "unrelated Wikibase failure" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/971533 (https://phabricator.wikimedia.org/T347157) (owner: Urbanecm) [08:01:46] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:971533|Structured mentor list: Make "no mentees" a proper weight (T347157 T347024)]] [08:02:00] T347157: Structured mentor list: Migrate `autoAssigned` into weight - https://phabricator.wikimedia.org/T347157 [08:02:00] T347024: Returning back from the "Away" status, I don't get newcomers assigned to me - https://phabricator.wikimedia.org/T347024 [08:03:21] (PS1) Muehlenhoff: Add component/php74 for bullseye [puppet] - https://gerrit.wikimedia.org/r/971881 [08:04:33] (PS3) Kosta Harlan: [WIP] ipoid: Set an initialImport cron job [deployment-charts] - https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) [08:04:45] (CR) Kosta Harlan: [WIP] ipoid: Set an initialImport cron job (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan) [08:14:00] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:971533|Structured mentor list: Make "no mentees" a proper weight (T347157 T347024)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:14:05] T347157: Structured mentor list: Migrate `autoAssigned` into weight - https://phabricator.wikimedia.org/T347157 [08:14:05] T347024: Returning back from the "Away" status, I don't get newcomers assigned to me - https://phabricator.wikimedia.org/T347024 [08:14:13] (CR) WMDE-Fisch: "Planed deployment: Nov 22nd." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: WMDE-Fisch) [08:15:14] (SwiftObjectCountSiteDisparity) resolved: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [08:15:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2189.codfw.wmnet with OS bookworm [08:15:59] !log urbanecm@deploy2002 urbanecm: Continuing with sync [08:18:48] (CR) Filippo Giunchedi: [C: +2] alertmanager: route o11y alerts [puppet] - https://gerrit.wikimedia.org/r/971357 (owner: Filippo Giunchedi) [08:22:49] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/318/con" [puppet] - https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: Ahmon Dancy) [08:23:46] (CR) Giuseppe Lavagetto: [V: +2 C: +2] remove loki image [docker-images/production-images] - https://gerrit.wikimedia.org/r/971427 (https://phabricator.wikimedia.org/T350366) (owner: Cwhite) [08:25:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.3050594148563994s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:25:23] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:971533|Structured mentor list: Make "no mentees" a proper weight (T347157 T347024)]] (duration: 23m 37s) [08:25:29] T347157: Structured mentor list: Migrate `autoAssigned` into weight - https://phabricator.wikimedia.org/T347157 
[08:25:29] T347024: Returning back from the "Away" status, I don't get newcomers assigned to me - https://phabricator.wikimedia.org/T347024 [08:27:32] (CR) Muehlenhoff: [C: +2] Add component/php74 for bullseye [puppet] - https://gerrit.wikimedia.org/r/971881 (owner: Muehlenhoff) [08:27:57] (CR) Jelto: [V: +1 C: +2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: Ahmon Dancy) [08:28:09] SRE, Infrastructure-Foundations, User-fgiunchedi: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (fgiunchedi) The "best" solution I could come up with so far is to edit fstab after the fact, e.g. in `late_command` when we detect a `raid0` `/srv` filesystem: ` awk '{if (/UUI... [08:28:35] SRE-tools, Data-Persistence, Spicerack, Traffic, serviceops: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565 (Joe) [08:31:00] !log add +80G to prometheus/ops in eqiad [08:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:30] (PS1) Muehlenhoff: Fix MOU date [puppet] - https://gerrit.wikimedia.org/r/971883 [08:32:41] (CR) Muehlenhoff: [C: +2] Fix MOU date [puppet] - https://gerrit.wikimedia.org/r/971883 (owner: Muehlenhoff) [08:33:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2189.codfw.wmnet with reason: host reimage [08:35:58] (CR) Ayounsi: [C: +1] users: add network device access for taavi (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/970850 (https://phabricator.wikimedia.org/T350267) (owner: Majavah) [08:36:41] (PS1) Slyngshede: C:idm enable email address update. 
[puppet] - https://gerrit.wikimedia.org/r/971884 [08:36:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: host reimage [08:37:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/971884 (owner: Slyngshede) [08:38:30] (CR) Slyngshede: [C: +2] C:idm enable email address update. [puppet] - https://gerrit.wikimedia.org/r/971884 (owner: Slyngshede) [08:39:09] (PS1) Giuseppe Lavagetto: docker::builder: improvements to update-production-images [puppet] - https://gerrit.wikimedia.org/r/971885 [08:40:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.160309181024049s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:49:26] (PS1) Zabe: Initial configuration for bjnwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/971887 (https://phabricator.wikimedia.org/T350217) [08:51:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2189.codfw.wmnet with OS bookworm [08:53:10] jouncebot: nowandnext [08:53:10] For the next 0 hour(s) and 6 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T0800) [08:53:10] In 2 hour(s) and 6 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1100) [08:54:18] (PS1) Arnaudb: mariadb: hieradata to install 10.6 [puppet] - https://gerrit.wikimedia.org/r/971431 (https://phabricator.wikimedia.org/T343674) [08:54:25] (CR) Zabe: [C: +2] Initial configuration for bjnwikiquote [mediawiki-config] - 
https://gerrit.wikimedia.org/r/971887 (https://phabricator.wikimedia.org/T350217) (owner: Zabe) [08:55:07] (Merged) jenkins-bot: Initial configuration for bjnwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/971887 (https://phabricator.wikimedia.org/T350217) (owner: Zabe) [08:56:01] !log create Banjar Wikiquote # T350217 [08:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:05] T350217: Create Banjar Wikiquote - https://phabricator.wikimedia.org/T350217 [08:56:34] !log zabe@deploy2002 Started scap: T350217 [08:57:51] !log zabe@deploy2002 zabe: T350217 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:58:53] !log zabe@deploy2002 zabe: Continuing with sync [08:59:04] (PS1) Majavah: P:toolforge::mailrelay: fix root@ on toolsbeta [puppet] - https://gerrit.wikimedia.org/r/971890 [08:59:06] (PS1) Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [08:59:08] (PS1) Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/971892 [08:59:10] (PS1) Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - https://gerrit.wikimedia.org/r/971893 [08:59:12] (PS1) Majavah: P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) [09:00:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS bookworm [09:02:32] (PS1) Zabe: Initial configuration for zghwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971896 (https://phabricator.wikimedia.org/T350216) [09:02:51] (CR) CI reject: [V: -1] P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - https://gerrit.wikimedia.org/r/971891 
(https://phabricator.wikimedia.org/T341006) (owner: Majavah) [09:04:06] (CR) Zabe: [C: +2] Initial configuration for zghwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971896 (https://phabricator.wikimedia.org/T350216) (owner: Zabe) [09:04:21] !log zabe@deploy2002 Finished scap: T350217 (duration: 07m 47s) [09:04:28] T350217: Create Banjar Wikiquote - https://phabricator.wikimedia.org/T350217 [09:04:35] (PS2) Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [09:04:37] (PS2) Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - https://gerrit.wikimedia.org/r/971893 [09:04:39] (PS2) Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/971892 [09:04:41] (PS2) Majavah: P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) [09:04:52] (Merged) jenkins-bot: Initial configuration for zghwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971896 (https://phabricator.wikimedia.org/T350216) (owner: Zabe) [09:06:07] !log create Moroccan Amazigh Wikipedia # T350216 [09:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:14] T350216: Create Moroccan Amazigh Wikipedia - https://phabricator.wikimedia.org/T350216 [09:06:22] (PS3) Stevemunene: Switch druid1004 zookeeper node with druid1009 [puppet] - https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) [09:06:34] !log zabe@deploy2002 Started scap: T350216 [09:06:42] SRE, Cassandra, Data-Persistence: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 (MoritzMuehlenhoff) [09:07:28] (PS1) Giuseppe Lavagetto: cert-manager/istio: fix reference to base image 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/971898 (https://phabricator.wikimedia.org/T350366) [09:07:48] !log zabe@deploy2002 zabe: T350216 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:08:06] SRE, SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (AndrewTavis_WMDE) Thank you, @EBernhardson! There was [[ https://wikimedia.slack.com/archives/CSV483812/p1698943739549439 | a discussion on Slack ]]... [09:08:37] !log zabe@deploy2002 zabe: Continuing with sync [09:10:49] !log importing openjdk-8 8u392-ga-1~deb11u1 for bullseye-wikimedia to apt.wikimedia.org (latest Java 8 security fixes) [09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:48] (CR) Giuseppe Lavagetto: [V: +2 C: +2] cert-manager/istio: fix reference to base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/971898 (https://phabricator.wikimedia.org/T350366) (owner: Giuseppe Lavagetto) [09:12:10] (PS1) Zabe: Initial configuration for dgawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971899 (https://phabricator.wikimedia.org/T350218) [09:13:07] (CR) Zabe: [C: +2] Initial configuration for dgawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971899 (https://phabricator.wikimedia.org/T350218) (owner: Zabe) [09:13:50] !log zabe@deploy2002 Finished scap: T350216 (duration: 07m 15s) [09:13:51] (Merged) jenkins-bot: Initial configuration for dgawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971899 (https://phabricator.wikimedia.org/T350218) (owner: Zabe) [09:13:56] T350216: Create Moroccan Amazigh Wikipedia - https://phabricator.wikimedia.org/T350216 [09:15:21] !log create Dagaare Wikipedia # T350218 [09:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:24] T350218: Create Dagaare Wikipedia 
- https://phabricator.wikimedia.org/T350218 [09:15:40] !log zabe@deploy2002 Started scap: T350218 [09:16:55] !log zabe@deploy2002 zabe: T350218 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:17:33] !log zabe@deploy2002 zabe: Continuing with sync [09:18:24] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2190.codfw.wmnet with reason: host reimage [09:19:37] (PS1) Zabe: Initial configuration for bbcwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971901 (https://phabricator.wikimedia.org/T350320) [09:21:02] (CR) Zabe: [C: +2] Initial configuration for bbcwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971901 (https://phabricator.wikimedia.org/T350320) (owner: Zabe) [09:21:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: host reimage [09:21:50] (Merged) jenkins-bot: Initial configuration for bbcwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/971901 (https://phabricator.wikimedia.org/T350320) (owner: Zabe) [09:22:44] !log zabe@deploy2002 Finished scap: T350218 (duration: 07m 04s) [09:22:47] SRE, Bitu, Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (SLyngshede-WMF) [09:22:48] T350218: Create Dagaare Wikipedia - https://phabricator.wikimedia.org/T350218 [09:24:05] !log Toba Batak Wikipedia # T350320 [09:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:09] T350320: Create Toba Batak Wikipedia - https://phabricator.wikimedia.org/T350320 [09:24:35] !log installing openjdk-11 security updates [09:24:36] !log zabe@deploy2002 Started scap: T350320 [09:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:52] !log zabe@deploy2002 zabe: T350320 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:26:02] !log zabe@deploy2002 zabe: 
Continuing with sync [09:31:05] !log zabe@deploy2002 Finished scap: T350320 (duration: 06m 28s) [09:31:11] T350320: Create Toba Batak Wikipedia - https://phabricator.wikimedia.org/T350320 [09:32:27] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/971432 [09:32:29] (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/971432 (owner: Zabe) [09:33:09] !log zabe@deploy2002 Started scap: update interwiki cache [09:33:14] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/971432 (owner: Zabe) [09:35:02] (PS4) MdsShakil: Add autopatrol to Wikifunctions Staff group [mediawiki-config] - https://gerrit.wikimedia.org/r/970755 (https://phabricator.wikimedia.org/T350028) [09:35:13] !log installing Tomcat security updates [09:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2190.codfw.wmnet with OS bookworm [09:39:31] !log zabe@deploy2002 Finished scap: update interwiki cache (duration: 06m 21s) [09:43:12] (PS1) Fabfur: haproxy: enable healthcheck-dedicated backend in eqsin [puppet] - https://gerrit.wikimedia.org/r/971907 (https://phabricator.wikimedia.org/T348851) [09:48:10] (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/971907 (https://phabricator.wikimedia.org/T348851) (owner: Fabfur) [09:48:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2191.codfw.wmnet with OS bookworm [09:56:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: provisionning db1236.eqiad.wmnet - T344036 [09:56:27] T344036: Productionize db12[26-49] - 
https://phabricator.wikimedia.org/T344036 [09:56:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: provisionning db1236.eqiad.wmnet - T344036 [09:56:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: provisionning db1236.eqiad.wmnet - T344036 [09:57:04] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: provisionning db1236.eqiad.wmnet - T344036 [09:59:24] (CR) Vgutierrez: [C: +1] haproxy: enable healthcheck-dedicated backend in eqsin [puppet] - https://gerrit.wikimedia.org/r/971907 (https://phabricator.wikimedia.org/T348851) (owner: Fabfur) [09:59:53] (PS1) Jbond: cloud - hiera@ add missing key/value [puppet] - https://gerrit.wikimedia.org/r/971909 [10:00:26] (CR) CI reject: [V: -1] cloud - hiera@ add missing key/value [puppet] - https://gerrit.wikimedia.org/r/971909 (owner: Jbond) [10:02:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1136 in db1236 for T344036', diff saved to https://phabricator.wikimedia.org/P53139 and previous config saved to /var/cache/conftool/dbconfig/20231106-100213-arnaudb.json [10:02:18] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:03:06] (CR) Fabfur: [V: +1 C: +2] haproxy: enable healthcheck-dedicated backend in eqsin [puppet] - https://gerrit.wikimedia.org/r/971907 (https://phabricator.wikimedia.org/T348851) (owner: Fabfur) [10:05:15] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1136.eqiad.wmnet onto db1236.eqiad.wmnet [10:06:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2191.codfw.wmnet with reason: host reimage [10:06:15] (PS2) JMeybohm: Build php7.4 images with icu67 packages [docker-images/production-images] - https://gerrit.wikimedia.org/r/971223 
(https://phabricator.wikimedia.org/T345561) [10:06:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1136 in db1236 for T344036', diff saved to https://phabricator.wikimedia.org/P53140 and previous config saved to /var/cache/conftool/dbconfig/20231106-100625-arnaudb.json [10:09:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2191.codfw.wmnet with reason: host reimage [10:17:04] (Abandoned) Brouberol: Hide skein private key diff in puppet logs [puppet] - https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol) [10:24:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2191.codfw.wmnet with OS bookworm [10:25:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:25:23] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2192.codfw.wmnet with OS bookworm [10:32:48] PROBLEM - Check systemd state on ganeti3007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:40] (PS1) Slyngshede: C:prometheus::ethtool_exporter add -s option [puppet] - https://gerrit.wikimedia.org/r/971913 [10:41:41] (CR) Btullis: [C: +1] mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) (owner: Jbond) [10:43:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2192.codfw.wmnet with reason: host reimage [10:46:19] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2192.codfw.wmnet with reason: host reimage [10:46:38] 
(CR) Ladsgroup: [C: +1] "It looks correct to me but not sure when/how it should be merged and it could cause averse affect on dbs." [puppet] - https://gerrit.wikimedia.org/r/971431 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:50:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:50:13] !log Restarting Jenkins [10:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:56:49] SRE, Infrastructure-Foundations, serviceops-radar, Patch-For-Review, Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (BTullis) >>! In T340741#9207405, @jbond wrote: > @BTullis Im not sure if you are the right person... [10:57:31] (CR) Btullis: [C: +1] "Looks good to me, with one whitespace nit." [puppet] - https://gerrit.wikimedia.org/r/971196 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1100) [11:00:21] SRE, Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (LSobanski) Would operations/software/knead-wikidough and operations/software/liberica fit here as well? 
[11:01:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2192.codfw.wmnet with OS bookworm [11:02:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2193.codfw.wmnet with OS bookworm [11:09:41] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:10:55] (PS1) Muehlenhoff: Update PHP hook to use Bullseye [puppet] - https://gerrit.wikimedia.org/r/971921 [11:12:12] (CR) Btullis: [V: +1 C: +2] Create a new role for analytics_cluster::mariadb and assign it [puppet] - https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: Btullis) [11:15:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:15:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:52] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-mariadb[1001-1002].eqiad.wmnet with reason: Commissioning new database servers [11:17:06] !log btullis@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-mariadb[1001-1002].eqiad.wmnet with reason: Commissioning new database servers [11:17:08] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2010.codfw.wmnet with OS bookworm [11:17:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:17:45] (03CR) 10Jbond: [C: 03+1] mariadb::ferm: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971452 (owner: 10Muehlenhoff) [11:17:55] (03PS1) 10Fabfur: haproxy: enable healthcheck-dedicated backend in codfw [puppet] - 10https://gerrit.wikimedia.org/r/971922 (https://phabricator.wikimedia.org/T348851) [11:18:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:18:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:36] (03PS2) 10Jbond: cloud - hiera@ add missing key/value [puppet] - 10https://gerrit.wikimedia.org/r/971909 [11:20:24] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2193.codfw.wmnet with reason: host reimage [11:21:11] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/971922 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:21:52] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): 
expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) @BTullis thanks, i think morits plans to work with DE to migrate some canary services/host... [11:23:08] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: host reimage [11:24:58] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [11:25:48] (03CR) 10Jbond: [C: 03+2] acmechief_host: drop this value as a global [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:28:51] (03CR) 10Slyngshede: "We want to collect information about NICs being down, but we want to avoid collecting a lot of zero data about traffic." [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [11:29:47] 10SRE-swift-storage, 10Thumbor: rendering of images in high res sometimes fails and fails permanently - https://phabricator.wikimedia.org/T350548 (10MatthewVernon) [I'm taking the swift tag off, as this seems to me to be straightforwardly a content/thumbor issue rather than a swift problem] [11:30:05] 10SRE, 10Thumbor: rendering of images in high res sometimes fails and fails permanently - https://phabricator.wikimedia.org/T350548 (10MatthewVernon) [11:30:17] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable Reference Previews on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 (https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [11:31:03] RECOVERY - Check systemd state on ganeti3007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:19] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2193.codfw.wmnet with OS bookworm 
[11:40:41] !log installing openssl bugfix updates on Bullseye (update to 1.1.1w) [11:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] (03CR) 10Jbond: [C: 03+2] cloud - hiera@ add missing key/value [puppet] - 10https://gerrit.wikimedia.org/r/971909 (owner: 10Jbond) [11:42:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Build php7.4 images with icu67 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971223 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [11:48:58] (03CR) 10Muehlenhoff: [C: 03+2] grafana: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971458 (owner: 10Muehlenhoff) [11:49:49] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:50:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:51:39] (03PS1) 10Jbond: gerrit: use prod known hosts [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/971924 [11:53:11] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:55:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:55:31] (03PS1) 10Jbond: sre.puppet-migrate-*: add acmechief_host config [cookbooks] - 10https://gerrit.wikimedia.org/r/971925 [11:56:45] (03CR) 10Volans: [C: 03+1] "LGTM if that's needed :)" [cookbooks] - 
10https://gerrit.wikimedia.org/r/971925 (owner: 10Jbond) [11:59:18] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2010.codfw.wmnet with reason: host reimage [12:01:07] (03PS2) 10FNegri: P:openstack:codfw1dev enable prom exporter [puppet] - 10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) [12:02:41] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2010.codfw.wmnet with reason: host reimage [12:03:49] (03PS1) 10Jbond: httpd: use a variable for service_name [puppet] - 10https://gerrit.wikimedia.org/r/971927 [12:03:51] (03PS1) 10Jbond: pki: notify apache when the cert is refreshed [puppet] - 10https://gerrit.wikimedia.org/r/971928 [12:05:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/322/con" [puppet] - 10https://gerrit.wikimedia.org/r/971928 (owner: 10Jbond) [12:05:06] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/321/con" [puppet] - 10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [12:06:14] (03CR) 10FNegri: P:openstack:codfw1dev enable prom exporter [puppet] - 10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [12:10:03] (03CR) 10Jbond: [C: 03+2] sre.puppet-migrate-*: add acmechief_host config [cookbooks] - 10https://gerrit.wikimedia.org/r/971925 (owner: 10Jbond) [12:10:14] (03PS2) 10Muehlenhoff: pki: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/971118 [12:10:29] (03CR) 10CI reject: [V: 04-1] pki: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/971118 (owner: 10Muehlenhoff) [12:10:51] (03PS3) 10Muehlenhoff: pki: Switch to firewall::service [puppet] - 
10https://gerrit.wikimedia.org/r/971118 [12:12:05] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:14:10] !log installing jetty9 security updates [12:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:14] (03Merged) 10jenkins-bot: sre.puppet-migrate-*: add acmechief_host config [cookbooks] - 10https://gerrit.wikimedia.org/r/971925 (owner: 10Jbond) [12:20:22] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2010.codfw.wmnet with OS bookworm [12:25:35] (03PS1) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [12:26:03] (03CR) 10Jbond: [C: 03+2] httpd: use a variable for service_name [puppet] - 10https://gerrit.wikimedia.org/r/971927 (owner: 10Jbond) [12:26:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki: notify apache when the cert is refreshed [puppet] - 10https://gerrit.wikimedia.org/r/971928 (owner: 10Jbond) [12:28:00] 10SRE, 10Infrastructure-Foundations, 10netops: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 (10cmooney) p:05Triage→03Medium [12:30:50] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: change how AQS URLs enforce wikimedia.org domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/971456 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [12:31:50] (03Merged) 10jenkins-bot: rest-gateway: change how AQS URLs enforce wikimedia.org domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/971456 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [12:32:30] (03PS1) 10Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) [12:33:47] (03PS2) 
10Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) [12:35:11] (03PS3) 10Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) [12:35:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971118 (owner: 10Muehlenhoff) [12:36:35] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: idm_test [12:37:55] (03PS1) 10Jbond: idm_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971938 [12:38:05] (03CR) 10Jbond: [C: 03+2] idm_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971938 (owner: 10Jbond) [12:42:17] (03PS1) 10Muehlenhoff: Remove unused profile::pki::server [puppet] - 10https://gerrit.wikimedia.org/r/971939 [12:42:19] (03PS1) 10Hnowlan: rest-gateway: fix regexes [deployment-charts] - 10https://gerrit.wikimedia.org/r/971940 (https://phabricator.wikimedia.org/T348731) [12:42:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: idm_test [12:43:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971939 (owner: 10Muehlenhoff) [12:44:31] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: fix regexes [deployment-charts] - 10https://gerrit.wikimedia.org/r/971940 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [12:44:45] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: idm [12:45:52] (03Merged) 10jenkins-bot: rest-gateway: fix regexes [deployment-charts] - 10https://gerrit.wikimedia.org/r/971940 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [12:46:23] (03PS1) 10Jbond: idm: convert to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971941 [12:46:59] (03PS1) 10Btullis: Configure the 
new mariadb servers to be replicas [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) [12:47:11] (03CR) 10Jbond: [C: 03+2] idm: convert to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971941 (owner: 10Jbond) [12:47:26] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:48:03] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:48:14] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:50:32] (03CR) 10Ayounsi: "Nice! Are you going to send the changes upstream?" [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [12:50:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused profile::pki::server [puppet] - 10https://gerrit.wikimedia.org/r/971939 (owner: 10Muehlenhoff) [12:51:43] (03CR) 10Slyngshede: C:prometheus::ethtool_exporter add -s option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [12:52:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: idm [12:55:55] (03Abandoned) 10Muehlenhoff: pki: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/971118 (owner: 10Muehlenhoff) [12:57:40] (03PS2) 10Btullis: Configure the new mariadb servers to be replicas [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) [12:59:12] (03PS3) 10Brouberol: Renew skein certificate every month via systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/971196 (https://phabricator.wikimedia.org/T329398) [12:59:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/971486 (owner: 10EoghanGaffney) [12:59:35] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:59:49] 
(03PS4) 10Brouberol: Renew skein certificate every month via systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/971196 (https://phabricator.wikimedia.org/T329398) [13:00:15] (03PS1) 10Giuseppe Lavagetto: mediawiki: add remote-dc mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/971943 (https://phabricator.wikimedia.org/T342201) [13:01:14] 10SRE, 10Growth-Team, 10MW-on-K8s, 10MediaWiki-Platform-Team, and 6 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10Joe) When `mcrouter-primary-dc` is sele... [13:02:09] (03PS1) 10Hnowlan: rest-gateway: correct match section for routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/971944 (https://phabricator.wikimedia.org/T348731) [13:02:14] <_joe_> jouncebot: nowandnext [13:02:14] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [13:02:14] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1400) [13:03:07] (03CR) 10Brouberol: [C: 03+2] Renew skein certificate every month via systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/971196 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:05:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:05:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bookworm [13:06:09] (03CR) 10JMeybohm: [C: 03+1] mediawiki: add remote-dc mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/971943 (https://phabricator.wikimedia.org/T342201) (owner: 10Giuseppe Lavagetto) [13:06:59] 10SRE, 10Infrastructure-Foundations: 
Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [13:10:04] (03CR) 10Filippo Giunchedi: C:prometheus::ethtool_exporter add -s option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [13:10:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:10:19] (03CR) 10Filippo Giunchedi: "LGTM, a suggestion in line" [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [13:10:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1136.eqiad.wmnet onto db1236.eqiad.wmnet [13:10:49] (03PS1) 10Jbond: prometheus-puppet-agent-stats: this timer somtimes fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 [13:12:15] (03PS2) 10Slyngshede: C:prometheus::ethtool_exporter add -s option [puppet] - 10https://gerrit.wikimedia.org/r/971913 [13:12:39] (03PS3) 10Slyngshede: C:prometheus::ethtool_exporter add -skip-no-link option [puppet] - 10https://gerrit.wikimedia.org/r/971913 [13:12:55] (03CR) 10Slyngshede: C:prometheus::ethtool_exporter add -skip-no-link option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [13:13:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "Neat! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [13:13:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add remote-dc mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/971943 (https://phabricator.wikimedia.org/T342201) (owner: 10Giuseppe Lavagetto) [13:13:37] (03CR) 10Slyngshede: [C: 03+2] C:prometheus::ethtool_exporter add -skip-no-link option [puppet] - 10https://gerrit.wikimedia.org/r/971913 (owner: 10Slyngshede) [13:13:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10ayounsi) For cross-sites router to router we use the TTL value to eventually take down the session if the BGP session takes a too long path, it's cl... [13:14:25] (03Merged) 10jenkins-bot: mediawiki: add remote-dc mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/971943 (https://phabricator.wikimedia.org/T342201) (owner: 10Giuseppe Lavagetto) [13:14:41] (03CR) 10Ayounsi: [C: 03+1] "As mentioned in the task, lgtm, but make sure BFD re-establishes correctly once this is applied." 
[homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [13:15:16] (03CR) 10Ayounsi: [C: 03+1] Remove specific TTL values from server BGP groups (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [13:15:27] (03PS1) 10Brouberol: Enable monthly skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/971947 (https://phabricator.wikimedia.org/T329398) [13:15:37] (03CR) 10Ayounsi: [C: 03+1] Change Bird multihop command to use default system TTL [puppet] - 10https://gerrit.wikimedia.org/r/971490 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [13:15:58] (03CR) 10Majavah: [C: 03+1] P:openstack:codfw1dev enable prom exporter [puppet] - 10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [13:17:27] (03PS1) 10Giuseppe Lavagetto: mw-debug fix mcrouter routes there as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/971948 (https://phabricator.wikimedia.org/T342201) [13:18:12] (03PS2) 10Giuseppe Lavagetto: mw-debug fix mcrouter routes there as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/971948 (https://phabricator.wikimedia.org/T342201) [13:18:34] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mw-debug fix mcrouter routes there as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/971948 (https://phabricator.wikimedia.org/T342201) (owner: 10Giuseppe Lavagetto) [13:20:20] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [13:20:39] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:20:52] (03PS2) 10Brouberol: Enable monthly skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/971947 
(https://phabricator.wikimedia.org/T329398) [13:21:05] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:21:22] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:21:30] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:22:19] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/324/console" [puppet] - 10https://gerrit.wikimedia.org/r/971947 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:23:54] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [13:25:17] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/325/console" [puppet] - 10https://gerrit.wikimedia.org/r/971947 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:26:04] (03PS3) 10Brouberol: Enable monthly skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/971947 (https://phabricator.wikimedia.org/T329398) [13:26:06] (03CR) 10Stevemunene: [C: 03+2] Switch druid1004 zookeeper node with druid1009 [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [13:26:08] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: correct match section for routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/971944 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [13:26:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [13:26:29] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:26:51] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply 
[13:26:59] (03Merged) 10jenkins-bot: rest-gateway: correct match section for routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/971944 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [13:27:11] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/971947 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:27:17] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:27:18] (03CR) 10Brouberol: [C: 03+2] Enable monthly skein certificate renewal on airflow launchers [puppet] - 10https://gerrit.wikimedia.org/r/971947 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:27:25] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:27:45] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:28:08] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:28:20] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:28:28] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:28:39] PROBLEM - Zookeeper Server on druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [13:29:35] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:29:43] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:29:56] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:30:20] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:34:07] (03CR) 10Ayounsi: "The attack surface is indeed very tiny, only for anycast public IP hosts peering with the core routers, need to guess the port. 
And the at" [homer/public] - 10https://gerrit.wikimedia.org/r/971498 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [13:34:42] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/971952 [13:34:54] (03PS1) 10Kosta Harlan: CheckUser: Set 'debug' log level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) [13:35:33] (03PS2) 10Kosta Harlan: CheckUser: Set 'debug' log level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) [13:37:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [13:39:20] (03CR) 10Dreamy Jazz: [C: 04-1] CheckUser: Set 'debug' log level (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [13:40:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bookworm [13:41:52] (03PS3) 10Kosta Harlan: CheckUser: Set 'debug' log level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) [13:42:04] (03CR) 10Jbond: prometheus-puppet-agent-stats: this timer somtimes fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [13:42:23] (03CR) 10Kosta Harlan: CheckUser: Set 'debug' log level (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [13:45:07] !log asw2-c-eqiad> request system power-off member 8 - T349798 [13:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:11] T349798: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 [13:45:39] PROBLEM - Check systemd state on sretest1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ethtool-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:48:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:49:29] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: cluster::management [13:51:17] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [13:51:24] (03PS1) 10Jbond: cluster::management: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971954 (https://phabricator.wikimedia.org/T349619) [13:51:39] (03CR) 10Btullis: C:bigtop::hadoop switch to new topology script. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [13:52:26] (03CR) 10Jbond: [C: 03+2] cluster::management: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971954 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:52:43] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 16 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [13:53:36] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2195.codfw.wmnet with OS bookworm [13:53:50] (03PS1) 10Bartosz Dziewoński: Restore OOUI dialog styles for compatibility [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971539 (https://phabricator.wikimedia.org/T350544) [13:54:18] (03PS2) 10Jforrester: Restore OOUI dialog styles for compatibility [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971539 (https://phabricator.wikimedia.org/T350544) (owner: 10Bartosz Dziewoński) [13:55:43] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 
10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10ayounsi) a:03Jclark-ctr [13:57:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cluster::management [13:57:40] !log stevemunene@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [13:59:06] Hi, would it be possible to ask for a backport of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/971955 in the next window? [13:59:32] I see that that window is already pretty full, and this is not just a config change (though still pretty simple) [13:59:56] oh, the window starts in 1 minute 😅 [14:00:00] PROBLEM - aqs endpoints health on aqs1016 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1400). [14:00:05] MatmaRex, MdsShakil, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:06] PROBLEM - aqs endpoints health on aqs2002 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:15] i can deploy today [14:00:16] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10ayounsi) Switch has been removed from config and powered off. All yours to do the remaining steps. I think https://netbox.wikimedia.org/dcim/cables/5708/ are 40G optics,... [14:00:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:00:32] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:46] PROBLEM - aqs endpoints health on aqs2012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:52] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned 
the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:55] MatmaRex: Dreamy_Jazz: hi, around? [14:00:57] \o [14:01:01] hi [14:01:12] PROBLEM - aqs endpoints health on aqs2009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:28] PROBLEM - aqs endpoints health on aqs1018 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:30] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:56] PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:02:10] PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test 
Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:02:14] PROBLEM - aqs endpoints health on aqs2010 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:02:16] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:10] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:14] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:14] PROBLEM - aqs endpoints health on aqs2009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:30] RECOVERY - aqs endpoints health on 
aqs1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:30] (03CR) 10Urbanecm: [C: 04-1] CheckUser: Set 'debug' log level (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [14:03:32] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:32] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:50] RECOVERY - aqs endpoints health on aqs2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:59] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10ayounsi) [14:04:02] Dreamy_Jazz: reviewed the patch. can deploy if you want me to, but do you really want to have debug logs in logstash only? [14:04:05] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) 05Resolved→03Open When working on something else I noticed that those were still in Netbox: htt... 
[14:04:06] RECOVERY - aqs endpoints health on aqs1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:12] RECOVERY - aqs endpoints health on aqs2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:14] RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:18] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:18] RECOVERY - aqs endpoints health on aqs2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:18] RECOVERY - aqs endpoints health on aqs2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:32] also let's wait for the alerts to clear before deploying. [14:04:47] The idea is that I could inspect logstash and then revert that change once this was checked. [14:05:00] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:05:00] RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:05:12] Dreamy_Jazz: so it's a temp change? [14:05:17] Yep [14:05:20] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:05:46] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:50] then i guess it doesn't matter much. 
[14:05:53] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10ayounsi) [14:06:18] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:28] (03CR) 10Urbanecm: [C: 03+1] "per IRC:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [14:06:36] (03CR) 10Urbanecm: [C: 03+2] Restore OOUI dialog styles for compatibility [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971539 (https://phabricator.wikimedia.org/T350544) (owner: 10Bartosz Dziewoński) [14:06:40] Sure. As long as I can see the debug log entries in logstash, then it would be good. The reason for this is that I don't think we can make the verbose log work on jobs. [14:06:42] (03CR) 10Urbanecm: [C: 03+2] CheckUser: Set 'debug' log level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [14:06:54] *verbose log setting [14:07:18] via the WikimediaDebug extension [14:07:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [14:07:52] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:57] makes sense. i'll just sync that w/o mwdebug. 
[14:07:59] (03Merged) 10jenkins-bot: CheckUser: Set 'debug' log level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971953 (https://phabricator.wikimedia.org/T345591) (owner: 10Kosta Harlan) [14:08:06] 👍 [14:08:12] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:971953|CheckUser: Set 'debug' log level (T345591)]] [14:08:18] T345591: Stop deletion of rows in the cu_useragent_clienthints table - https://phabricator.wikimedia.org/T345591 [14:09:02] Jhs: is that patch in a rush or something? [14:09:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:09:36] urbanecm, i need it before i can start importing zghwiki [14:09:47] i guess that's a yes then :D [14:09:52] !log volans@cumin1001 START - Cookbook sre.dns.netbox [14:09:59] !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [14:10:24] (03CR) 10Urbanecm: [C: 03+2] Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 (owner: 10Bartosz Dziewoński) [14:10:36] (03CR) 10Urbanecm: [C: 03+2] Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [14:10:59] oh, crossing fingers that that one ^ will fix the login issues i've been having these past few days 🤞 [14:11:16] hopefully :) [14:11:32] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:33] (03Merged) 10jenkins-bot: Don't remove current wiki family from 
$wgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 (owner: 10Bartosz Dziewoński) [14:11:40] PROBLEM - aqs endpoints health on aqs2004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:47] (03PS4) 10Urbanecm: Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [14:11:48] prrrrobably not. but it depends on the kind of issues [14:11:50] PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:51] (03CR) 10Urbanecm: Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [14:11:54] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:54] (03CR) 10Urbanecm: [C: 03+2] Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [14:11:54] PROBLEM - aqs endpoints health on aqs2009 is CRITICAL: 
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:56] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:00] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [14:12:08] PROBLEM - aqs endpoints health on aqs1018 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:09] (03PS1) 10Jbond: Revert "cluster::management: migrate to puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/971540 [14:12:10] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:18] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is 
CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:18] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:18] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:36] PROBLEM - aqs endpoints health on aqs2012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:44] PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:47] 10SRE, 10serviceops-radar, 10SRE Observability (FY2023/2024-Q2), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [14:12:47] what is icinga-wm spamming about? 
[14:12:48] (03Merged) 10jenkins-bot: Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [14:13:22] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:13:28] PROBLEM - aqs endpoints health on aqs2003 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:13:46] RECOVERY - aqs endpoints health on aqs2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:08] RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:14] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:14] RECOVERY - aqs endpoints health on aqs2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:14] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:21] (03CR) 10FNegri: [C: 03+2] P:openstack:codfw1dev enable prom exporter [puppet] - 10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [14:14:28] RECOVERY - aqs endpoints health on aqs1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:30] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:32] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:39] (03PS1) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) [14:14:52] (03CR) 10Jbond: [C: 03+2] Revert "cluster::management: migrate to puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/971540 (owner: 10Jbond) [14:15:01] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:15:01] RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:15:09] RECOVERY - aqs endpoints health on aqs2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:15:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [14:15:55] MatmaRex: according to -analytics, they're doing a rolling restart [14:16:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:16:07] (03PS5) 10Ayounsi: Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 [14:16:09] (03PS8) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [14:16:10] stevemunene: ^^ to confirm. 
[14:16:11] (03PS5) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 [14:16:13] (03PS4) 10Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 [14:16:15] (03PS4) 10Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) [14:16:16] so why is it alerting about it? [14:16:17] (03PS1) 10Ayounsi: Remove "Import Interfaces from a JSON blob" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 [14:16:21] PROBLEM - aqs endpoints health on aqs1020 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:17:05] (03CR) 10CI reject: [V: 04-1] Remove "Import Interfaces from a JSON blob" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 (owner: 10Ayounsi) [14:17:09] it's really quite disruptive when it's going on like that [14:17:23] RECOVERY - aqs endpoints health on aqs1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:17:27] urbanecm: roll restart of the druid-public cluster, this should not affect aqs this much [14:17:27] but i don't want to ignore it in case it alerts about something wrong with my deployments in a minute [14:18:22] stevemunene: is it possible to pause with that for a while, so that the alerts don't appear? it's a MW deployment window time, and the alerts make it hard to see whether there's a possibly deployment-related alert. 
i don't want to get in the habit of ignoring icinga screaming :)) [14:18:25] PROBLEM - aqs endpoints health on aqs2010 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:19:25] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:19:27] RECOVERY - aqs endpoints health on aqs2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:20:01] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1010.eqiad.wmnet, druid1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:20:30] sure urbanecm stopping the restart and any further changes to druid-public for the time cc btullis [14:20:39] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1010.eqiad.wmnet, druid1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:20:52] ty stevemunene . i'll ping you when i'm done with the window! [14:20:58] !log stevemunene@cumin1001 END (ERROR) - Cookbook sre.druid.roll-restart-workers (exit_code=97) for Druid public cluster: Roll restart of Druid jvm daemons. 
[14:21:10] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2195.codfw.wmnet with OS bookworm [14:21:29] np urbanecm sorry for the troubles [14:21:35] (03CR) 10CI reject: [V: 04-1] puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:21:41] (03PS1) 10Ayounsi: sre.hosts.reimage: use the new ImportPuppetDB path [cookbooks] - 10https://gerrit.wikimedia.org/r/971959 [14:22:33] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:971953|CheckUser: Set 'debug' log level (T345591)]] (duration: 14m 20s) [14:22:37] T345591: Stop deletion of rows in the cu_useragent_clienthints table - https://phabricator.wikimedia.org/T345591 [14:22:40] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:22:41] Dreamy_Jazz: deployed [14:22:50] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:967295|Don't remove current wiki family from $wgCentralAuthAutoLoginWikis]], [[gerrit:967977|Clean up $wgCentralAuthAutoLoginWikis configuration]] [14:22:51] MatmaRex: going on the first two config patches from you [14:22:52] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:23:00] (03CR) 10Ayounsi: Split interface_automation into multiple files (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 (owner: 10Ayounsi) [14:23:06] urbanecm: thanks [14:23:19] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:23:36] urbanecm: if we have time after the deployments, could you check on the maintenance scripts in https://phabricator.wikimedia.org/T315510 (i hope the last two finished, enwiki probably still ongoing) [14:23:36] and restart whichever ones are still running (with --from as needed)? 
subbu asked for it: https://wikimedia.slack.com/archives/C024Z8K9CAU/p1698978381393149?thread_ts=1698952211.675729&cid=C024Z8K9CAU [14:23:36] apparently they've been running for so long, they're using outdated configuration now and accessing the wrong caches [14:24:05] !log urbanecm@deploy2002 matmarex and urbanecm: Backport for [[gerrit:967295|Don't remove current wiki family from $wgCentralAuthAutoLoginWikis]], [[gerrit:967977|Clean up $wgCentralAuthAutoLoginWikis configuration]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:24:07] (errrr, with --start) [14:24:24] (03PS2) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [14:24:35] testing these on mwdebug now [14:24:47] PROBLEM - Check systemd state on mw2400 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:03] (03PS2) 10Ayounsi: Remove "Import Interfaces from a JSON blob" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 [14:25:14] (03CR) 10Volans: [C: 03+1] "LGTM beside CI" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 (owner: 10Ayounsi) [14:25:18] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bookworm [14:25:22] !log rolling upgrade of HAProxy to version 2.6.15-1~bpo11+1 in eqsin [14:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:31] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:25:45] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Build php7.4 images with icu67 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971223 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [14:26:00] 
(03CR) 10jenkins-bot: Remove "Import Interfaces from a JSON blob" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 (owner: 10Ayounsi) [14:26:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney) 05Open→03Resolved >>! In T350488#9308379, @ayounsi wrote: > For cross-sites router to router we use the TTL value to eventually take do... [14:26:02] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqsin and A:cp [14:26:50] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:27:02] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:27:16] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:27:24] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:27:29] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: aux_k8s::worker [14:27:30] (03CR) 10Vgutierrez: [C: 03+1] haproxy: enable healthcheck-dedicated backend in codfw [puppet] - 10https://gerrit.wikimedia.org/r/971922 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [14:27:33] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:27:48] (03Merged) 10jenkins-bot: Restore OOUI dialog styles for compatibility [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971539 (https://phabricator.wikimedia.org/T350544) (owner: 10Bartosz Dziewoński) [14:27:49] urbanecm: patches look good [14:27:52] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney) 05Resolved→03Open Eh not sure how I accidentally set this to resolved! 
[14:28:10] (03PS3) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [14:28:43] (03PS4) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [14:28:58] thanks, proceeding [14:28:58] (03PS1) 10Jbond: aux_k8s::worker: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971960 (https://phabricator.wikimedia.org/T349619) [14:29:00] !log urbanecm@deploy2002 matmarex and urbanecm: Continuing with sync [14:29:16] (03PS12) 10Urbanecm: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [14:29:19] (03CR) 10Urbanecm: [C: 03+2] Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [14:29:31] (03CR) 10Jbond: [C: 03+2] aux_k8s::worker: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971960 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:30:24] MatmaRex: [14:30:34] MatmaRex: this is what i see re scripts https://www.irccloud.com/pastebin/ox75X27M/ [14:30:45] (03CR) 10Volans: [C: 04-1] "I think we might need some additional logic" [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:31:05] can't do much more since they're running under Tyler's account. 
I can probably stop them and start again if you want me to [14:31:51] hmm [14:32:16] (03Merged) 10jenkins-bot: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [14:32:50] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:35] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: aux_k8s::worker [14:33:37] so s7 has finished/exited? and the s1 one too? [14:33:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10Volans) @jbond I think that the decommission cookbook needs some adjustment too, both because it checks some git checkout on the... [14:33:50] err, s6 has [14:33:51] (03CR) 10JMeybohm: [C: 03+2] Update termbox to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967412 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:33:53] s7 hasn't [14:34:24] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:967295|Don't remove current wiki family from $wgCentralAuthAutoLoginWikis]], [[gerrit:967977|Clean up $wgCentralAuthAutoLoginWikis configuration]] (duration: 11m 34s) [14:34:32] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: aux_k8s::master [14:34:42] (03Merged) 10jenkins-bot: Update termbox to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967412 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:35:16] 10SRE, 10Growth-Team, 10MW-on-K8s, 10MediaWiki-Platform-Team, and 5 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - 
https://phabricator.wikimedia.org/T342201 (10Joe) Since my deployment of this change... [14:35:25] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:966798|Generalize Meta/Commons exceptions for CentralAuth cookie handling (T257852)]], [[gerrit:971539|Restore OOUI dialog styles for compatibility (T350544)]] [14:35:32] (03PS1) 10Jbond: aux_k8s::master: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971961 (https://phabricator.wikimedia.org/T349619) [14:35:41] T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852 [14:35:41] T350544: OOUI widgets now have huge fonts and misaligned buttons in some places - https://phabricator.wikimedia.org/T350544 [14:36:03] MatmaRex: look so. i can see only rowiki running, which is s7 [14:36:15] proceeding with backport+last config [14:36:16] (03PS3) 10BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) [14:36:25] (03CR) 10Jbond: [C: 03+2] aux_k8s::master: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971961 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:36:41] !log urbanecm@deploy2002 urbanecm and tgr and matmarex: Backport for [[gerrit:966798|Generalize Meta/Commons exceptions for CentralAuth cookie handling (T257852)]], [[gerrit:971539|Restore OOUI dialog styles for compatibility (T350544)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:50] MatmaRex: please check [14:36:57] (03CR) 10CI reject: [V: 04-1] Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [14:37:00] looking [14:37:20] PROBLEM - Check systemd state on ganeti1028 is CRITICAL: CRITICAL - degraded: The following units 
failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:46] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [14:37:56] (03CR) 10Cathal Mooney: Block incoming packets on the edge for CR loopbacks on TCP 179 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/971498 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [14:37:59] Vector/OOUI change looks good, testing the other [14:38:11] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:20] RECOVERY - aqs endpoints health on aqs2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:38:27] (03Abandoned) 10Cathal Mooney: Block incoming packets on the edge for CR loopbacks on TCP 179 [homer/public] - 10https://gerrit.wikimedia.org/r/971498 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [14:38:51] (03PS4) 10BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) [14:39:30] (03CR) 10CI reject: [V: 04-1] Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [14:39:31] Jhs: i can do your patch later in the evening. not trying my luck with MW Core CI _and_ an i18n rebuild. 
[14:39:38] with ~20 mins left [14:40:14] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: aux_k8s::master [14:40:40] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::aux_k8s_etcd [14:40:45] (03CR) 10Cathal Mooney: Remove specific TTL values from server BGP groups (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [14:40:47] (03CR) 10Filippo Giunchedi: "I'd rather make sure this gets invoked after a puppet run, i.e. set this timer as after=puppet-agent-timer, what do you think ?" [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [14:41:01] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [14:41:16] (03CR) 10Ssingh: [C: 03+1] "I guess we will be careful in rolling this out again as always! Let me know when we plan to and happy to take care of it." 
[puppet] - 10https://gerrit.wikimedia.org/r/971490 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [14:41:31] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/971959 (owner: 10Ayounsi) [14:42:05] (03CR) 10Volans: [C: 03+1] "LGTM, bike shedding inline :D" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 (owner: 10Ayounsi) [14:42:19] (03PS5) 10BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) [14:42:29] (03PS1) 10Jbond: etcd::v3::aux_k8s_etcd: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971962 (https://phabricator.wikimedia.org/T349619) [14:42:31] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:42:33] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable healthcheck-dedicated backend in codfw [puppet] - 10https://gerrit.wikimedia.org/r/971922 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [14:42:45] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 (owner: 10Ayounsi) [14:42:48] (03CR) 10Jbond: [C: 03+2] etcd::v3::aux_k8s_etcd: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971962 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:43:05] urbanecm: all look good [14:43:11] ty, proceeding [14:43:25] !log urbanecm@deploy2002 urbanecm and tgr and matmarex: Continuing with sync [14:43:29] urbanecm, sure, no problem [14:43:36] i'll add it to the next window then [14:43:56] sounds good [14:44:15] (03PS1) 10Ottomata: eventgate chart - graceful restart policy with relative timeout/prestop_sleep values [deployment-charts] - 10https://gerrit.wikimedia.org/r/971963 (https://phabricator.wikimedia.org/T349823) [14:44:59] (03PS1) 10Vgutierrez: hiera: Remove digicert-2022 
certificates [puppet] - 10https://gerrit.wikimedia.org/r/971964 (https://phabricator.wikimedia.org/T341119) [14:45:12] urbanecm: regarding the maint script, could you stop the one that's running (and share the outputs, if you can access them)? i'll schedule restarting this and starting the next batch for the next window [14:45:40] MatmaRex: is it ok if i post output publicly? [14:45:51] yeah, on the task [14:46:00] it's 3.3MB of logs [14:46:05] i wonder if the s1 script finished or crashed [14:46:08] perfect :D [14:46:32] just within the file size limit [14:46:51] the logs shouldn't have anything worse than page titles and IDs in them [14:46:58] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [14:47:00] (03Abandoned) 10Ottomata: Remove deprecated all_settings streamconfigs param [deployment-charts] - 10https://gerrit.wikimedia.org/r/910471 (https://phabricator.wikimedia.org/T286344) (owner: 10Phuedx) [14:47:09] (03CR) 10Volans: "minor nits inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [14:47:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/326/con" [puppet] - 10https://gerrit.wikimedia.org/r/971964 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez) [14:47:33] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/971958 (owner: 10Ayounsi) [14:47:45] !log mwmaint2002: kill persistRevisionThreadItems.php maintenance script for s7 (T315510) [14:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:48] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:47:59] 10SRE, 10Thumbor: rendering of images in high res sometimes fails and fails permanently - 
https://phabricator.wikimedia.org/T350548 (10hnowlan) A cache purge of this image appears to have fixed this image. Unclear as to what initially caused this issue but it seems likely that it was a hangover from T344233 [14:48:07] 10SRE, 10Thumbor: rendering of images in high res sometimes fails and fails permanently - https://phabricator.wikimedia.org/T350548 (10hnowlan) 05Open→03Resolved [14:48:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10jbond) [14:48:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:39] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:966798|Generalize Meta/Commons exceptions for CentralAuth cookie handling (T257852)]], [[gerrit:971539|Restore OOUI dialog styles for compatibility (T350544)]] (duration: 13m 13s) [14:48:44] T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852 [14:48:44] T350544: OOUI widgets now have huge fonts and misaligned buttons in some places - https://phabricator.wikimedia.org/T350544 [14:48:45] (03CR) 10Ssingh: [C: 03+1] hiera: Remove digicert-2022 certificates [puppet] - 10https://gerrit.wikimedia.org/r/971964 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez) [14:49:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10jbond) @Volans FYI I'll update the decommission cookbook as part of this task, thanks for the pointer [14:49:06] (03CR) 10Marostegui: "If you are going to fully reimage this should work. If you weren't, you'd need to remove the 10.4 package first." 
[puppet] - 10https://gerrit.wikimedia.org/r/971431 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [14:49:08] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [14:49:10] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Remove digicert-2022 certificates [puppet] - 10https://gerrit.wikimedia.org/r/971964 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez) [14:49:12] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etcd::v3::aux_k8s_etcd [14:49:34] MatmaRex: i wouldn't be so sure. File size is too large. See https://www.mediawiki.org/wiki/Phabricator/Help#Uploading_file_attachments [14:49:38] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: idp [14:50:08] urbanecm: ugh, gzip it? [14:50:09] (03CR) 10Marostegui: [C: 03+1] "Even if they don't have data, if they get cloned from a 10.4, you'd still need to run: mysql_upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/971431 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [14:50:29] yup yup, just saying. the other one was much larger. [14:50:48] (03PS1) 10Jbond: idp: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971965 [14:50:54] MatmaRex: https://phabricator.wikimedia.org/T315510#9308702 [14:51:02] patches are deployed. anything else? [14:51:08] (03CR) 10Jbond: [C: 03+2] idp: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971965 (owner: 10Jbond) [14:51:21] MdsShakil: around for your deployment? [14:51:23] if i ever do anything like this again, i will make sure i won't have to deal with multi-MB logs and multi-month scripts [14:51:26] thank you urbanecm [14:51:38] thank you, would be appreciated! 
[14:51:58] urbanecm yah [14:52:12] (03CR) 10Urbanecm: [C: 03+2] Add autopatrol to Wikifunctions Staff group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970755 (https://phabricator.wikimedia.org/T350028) (owner: 10MdsShakil) [14:52:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970755 (https://phabricator.wikimedia.org/T350028) (owner: 10MdsShakil) [14:52:38] starting [14:52:54] (03Merged) 10jenkins-bot: Add autopatrol to Wikifunctions Staff group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970755 (https://phabricator.wikimedia.org/T350028) (owner: 10MdsShakil) [14:53:07] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:970755|Add autopatrol to Wikifunctions Staff group (T350028)]] [14:53:11] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:13] T350028: Give Wikifunctions Staff group autopatrol - https://phabricator.wikimedia.org/T350028 [14:53:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:53:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2195.codfw.wmnet with OS bookworm [14:54:29] !log urbanecm@deploy2002 urbanecm and mdsshakil: Backport for [[gerrit:970755|Add autopatrol to Wikifunctions Staff group (T350028)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:54:33] (03PS1) 10Fabfur: haproxy: enable healthcheck-dedicated backend in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/971966 (https://phabricator.wikimedia.org/T348851) [14:54:37] MdsShakil: please test :) [14:54:47] (03PS26) 10Bking: rdf-streaming-updater: update values 
for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:54:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T349053 [14:54:57] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T349053 [14:55:00] T349053: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T349053 [14:55:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2015-2016].codfw.wmnet,pc[1015-1016].eqiad.wmnet with reason: Upgrade [14:55:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2015-2016].codfw.wmnet,pc[1015-1016].eqiad.wmnet with reason: Upgrade [14:55:34] Ignore the s8 downtime [14:55:49] (03CR) 10Bking: [C: 03+2] admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:55:56] (03CR) 10Bking: admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:56:04] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: idp [14:56:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2015.codfw.wmnet with OS bookworm [14:56:27] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/971966 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [14:56:28] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: rpkivalidator [14:56:31] okay... 
patch works for me, proceeding [14:56:32] !log urbanecm@deploy2002 urbanecm and mdsshakil: Continuing with sync [14:57:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1015.eqiad.wmnet with OS bookworm [14:57:33] (03PS1) 10ArielGlenn: use virtual db domain for CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) [14:57:56] (03PS1) 10Jbond: rpkivalidator: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971968 (https://phabricator.wikimedia.org/T349619) [14:58:10] (03CR) 10Jbond: [C: 03+2] rpkivalidator: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971968 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:58:12] (03PS1) 10Marostegui: install_server: Do not reimage /srv on db2134 [puppet] - 10https://gerrit.wikimedia.org/r/971969 [14:58:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:59:50] urbanecm irc873 is not working [15:00:00] (03CR) 10CI reject: [V: 04-1] use virtual db domain for CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [15:00:16] kicked me out [15:00:31] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bookworm [15:00:54] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage /srv on db2134 [puppet] - 10https://gerrit.wikimedia.org/r/971969 (owner: 10Marostegui) [15:01:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [15:01:48] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:970755|Add autopatrol to Wikifunctions Staff group (T350028)]] (duration: 08m 41s) [15:01:51] T350028: Give Wikifunctions Staff group autopatrol - https://phabricator.wikimedia.org/T350028 
[15:02:54] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:03:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:03:43] ^ expected [15:04:45] !log finished upgrading all doh* hosts to dnsdist 1.8.2-1+wmf12u2 12 [15:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: rpkivalidator [15:05:25] (03PS2) 10Bking: admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) [15:06:44] (03CR) 10JMeybohm: [C: 03+1] "looks reasonable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971963 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [15:08:02] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:08:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::builder: improvements to update-production-images [puppet] - 10https://gerrit.wikimedia.org/r/971885 (owner: 10Giuseppe Lavagetto) [15:08:56] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:09:04] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [15:09:16] (03PS3) 10Bking: admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) [15:09:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1015.eqiad.wmnet with reason: host reimage [15:09:45] (03CR) 10DCausse: [C: 03+1] admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 
10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:09:48] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:10:18] (03CR) 10Bking: [C: 03+2] admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:10:32] (03CR) 10Btullis: C:bigtop::hadoop switch to new topology script. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [15:10:35] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [15:10:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [15:12:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [15:12:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1015.eqiad.wmnet with reason: host reimage [15:13:01] (03Merged) 10jenkins-bot: admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:13:09] 10SRE, 10Growth-Team, 10MW-on-K8s, 10MediaWiki-Platform-Team, and 5 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10Joe) 05Open→03Resolved [15:13:17] (03CR) 10JMeybohm: [C: 03+1] eventgate: Update mesh module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [15:14:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2015.codfw.wmnet with reason: host reimage [15:15:29] !log 
arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2195.codfw.wmnet with reason: host reimage [15:16:48] PROBLEM - Disk space on druid1004 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1004&var-datasource=eqiad+prometheus/ops [15:16:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Joe) >>! In T348122#9240973, @matmarex wrote: > The Kubernetes work so far has caused problems with cross-wiki Echo notifications (see T223413, T342... [15:17:08] MdsShakil31: no worries. Should be live. [15:17:47] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:17:55] urbanecm Thank you [15:17:59] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:18:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2015.codfw.wmnet with reason: host reimage [15:18:19] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:19:42] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[15:19:59] (03PS1) 10Volans: sre.hosts.reimage: fix config master cumin query [cookbooks] - 10https://gerrit.wikimedia.org/r/971976 [15:20:46] (03CR) 10JMeybohm: [C: 03+1] Update PHP hook to use Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/971921 (owner: 10Muehlenhoff) [15:21:04] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [15:22:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [15:23:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/971976 (owner: 10Volans) [15:24:01] (03CR) 10FNegri: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/971976 (owner: 10Volans) [15:25:30] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix config master cumin query [cookbooks] - 10https://gerrit.wikimedia.org/r/971976 (owner: 10Volans) [15:25:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqsin and A:cp [15:26:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): puppet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jbond) [15:28:47] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10MatthewVernon) The thanos account has been created and is ready to go, the client software needs to be told... 
[15:28:59] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2195.codfw.wmnet with OS bookworm [15:30:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1015.eqiad.wmnet with OS bookworm [15:30:29] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix config master cumin query [cookbooks] - 10https://gerrit.wikimedia.org/r/971976 (owner: 10Volans) [15:30:47] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.wikimedia.org with OS bookworm [15:30:59] (03CR) 10Arnaudb: [C: 03+2] mariadb: hieradata to install 10.6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971431 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [15:31:23] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2015.codfw.wmnet with OS bookworm [15:32:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2016.codfw.wmnet with OS bookworm [15:32:51] (03CR) 10BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [15:33:27] (03CR) 10Ayounsi: Add MoveServersUplinks Netbox script (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [15:37:07] RECOVERY - Disk space on druid1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1004&var-datasource=eqiad+prometheus/ops [15:37:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1016.eqiad.wmnet with OS bookworm [15:37:31] (03PS1) 10FNegri: P:openstack:codfw1dev fix wrong hostname [puppet] - 10https://gerrit.wikimedia.org/r/971979 (https://phabricator.wikimedia.org/T350154) [15:38:24] (03CR) 10FNegri: "Fixing a stupid mistake in my previous patch :/" [puppet] - 10https://gerrit.wikimedia.org/r/971979 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [15:40:37] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [15:41:30] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [15:43:20] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1001.wikimedia.org with reason: host reimage [15:44:45] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:44:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Upgrade [15:45:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Upgrade [15:45:16] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:46:00] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1001.wikimedia.org with reason: host reimage [15:47:06] (03CR) 10Majavah: [C: 03+1] P:openstack:codfw1dev fix wrong hostname [puppet] - 10https://gerrit.wikimedia.org/r/971979 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [15:49:53] (03CR) 10FNegri: [C: 
03+2] P:openstack:codfw1dev fix wrong hostname [puppet] - 10https://gerrit.wikimedia.org/r/971979 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [15:50:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [15:50:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [15:51:35] (03CR) 10JMeybohm: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [15:52:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [15:53:11] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:54:13] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [15:55:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:00:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:02:57] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1001.wikimedia.org with OS bookworm [16:04:16] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 
10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [16:04:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [16:04:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, and 2 others: update pcc with puppet 7 support - https://phabricator.wikimedia.org/T236373 (10jbond) 05Open→03Resolved a:03jbond This is done [16:05:07] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10jbond) 05Open→03Resolved a:03jbond [16:08:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2016.codfw.wmnet with OS bookworm [16:09:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1016.eqiad.wmnet with OS bookworm [16:10:49] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:10:58] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:13:47] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [16:16:05] (03CR) 10Btullis: [C: 03+1] "Drive-by review, but looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:17:15] (03PS3) 10Btullis: Configure the new mariadb servers to be replicas [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) [16:17:28] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [16:20:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:21:05] (03CR) 10Ottomata: [C: 03+2] eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [16:22:14] (03Merged) 10jenkins-bot: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [16:22:35] (03PS2) 10Ottomata: eventgate chart - graceful restart policy with relative timeout/prestop_sleep values [deployment-charts] - 10https://gerrit.wikimedia.org/r/971963 (https://phabricator.wikimedia.org/T349823) [16:22:44] (03CR) 10Ottomata: [C: 03+2] eventgate chart - graceful restart policy with relative timeout/prestop_sleep values [deployment-charts] - 10https://gerrit.wikimedia.org/r/971963 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [16:23:08] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10User-jbond: wmf-styleguide checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (10jbond) 05Open→03Declined won't fix, this is quite an edge case and currently not worth fixing [16:23:41] (03Merged) 10jenkins-bot: eventgate chart - graceful restart policy with relative timeout/prestop_sleep 
values [deployment-charts] - 10https://gerrit.wikimedia.org/r/971963 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [16:25:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:26:05] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) a:03Dzahn [16:26:40] (03PS1) 10Ottomata: wgEventServices - add docs about timeout settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971986 (https://phabricator.wikimedia.org/T349823) [16:26:46] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs1014.eqiad.wmnet [16:27:31] (03CR) 10Ottomata: [C: 03+2] wgEventServices - add docs about timeout settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971986 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [16:27:53] (03PS4) 10Btullis: Configure the new mariadb servers to be replicas [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) [16:28:01] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971942 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [16:28:22] (03Merged) 10jenkins-bot: wgEventServices - add docs about timeout settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971986 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [16:28:28] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2011.codfw.wmnet with OS bookworm [16:29:50] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [16:29:57] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [16:30:04] jan_drewniak: #bothumor 
I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1630). [16:30:15] o/ [16:30:55] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:32:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet CI, 10Continuous-Integration-Config: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10Volans) [16:33:06] (03PS1) 10Ottomata: eventgate-main - set prestop_sleep and terminiation timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/971989 (https://phabricator.wikimedia.org/T349823) [16:33:07] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1014.eqiad.wmnet [16:34:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Epic, and 2 others: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645 (10jbond) [16:34:51] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet-Core, and 3 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond) 05Open→03Resolved a:03jbond im going to close this task down ii think we should create a new task to add additional checks... 
[16:35:10] (03PS2) 10Ottomata: eventgate-main - set prestop_sleep and terminiation timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/971989 (https://phabricator.wikimedia.org/T349823) [16:36:02] (03CR) 10Ottomata: [C: 03+2] eventgate-main - set prestop_sleep and terminiation timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/971989 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [16:36:21] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971990 (https://phabricator.wikimedia.org/T128546) [16:37:01] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971991 (https://phabricator.wikimedia.org/T128546) [16:37:03] (03Merged) 10jenkins-bot: eventgate-main - set prestop_sleep and terminiation timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/971989 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [16:37:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:38:29] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:39:49] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971991 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:40:42] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971991 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:41:11] !log beginning deployments of eventgate clusters: mesh and cert chart updates, as well 
as sleep timeout values for graceful envoy+eventgate container termination - T349823 T300033 T346638 [16:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:18] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [16:41:18] T349823: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 [16:41:19] T346638: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 [16:41:30] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [16:41:44] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [16:43:29] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [16:43:48] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [16:44:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [16:44:54] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [16:45:19] (03CR) 10Marostegui: [C: 03+1] mariadb::ferm: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971452 (owner: 10Muehlenhoff) [16:45:36] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [16:48:24] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [16:48:52] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [16:49:15] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [16:49:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [16:49:28] !log jdrewniak@deploy2002 Synchronized 
portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:971991| Bumping portals to master (T128546)]] (duration: 05m 53s) [16:49:33] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:49:55] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [16:50:07] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:24] (03PS10) 10Herron: prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) [16:55:03] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:971991| Bumping portals to master (T128546)]] (duration: 05m 34s) [16:55:07] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:56:24] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [16:56:26] (03PS2) 10ArielGlenn: use virtual db domain for CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) [16:56:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [16:59:48] (03CR) 10Herron: [C: 03+2] prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 
(https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [16:59:55] (03PS1) 10EoghanGaffney: [apt-staging] Add fake secrets fro rsyncd secrets [labs/private] - 10https://gerrit.wikimedia.org/r/971993 [17:00:27] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:29] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [17:01:53] (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] [apt-staging] Add fake secrets fro rsyncd secrets [labs/private] - 10https://gerrit.wikimedia.org/r/971993 (owner: 10EoghanGaffney) [17:02:26] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10MoritzMuehlenhoff) [17:02:48] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10MoritzMuehlenhoff) 05Open→03Declined Given that puppet-lint upstream doesn't flag it, we'll also stick it, marking as declined. 
[17:03:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm [17:03:15] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:16] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.wikimedia.org with OS bookworm [17:03:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10Jclark-ctr) a:03VRiley-WMF [17:03:46] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2011.codfw.wmnet with reason: host reimage [17:04:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The patch LGTM; we do need to flip the switch and enable prometheus-statsd-exporter on the mw-debug deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [17:04:42] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) @cmooney i have not seen any new faults on this ticket. are you ok closing this ticket? 
[17:05:58] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [17:07:00] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2011.codfw.wmnet with reason: host reimage [17:07:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:09:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:15:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage [17:17:38] (03PS1) 10Ottomata: eventgate chart - remove unused value [deployment-charts] - 10https://gerrit.wikimedia.org/r/971995 [17:17:59] (03PS2) 10Ottomata: eventgate chart - remove unused value [deployment-charts] - 10https://gerrit.wikimedia.org/r/971995 [17:19:08] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage [17:19:29] (03PS1) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 
10https://gerrit.wikimedia.org/r/971996 [17:20:01] (03CR) 10CI reject: [V: 04-1] [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [17:22:01] (03PS2) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 [17:24:28] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2011.codfw.wmnet with OS bookworm [17:28:29] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:27] (03PS1) 10Jforrester: wikifunctions: Bump evaluators to 2023-11-06-164826 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971998 (https://phabricator.wikimedia.org/T281500) [17:29:39] (03PS1) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) [17:30:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:35:50] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1002.wikimedia.org with OS bookworm [17:37:09] (03PS1) 10Ottomata: eventgate chart - set default cpu limits to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972000 (https://phabricator.wikimedia.org/T349823) [17:37:52] (03PS2) 10Ottomata: eventgate chart - set default cpu limits to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972000 (https://phabricator.wikimedia.org/T349823) [17:38:04] (03CR) 10Ottomata: [C: 03+2] eventgate chart - remove unused value [deployment-charts] - 10https://gerrit.wikimedia.org/r/971995 (owner: 10Ottomata) 
[17:39:06] (03Merged) 10jenkins-bot: eventgate chart - remove unused value [deployment-charts] - 10https://gerrit.wikimedia.org/r/971995 (owner: 10Ottomata) [17:39:12] (03CR) 10JMeybohm: [C: 03+1] eventgate chart - set default cpu limits to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972000 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [17:39:18] (03CR) 10Ottomata: [C: 03+2] eventgate chart - set default cpu limits to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972000 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [17:40:26] (03Merged) 10jenkins-bot: eventgate chart - set default cpu limits to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972000 (https://phabricator.wikimedia.org/T349823) (owner: 10Ottomata) [17:41:38] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:41:41] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:44:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:45:33] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:46:19] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003'] [17:46:24] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:46:38] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:48:27] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:48:42] !log otto@deploy2002 helmfile [staging] DONE 
helmfile.d/services/eventgate-analytics: apply [17:48:49] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [17:49:07] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:50:01] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:51:02] (03CR) 10FNegri: "The PCC looks good, so I'm not gonna check every little detail, but I want to go through all the files quickly before adding my +1." [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:52:14] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:52:20] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:52:26] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:54:16] (03CR) 10Jelto: [C: 03+1] "lgtm, I can deploy this tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/971502 (https://phabricator.wikimedia.org/T350478) (owner: 10Ahmon Dancy) [17:55:41] !log milimetric@deploy2002 Started deploy [analytics/refinery@0239c23]: Publishing refinery-source jars at 0.2.24 [17:55:42] jouncebot: nowandnext [17:55:42] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [17:55:42] In 0 hour(s) and 4 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1800) [17:55:43] In 0 hour(s) and 4 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1800) [17:56:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['cloudrabbit1003'] [17:58:09] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003'] [17:59:11] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [17:59:15] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:59:15] (03PS5) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [17:59:21] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1800) [18:00:07] ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1800). [18:02:26] (03CR) 10Marostegui: "Is PCC happy with this change?" 
[puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [18:02:50] (03CR) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [18:03:21] !log milimetric@deploy2002 Finished deploy [analytics/refinery@0239c23]: Publishing refinery-source jars at 0.2.24 (duration: 07m 39s) [18:03:54] (03PS6) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) [18:03:56] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:04:03] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:04:19] (03CR) 10Effie Mouzeli: (WIP) ipoid: add cronjobs for initialImport and dailyUpdate (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971933 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [18:04:51] (03Abandoned) 10Effie Mouzeli: [WIP] ipoid: Set an initialImport cron job [deployment-charts] - 10https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:05:02] (03Abandoned) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:05:22] (03Abandoned) 10Effie Mouzeli: ipoid: Enable the cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/967243 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:05:51] (03PS1) 10Ladsgroup: Add pc4 to to list of clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) [18:09:53] !log milimetric@deploy2002 Started deploy 
[analytics/refinery@0239c23] (thin): Publishing refinery-source jars at 0.2.24 [18:10:00] !log milimetric@deploy2002 Finished deploy [analytics/refinery@0239c23] (thin): Publishing refinery-source jars at 0.2.24 (duration: 00m 07s) [18:11:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003'] [18:13:29] (03PS13) 10RhinosF1: Revert "wikistats:wikia: pause updates while changes are made to table" [puppet] - 10https://gerrit.wikimedia.org/r/971530 [18:14:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10VRiley-WMF) Ran the cable and plugged it into requested ports. [18:15:13] (03CR) 10Marostegui: Add pc4 to to list of clusters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:15:15] (03CR) 10Dzahn: [C: 03+2] Revert "wikistats:wikia: pause updates while changes are made to table" [puppet] - 10https://gerrit.wikimedia.org/r/971530 (owner: 10RhinosF1) [18:18:20] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudrabbit1003.wikimedia.org with OS bookworm [18:19:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm [18:25:00] (03PS2) 10Ladsgroup: Add pc4 to to list of clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) [18:25:03] (03CR) 10Ladsgroup: Add pc4 to to list of clusters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:26:17] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [18:30:07] (03PS1) 10Bking: rbac: permit deploy-flink 
user to create flinkdeployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/972005 (https://phabricator.wikimedia.org/T349095) [18:30:31] (03PS3) 10Ladsgroup: Add pc4 to the list of ParserCache clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) [18:33:00] (03CR) 10Marostegui: Add pc4 to the list of ParserCache clusters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:33:38] (03CR) 10JMeybohm: [C: 03+1] rbac: permit deploy-flink user to create flinkdeployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/972005 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [18:33:54] (03PS4) 10Ladsgroup: Add pc4 to the list of ParserCache clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) [18:33:58] (03CR) 10Ladsgroup: Add pc4 to the list of ParserCache clusters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:35:15] (03CR) 10DCausse: [C: 03+1] rbac: permit deploy-flink user to create flinkdeployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/972005 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [18:36:35] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10matmarex) It seems it was the same cause, as both issues look fixed to me. Thanks! 
[18:36:51] (03CR) 10Marostegui: [C: 03+1] Add pc4 to the list of ParserCache clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:37:48] jouncebot: nowandnext [18:37:49] For the next 0 hour(s) and 22 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1800) [18:37:49] In 2 hour(s) and 22 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T2100) [18:38:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:39:22] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@048362b]: (no justification provided) [18:39:48] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:39:51] (03Merged) 10jenkins-bot: Add pc4 to the list of ParserCache clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972004 (https://phabricator.wikimedia.org/T350367) (owner: 10Ladsgroup) [18:39:52] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@048362b]: (no justification provided) (duration: 00m 29s) [18:40:05] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:972004|Add pc4 to the list of ParserCache clusters (T350367)]] [18:40:11] T350367: Add pc4 to mediawiki's parsercache clusters - https://phabricator.wikimedia.org/T350367 [18:41:20] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:972004|Add pc4 to the list of ParserCache clusters (T350367)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:43:54] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:44:46] (03CR) 10Bking: [C: 03+2] rbac: permit 
deploy-flink user to create flinkdeployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/972005 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [18:47:27] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:47:41] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:47:48] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:48:54] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:49:37] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:972004|Add pc4 to the list of ParserCache clusters (T350367)]] (duration: 09m 32s) [18:49:41] T350367: Add pc4 to mediawiki's parsercache clusters - https://phabricator.wikimedia.org/T350367 [18:49:45] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:49:52] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:53:11] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:20] (03PS1) 10Marostegui: mariadb: Enable pc4 notifications [puppet] - 10https://gerrit.wikimedia.org/r/972007 (https://phabricator.wikimedia.org/T350367) [19:04:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:04] (03CR) 10Marostegui: "They are all green in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/972007 (https://phabricator.wikimedia.org/T350367) (owner: 10Marostegui) [19:05:13] Amir1: ^ I am going to merge [19:07:17] marostegui: thanks [19:07:21] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable pc4 notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/972007 (https://phabricator.wikimedia.org/T350367) (owner: 10Marostegui) [19:07:25] sorry I forgot [19:10:44] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) I am happy to take this on and create the VMs and my team is ok with being called the owner in puppet. We should j... [19:15:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:24] (03PS1) 10RobH: updating robh ed key [puppet] - 10https://gerrit.wikimedia.org/r/972009 [19:23:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10cmooney) Thanks @VRiley-WMF. Right now we can't see the status as the port needs to be enabled for 100G. But that involves resetting PIC 0/1 completely which wil... [19:24:02] (03CR) 10RobH: [C: 03+2] updating robh ed key [puppet] - 10https://gerrit.wikimedia.org/r/972009 (owner: 10RobH) [19:41:12] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:41:19] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:43:53] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudrabbit1003.wikimedia.org with OS bookworm [19:44:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm [19:47:08] 10ops-esams: ManagementSSHDown - https://phabricator.wikimedia.org/T344593 (10RobH) 05Open→03Resolved a:03RobH These are always false positives, as they are online and responsive when I check. 
[19:48:45] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [19:48:50] 10SRE, 10ops-esams, 10Documentation: Update on-wiki documentation about esams - https://phabricator.wikimedia.org/T344129 (10RobH) What needs to be updated? The commands listed still work for esams? As far as I can tell there aren't any required updates, since the FQDN didn't shift for use in esams. [19:48:57] 10SRE, 10ops-esams, 10Documentation: Update on-wiki documentation about esams - https://phabricator.wikimedia.org/T344129 (10RobH) 05Open→03Resolved a:03RobH [19:49:20] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (10RobH) 05Open→03Resolved a:03RobH [19:49:57] (03PS1) 10Jdlrobson: Avoid nullish coalescing operators [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972029 (https://phabricator.wikimedia.org/T350519) [19:50:38] 10SRE, 10ops-esams, 10DC-Ops: cp3060 idrac https interface failures - https://phabricator.wikimedia.org/T308797 (10RobH) 05Open→03Declined server replaced and decommissioned [19:51:41] 10SRE, 10ops-esams, 10DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (10RobH) 05Stalled→03Resolved These were added to the librenms and monitoring in the past without this task being updated. Resolving. 
[19:53:07] (03PS1) 10Ebernhardson: rdf-streaming-updater: Defined allowed zk clusters for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/972014 [19:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:57:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [20:01:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [20:02:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:27] (03PS27) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [20:07:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:48] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [20:10:00] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [20:10:09] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [20:10:12] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [20:12:10] !log 
bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [20:12:18] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [20:18:19] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:24] (03Abandoned) 10JHathaway: Hieradata: format yaml with vinyl [puppet] - 10https://gerrit.wikimedia.org/r/754114 (https://phabricator.wikimedia.org/T236954) (owner: 10JHathaway) [20:20:45] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.wikimedia.org with OS bookworm [20:26:59] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972016 [20:29:31] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [20:29:59] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [20:31:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:04] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972016 (owner: 10Ebernhardson) [20:32:47] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972016 (owner: 
10Ebernhardson) [20:32:56] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [20:34:49] (03CR) 10Ladsgroup: use virtual db domain for CentralAuth (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [20:38:59] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [20:40:22] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [20:41:53] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [20:42:18] (03PS4) 10Anzx: mznwiki: add project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971529 (https://phabricator.wikimedia.org/T350397) [20:46:08] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [20:46:33] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [20:47:22] (03PS1) 10Jdlrobson: Omit the last modified bar in the HTML rather than hiding it via CSS [skins/MinervaNeue] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972030 (https://phabricator.wikimedia.org/T350515) [20:48:54] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [20:49:49] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [20:53:22] (03PS28) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [20:53:29] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) https://github.com/goccy/go-yaml project does a much better job of retaining formatting when decoding yaml documents. I am going to i... 
[20:54:30] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:54:39] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:56:38] 10SRE, 10Abstract Wikipedia team, 10serviceops, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) 05In progress→03Resolved [20:59:31] (03PS1) 10Ottomata: eventgate-* - bump image and remove now defunct stream config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/972046 (https://phabricator.wikimedia.org/T326002) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T2100). [21:00:05] Jhs, Jdlrobson, and aanzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:07] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [21:00:29] (03CR) 10CI reject: [V: 04-1] eventgate-* - bump image and remove now defunct stream config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/972046 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [21:00:54] present [21:01:25] o/ [21:01:56] ✋ present [21:03:56] (03PS2) 10Ottomata: eventgate-* - bump image and remove now defunct stream config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/972046 (https://phabricator.wikimedia.org/T326002) [21:04:12] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [21:06:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:50] hi - is a deployer still needed? i can deploy if so (sorry to be late - just got off a mtg) [21:08:17] cjming: if you could that would be great [21:08:19] (03CR) 10Ottomata: [C: 03+2] eventgate-* - bump image and remove now defunct stream config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/972046 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [21:08:21] i have an UBN [21:09:17] (03CR) 10Clare Ming: [C: 03+2] Avoid nullish coalescing operators [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972029 (https://phabricator.wikimedia.org/T350519) (owner: 10Jdlrobson) [21:09:28] (03Merged) 10jenkins-bot: eventgate-* - bump image and remove now defunct stream config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/972046 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [21:09:30] (03CR) 10Clare Ming: [C: 03+2] Omit the last modified bar in the HTML rather than hiding it via CSS [skins/MinervaNeue] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972030 
(https://phabricator.wikimedia.org/T350515) (owner: 10Jdlrobson) [21:10:23] will do - Jhs: looks like your change is merged - i'll go ahead and scap it [21:11:28] Jhs: does your change still need to be deployed? [21:11:28] cjming, cool. i'm not quite sure what is necessary for it to go live tbh. urbanecm said something about an i18n rebuild? [21:12:04] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [21:12:09] hmm - i'll ask about it [21:12:12] cjming, yeah, it'd be ideal to have that change live before we start importing stuff into the new zgh.wikipedia [21:12:31] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [21:12:59] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [21:13:15] cjming: Jhs: I'm here (mobile only), lemme look at the change. [21:13:45] urbanecm: thanks - ya, i'm not sure what needs to happen - does a script need to be run? [21:13:48] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [21:14:07] urbanecm: did you already scap backport Jhs's patch? [21:15:01] cjming: the change listed in the calendar is for master rather than the deployment branch. In theory, when scheduling, Jhs should've created a wmf.3 backport and scheduled that one. I don't see such cherry pick in Gerrit, so presumably that is yet to be created and deployed. [21:15:22] So no, didn't deploy it, I just code reviewed the patch for master, which is a prerequisite for backporting. [21:15:46] got it - Jhs - can you create the cherry pick to wmf.3? 
[21:16:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:19] cjming, yes, will do [21:16:27] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [21:16:44] Jhs: thanks - if you can also add it to the deployment cal, that'd be great [21:16:47] (03PS1) 10Jon Harald Søby: [Languages] Add namespace translations for zgh [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972031 [21:17:32] cjming: The i18n rebuild thing is just a warning that scap backport might take much longer than usual, because the patch touches i18n, which means scap has to recreate the i18n cache, which can take a long time. [21:17:34] both done [21:18:09] (Or more than a warning, it's the reason why I didn't do it in the previous window :) ) [21:18:28] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [21:18:34] urbanecm: cool - so no maint script or anything -- long time like hours? [21:18:52] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [21:19:07] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [21:19:21] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [21:19:59] cjming: yup, it's standard deployment procedure, except the duration. I'd expect it to finish within an hour or so. [21:20:53] ok - then if it's ok, i'll do aanzx's patch next while we wait for Jon's patches to merge and i'll do Jhs's patch last [21:21:22] fine with me 👍 [21:21:28] urbanecm: one more Q while i have you - does a script need to be run for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/971529/? 
[21:21:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971529 (https://phabricator.wikimedia.org/T350397) (owner: 10Anzx) [21:22:35] cjming, namespaceDupes.php should be run for that one i believe [21:22:53] cjming: tricky question :). Normally, yes, namespaceDupes would be what you'd run. Unfortunately, namespaceDupes is currently broken and cannot be run on production, cf https://phabricator.wikimedia.org/T350443 and the recent ops-l [21:22:58] *ops-l mail [21:23:13] Oh ye [21:23:19] (03CR) 10Urbanecm: [C: 04-2] mznwiki: add project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971529 (https://phabricator.wikimedia.org/T350397) (owner: 10Anzx) [21:23:36] Me, zabe and i think thcipriani broke s5 with namespaceDupes [21:23:37] there are two pages that would be affected by that patch: https://mzn.wikipedia.org/wiki/%D8%B4%D8%A7:%D9%86%D9%85%D8%A7%DB%8C%D9%87_%D9%BE%DB%8C%D8%B4%D9%88%D9%86%D8%AF%DB%8C/%D9%BE%D8%B1%D9%88%DA%98%D9%87 [21:23:40] I've -2'ed temporarily, to prevent it from merging [21:23:56] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [21:23:58] whoops - i already started scap backport on that one [21:24:11] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [21:24:16] Maybe move those two pages manually to something else first, then move them to the new namespace when it exists? 
[21:24:20] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) P53144 [21:24:25] Jhs: yes, that's what we need to do. [21:24:47] cjming: you should be able to abort. Jenkins won't merge with urbanecm's -2 [21:24:56] Assuming there are no deleted pages, or that we're fine with the deleted pages being temporarily inaccessible. [21:25:31] cjming: please abort the scap backport, should be harmless at this stage, and let's figure out what we want to actually do with this change. [21:25:34] I don't think deleted pages are a risk [21:25:47] And 2 are easy enough to manually fix [21:26:11] would probably be best if someone with rights to move without leaving a redirect could move them [21:26:20] Jhs: do you have the correct permissions to be able to do the move (including without leaving the redirect)? [21:26:26] You asked that so I guess no [21:26:35] yeah, hehe ^^ [21:26:45] I'm not sure if urbanecm could do with sysadmin / steward hat on [21:26:46] (03Merged) 10jenkins-bot: Avoid nullish coalescing operators [skins/Vector] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972029 (https://phabricator.wikimedia.org/T350519) (owner: 10Jdlrobson) [21:26:46] (by "yeah" i mean "no") [21:26:51] 10-4 -- my terminal is just frozen at "awaiting-backport-merges" - just Ctrl-C and carry on? 
[21:26:54] Nor if he's comfortable [21:26:59] (03Merged) 10jenkins-bot: Omit the last modified bar in the HTML rather than hiding it via CSS [skins/MinervaNeue] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972030 (https://phabricator.wikimedia.org/T350515) (owner: 10Jdlrobson) [21:27:12] cjming: for that change, ye just ctrl+c [21:27:14] !log eventgate-analytics-external - deploy change to remove 'dynamic' stream config support, instead just re-cache stream configs every 60s - https://phabricator.wikimedia.org/T326002 [21:27:15] I'm happy to do that, but I'm on mobile. Give me 20 mins or so and I can help out. [21:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:20] Although maybe file a scap bug [21:27:29] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [21:27:43] We can hold the patch for the end of the window and urbanecm can help when he's on his laptop [21:27:54] I say we, I'm disappearing now [21:28:03] sounds good - then moving onto Jdlrobson's backports [21:28:12] And as RhinosF1 says. I've applied a -2, so it will never merge. 
Ctrl+C is the only way out at this point :) [21:28:25] roger that [21:28:31] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [21:29:41] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [21:30:02] I left a message about better scap behaviour in -releng [21:30:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:33] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [21:30:35] !log cjming@deploy2002 Started scap: Backport for [[gerrit:972029|Avoid nullish coalescing operators (T350519)]] [21:30:40] T350519: Regression: Syntax error in Vector gadget API for portlets and older browsers - https://phabricator.wikimedia.org/T350519 [21:31:25] (03PS6) 10BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) [21:31:50] !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:972029|Avoid nullish coalescing operators (T350519)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:31:55] Jdlrobson: can you test? 
[21:32:30] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [21:33:08] Jdlrobson: i think both of yours might be going out together - hopefully that's not problematic [21:35:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:23] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:52] (03PS29) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [21:37:02] (03CR) 10CI reject: [V: 04-1] [Languages] Add namespace translations for zgh [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972031 (owner: 10Jon Harald Søby) [21:37:54] (03PS1) 10Ottomata: eventgate-main - use 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/972050 (https://phabricator.wikimedia.org/T347477) [21:37:55] cjming: that's fine [21:38:13] just wrapping up testing [21:38:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) No new errors reported since the 1st. I'm not clear on if that means we've fixed the problem or not; there a... 
[21:38:25] (03CR) 10Ottomata: [C: 03+2] eventgate-main - use 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/972050 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [21:39:13] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:14] (03PS30) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [21:40:04] cjming: this LGTM please merge [21:40:10] they can both go out together not a problem [21:40:26] cool - syncing [21:40:30] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [21:41:34] (03Merged) 10jenkins-bot: eventgate-main - use 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/972050 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [21:42:11] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [21:42:30] cjming Can add two config patches to the schedule (can also be merged together if you want)? [21:42:39] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [21:44:18] cjming: Jhs: fyi, i'm fully around now. looking at the "workaround the script" problem now. [21:44:38] nice [21:45:24] urbanecm: thanks -- do you think it's ok to squeeze in Superpes15's config patches before Jhs' goes out given that will take a while? 
[21:45:32] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [21:45:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:44] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:972029|Avoid nullish coalescing operators (T350519)]] (duration: 15m 09s) [21:45:46] cjming, also fine with me, yeah :) [21:45:48] T350519: Regression: Syntax error in Vector gadget API for portlets and older browsers - https://phabricator.wikimedia.org/T350519 [21:45:55] thanks cjming for the help! Talk to you in a bit :) [21:46:07] np! yw :) [21:46:26] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [21:46:54] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [21:47:19] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [21:47:30] Superpes15: do you have your patches ready? can you add to cal? [21:47:32] cjming: well, we've 15 mins out of the window, which is immediately followed by secteam's window (2 hours), which is not a lot of time :). 
dunno if secteam plans to deploy something in their window; if not, we can overflow the window as much as we want to, but if they do, we should probably plan to finish on time [21:47:49] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:48:03] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:48:05] cjming Yep I already add them on deployments [21:48:09] *added [21:48:15] (03PS3) 10Superpes15: [plwiki] Add 'abusefilter-log-private' flag to sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) [21:48:39] Superpes15: sorry - i spoke too soon - i think we may have to defer your patches [21:48:49] Yep just read no problem [21:49:22] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:49:25] sbassett: are you deploying next ? [21:49:28] so we have time for ~1 normal deployment. i think whether we want to start with Jhs's patch depends on whether secteam has stuff to deploy and whether cjming (or well, m) is okay with staying over [21:50:20] Unfortunately I'm not fine with any deployment window! I'll see if someone else can test the patches during another window in the next days :D [21:50:51] urbanecm: got it - i'll hang out to see what secteam says [21:51:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:41] Superpes: shall we do yours after all? or do you want to reschedule? 
[21:53:04] (03CR) 10Urbanecm: mznwiki: add project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971529 (https://phabricator.wikimedia.org/T350397) (owner: 10Anzx) [21:53:28] Uhm don't want to create trouble, I can try to reschedule them, no issue :) [21:53:29] cjming: i moved the one offending page on mznwiki, should be deployable now (-2 removed), then i can move back. [21:53:43] (03PS31) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [21:53:57] (-2 removed => removed my -2 at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/971529/) [21:54:00] Superpes: thanks [21:55:06] urbanecm: so i should proceed with aanzx's patch? [21:55:09] correct [21:55:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971529 (https://phabricator.wikimedia.org/T350397) (owner: 10Anzx) [21:56:19] (03Merged) 10jenkins-bot: mznwiki: add project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971529 (https://phabricator.wikimedia.org/T350397) (owner: 10Anzx) [21:56:35] !log cjming@deploy2002 Started scap: Backport for [[gerrit:971529|mznwiki: add project namespace (T350397)]] [21:56:39] T350397: Active the project namespace on Mazandarani Wikipedia - https://phabricator.wikimedia.org/T350397 [21:57:53] !log cjming@deploy2002 cjming and anzx: Backport for [[gerrit:971529|mznwiki: add project namespace (T350397)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:57:56] cjming: checking [21:58:08] ty [22:00:05] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T2200). 
[22:00:26] cjming: looks good [22:00:31] great - going live [22:00:37] !log cjming@deploy2002 cjming and anzx: Continuing with sync [22:00:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:46] (03PS32) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:05:44] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:971529|mznwiki: add project namespace (T350397)]] (duration: 09m 09s) [22:05:48] T350397: Active the project namespace on Mazandarani Wikipedia - https://phabricator.wikimedia.org/T350397 [22:05:55] aanzx: should be live! [22:06:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:40] urbanecm: should i close the window? 
i'm not clear yet if we can go over [22:07:33] we're six mins into the secteam hold window, and there seems to be no one from sec team around, so i guess that we'd be fine running over [22:07:34] Thanks cjming , urbanecm can you move back that page now [22:07:37] will do [22:07:43] Thanks [22:08:59] so i'm happy to do Jhs' patch then - i guess i'll go for it unless someone intervenes in the next few mins [22:09:28] sounds good [22:09:55] aanzx: should be done [22:10:22] (03PS2) 10Clare Ming: [Languages] Add namespace translations for zgh [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972031 (owner: 10Jon Harald Søby) [22:13:01] urbanecm: page link shows up as redlink in recent changes, when I open it content appears [22:13:10] i'd say that's cache [22:14:31] Ok, thanks [22:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:35] (03CR) 10Jforrester: [C: 03+1] Do not try to use Thumbor on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971623 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza) [22:17:40] Jhs: are you still around? 
sorry it's so late - i can still do yours if you'd like (and if it's ok to highjack the security deployment window) -- waiting for rebase to finish [22:20:04] cjming, yeah, i'm around :) [22:20:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:54] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:24:04] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:25:54] Jhs: alrighty - then forging ahead (i forgot how long CI takes sometimes 😮) - will scap it when it's ready [22:29:01] great [22:29:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972031 (owner: 10Jon Harald Søby) [22:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:00] (03PS33) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:34:17] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:34:26] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:35:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:05] (03PS34) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:40:33] (03PS1) 10JHathaway: reuse-parts.sh: 
remove bashisms [puppet] - 10https://gerrit.wikimedia.org/r/972061 [22:41:08] (03PS2) 10JHathaway: reuse-parts.sh: remove bashisms [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) [22:41:17] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:41:23] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:42:25] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/972061 (https://phabricator.wikimedia.org/T95064) (owner: 10JHathaway) [22:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:33] (03Merged) 10jenkins-bot: [Languages] Add namespace translations for zgh [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972031 (owner: 10Jon Harald Søby) [22:46:45] !log cjming@deploy2002 Started scap: Backport for [[gerrit:972031|[Languages] Add namespace translations for zgh]] [22:48:02] !log cjming@deploy2002 cjming and jhsoby: Backport for [[gerrit:972031|[Languages] Add namespace translations for zgh]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:48:16] Jhs: shall i sync? 
[22:50:51] (03PS35) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:50:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:06] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:52:12] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:52:48] cjming, yeah, sure [22:52:54] !log cjming@deploy2002 cjming and jhsoby: Continuing with sync [22:52:57] (sorry, not noticing the pings easily) [22:54:41] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:14] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:972031|[Languages] Add namespace translations for zgh]] (duration: 11m 28s) [22:58:37] Jhs: should be live! 
[22:59:33] cjming, indeed, looks good [22:59:58] links to special pages in the sidebar don't work any more, but i assume that's just a caching issue of some sort [23:00:09] links to special pages in the tool menu work as they should [23:00:18] (03PS36) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [23:00:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:01] !log end of UTC late backport window [23:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:06] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [23:02:13] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [23:04:52] Jhs: ya - i'm guessing caching [23:06:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:47] (03PS1) 10Andrew Bogott: backup_cinder_volumes: cleanup (including old backups for deleted volumes) [puppet] - 10https://gerrit.wikimedia.org/r/972065 [23:15:36] (03CR) 10Andrew Bogott: [C: 03+2] backup_cinder_volumes: cleanup (including old backups for deleted volumes) [puppet] - 10https://gerrit.wikimedia.org/r/972065 (owner: 10Andrew Bogott) [23:16:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:35] RECOVERY - 
Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure