[00:38:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/933673 [00:38:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/933673 (owner: 10TrainBranchBot) [01:00:28] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/933673 (owner: 10TrainBranchBot) [01:11:01] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:50] (03PS1) 10RLazarus: opentelemetry-collector: Vendor 0.61.0 as 0.61.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934026 (https://phabricator.wikimedia.org/T324117) [01:25:47] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Vendor 0.61.0 as 0.61.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934026 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [01:26:46] (03Merged) 10jenkins-bot: opentelemetry-collector: Vendor 0.61.0 as 0.61.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934026 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [01:32:55] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [01:33:02] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [01:33:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [02:00:11] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:19] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:21:57] PROBLEM - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4 at eqiad (db1145) taken more than 3 days ago: Most recent backup 2023-06-26 02:50:16 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:16:07] (03PS1) 10Marostegui: dbproxy1027: Remove comments [puppet] - 10https://gerrit.wikimedia.org/r/934037 (https://phabricator.wikimedia.org/T337812) [05:26:55] (03CR) 10Marostegui: [C: 03+2] dbproxy1027: Remove comments [puppet] - 10https://gerrit.wikimedia.org/r/934037 (https://phabricator.wikimedia.org/T337812) (owner: 
10Marostegui) [05:29:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) [05:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:33:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [05:48:15] PROBLEM - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: snapshot for s5 at eqiad (db1145) taken more than 3 days ago: Most recent backup 2023-06-26 05:19:47 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0600) [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0600). 
[06:32:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:45:07] (03PS1) 10Andrea Denisse: alert: Add the alert (icinga + alertmanager) hosts Bookworm node definitions [puppet] - 10https://gerrit.wikimedia.org/r/934245 (https://phabricator.wikimedia.org/T333615) [06:45:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [06:49:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [06:56:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [06:56:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [06:59:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [07:00:05] Amir1, apergos, and jnuche: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0700). [07:01:08] morning! there are no trainees signed up today to learn all the fine ins and outs of deployment [07:01:27] and that's a nice thing because there are no patches scheduled for deployment that they could watch or try their hand at [07:01:41] so... have a nice quiet Friday and a good weekend everybody, see you next time! 
[07:02:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [07:04:03] (03PS1) 10Muehlenhoff: Extend access for edtadros [puppet] - 10https://gerrit.wikimedia.org/r/934246 [07:05:48] apergos: you said nice and quiet [07:06:04] I think we should touch wood [07:08:43] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for edtadros [puppet] - 10https://gerrit.wikimedia.org/r/934246 (owner: 10Muehlenhoff) [07:08:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [07:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [07:08:49] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [07:10:52] RhinosF1: I leave that to you ;-) [07:15:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [07:16:55] PROBLEM - Host kubernetes2005 is DOWN: PING CRITICAL - Packet loss = 100% [07:17:31] PROBLEM - Host ncredir2001 is DOWN: PING CRITICAL - Packet loss = 100% [07:17:35] PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [07:19:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:20:31] RECOVERY - Host ncredir2001 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [07:20:33] RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 32.18 ms [07:20:35] RECOVERY - Host kubernetes2005 is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms [07:21:07] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:22:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [07:22:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [07:22:41] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [07:22:57] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [07:22:59] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [07:24:13] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir2001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 326147 seconds left:Certificate wikipedia.fi valid until 2023-09-06 10:30:20 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/Ncredir [07:24:29] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 272130 seconds left:Certificate *.wikimania.com valid until 2023-07-30 13:28:28 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir [07:24:31] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 534928 seconds left:Certificate *.wikispecies.net valid until 2023-07-30 11:29:53 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir [07:29:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:29] 
PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:29] (03PS1) 10Muehlenhoff: sre.ganeti.drain-vm: Sync DRBD after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/934248 (https://phabricator.wikimedia.org/T203964) [07:52:01] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.drain-vm: Sync DRBD after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/934248 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [07:53:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933976 (owner: 10Ottomata) [07:59:27] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Change s4 and s5 eqiad backup sources to db1150 and db1216 [puppet] - 10https://gerrit.wikimedia.org/r/933895 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [08:00:05] brennen and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0800). 
[08:01:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [08:01:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [08:08:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [08:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [08:10:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [08:12:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [08:14:59] PROBLEM - mysqld processes on db1216 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:15:30] ^ checking [08:16:12] it seems down [08:16:36] ah no [08:17:43] jynus: ^ some alerting might need to be adjusted as we are going from 2 backups to 3 backups I guess? 
[08:17:55] mmm, must be a race condition [08:18:00] it should have been disabled [08:18:07] ah ok :) [08:18:09] but I haven't touched it [08:18:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [08:18:17] I'd leave it to you then if it is "expected" [08:18:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [08:18:41] well, the alert is unexpected, but yeah probably my fault [08:19:26] yeah, it is a race condition because puppet hasn't run on icinga [08:19:53] and it should have 3 instances now [08:21:10] (03PS1) 10Hashar: rake_modules: mute Ruby 2.7 Pathname deprecation [puppet] - 10https://gerrit.wikimedia.org/r/934255 [08:21:28] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) [08:22:07] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [08:22:17] (03CR) 10Hashar: "Looks like that mute the deprecation warning when using Ruby 2.7 :]" [puppet] - 10https://gerrit.wikimedia.org/r/934255 (owner: 10Hashar) [08:22:25] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (10aborrero) 05Open→03Resolved I'm closing this ticket as I believe the reimage is completed and the server is mostly working, barring... 
[08:25:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [08:30:28] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [08:32:56] (03CR) 10Jbond: [C: 03+2] README.release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933971 (owner: 10Jbond) [08:33:07] (03CR) 10Effie Mouzeli: [C: 03+1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [08:33:15] (03CR) 10Effie Mouzeli: [C: 03+1] mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [08:33:24] (03CR) 10Effie Mouzeli: [C: 03+1] mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) (owner: 10Clément Goubert) [08:34:54] (03CR) 10Jbond: Enforce using a node regex without the wmnet tld (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [08:35:30] (03CR) 10Jbond: "other than the nits looks good to me, and thanks for adding the fix stanzas 😊" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [08:36:41] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) a:03SLyngshede-WMF [08:37:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [08:38:55] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "source" field on user objects - https://phabricator.wikimedia.org/T340717 (10SLyngshede-WMF) [08:39:27] (03PS1) 10Muehlenhoff: Failover URL downloaders 
[dns] - 10https://gerrit.wikimedia.org/r/934257 [08:41:06] (03CR) 10Btullis: [C: 03+2] Update the hadoop-worker-canary cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/933432 (https://phabricator.wikimedia.org/T338227) (owner: 10Btullis) [08:44:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [08:44:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [08:44:56] (03CR) 10David Caro: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:47:13] (03PS1) 10Btullis: datahub: Use new image and fix elasticsearch setup [deployment-charts] - 10https://gerrit.wikimedia.org/r/934259 (https://phabricator.wikimedia.org/T329514) [08:48:50] (03CR) 10Muehlenhoff: [C: 03+2] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/934257 (owner: 10Muehlenhoff) [08:49:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [08:51:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [08:52:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) (owner: 10Clément Goubert) [08:52:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [08:52:46] (03CR) 10Btullis: [C: 03+2] datahub: Use new image and fix elasticsearch setup [deployment-charts] - 10https://gerrit.wikimedia.org/r/934259 
(https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [08:53:33] (03Merged) 10jenkins-bot: datahub: Use new image and fix elasticsearch setup [deployment-charts] - 10https://gerrit.wikimedia.org/r/934259 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [08:53:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [08:58:33] (03CR) 10Majavah: [C: 04-1] "Few minor things inline." [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro) [08:58:48] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:59:12] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [08:59:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [09:00:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [09:02:36] (03PS1) 10Func: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) [09:03:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [09:06:22] (03PS6) 10Hashar: contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) [09:06:45] (03CR) 10CI reject: [V: 04-1] contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [09:06:56] (03PS1) 10Slyngshede: Credit logo artist. [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) [09:07:59] (03CR) 10Slyngshede: "Thank you for pointing out the missing license." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [09:08:07] (03PS7) 10Hashar: contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) [09:10:12] (03CR) 10Clément Goubert: noc: Pass ports without ferm-specific service constants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [09:11:00] (03PS1) 10Hashar: contint: rm git safe.directory for dev-images [puppet] - 10https://gerrit.wikimedia.org/r/934268 (https://phabricator.wikimedia.org/T335354) [09:15:34] (03PS1) 10Arturo Borrero Gonzalez: .gitignore: ignore nano swp file [alerts] - 10https://gerrit.wikimedia.org/r/934270 [09:19:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [09:21:45] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp [09:22:17] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:20] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [09:25:06] (03CR) 10Filippo Giunchedi: [C: 04-1] "These are real hw hosts, and I don't think we have the replacements so we'll have to upgrade in place/reimage" [puppet] - 10https://gerrit.wikimedia.org/r/934245 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [09:25:25] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.85 ms [09:26:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [09:26:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [09:27:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [09:28:54] (03PS1) 
10Arturo Borrero Gonzalez: team-wmcs: refresh openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) [09:30:51] !log installing libx11 security updates [09:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:43] (03CR) 10FNegri: "Could this go in a global .gitignore? I have '*.swp' in ~/.gitignore for example, and refer to it in ~/.gitconfig" [alerts] - 10https://gerrit.wikimedia.org/r/934270 (owner: 10Arturo Borrero Gonzalez) [09:34:35] (03CR) 10FNegri: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [09:34:54] (03CR) 10David Caro: .gitignore: ignore nano swp file (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/934270 (owner: 10Arturo Borrero Gonzalez) [09:36:07] !log restarting FPM on mw canaries to pick up libx11 updates [09:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [09:37:02] (03CR) 10David Caro: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [09:38:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] team-wmcs: refresh openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [09:38:09] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/934255 (owner: 10Hashar) [09:38:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:38:39] (03PS2) 10Arturo Borrero Gonzalez: 
team-wmcs: refresh openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) [09:38:54] (03CR) 10Awight: [C: 03+1] CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func) [09:39:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [09:43:21] (03PS1) 10Elukey: ml-services: update docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/934274 [09:43:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [09:43:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [09:43:46] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [09:44:25] (03PS1) 10Ilias Sarantopoulos: ml-services: ores legacy root redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/934275 [09:45:07] (03CR) 10Elukey: [C: 03+1] ml-services: ores legacy root redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/934275 (owner: 10Ilias Sarantopoulos) [09:45:13] (03Abandoned) 10Elukey: ml-services: update docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/934274 (owner: 10Elukey) [09:46:16] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: ores legacy root redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/934275 (owner: 10Ilias Sarantopoulos) [09:46:37] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp [09:47:16] (03Merged) 10jenkins-bot: 
ml-services: ores legacy root redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/934275 (owner: 10Ilias Sarantopoulos) [09:47:26] (03CR) 10Jbond: [C: 03+2] puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [09:50:36] (03CR) 10Hashar: "Happy I have managed to figure out something with my limited knowledge of ruby :]" [puppet] - 10https://gerrit.wikimedia.org/r/934255 (owner: 10Hashar) [09:50:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [09:52:36] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:52:49] (03PS1) 10Jbond: puppetserver: init private repo [puppet] - 10https://gerrit.wikimedia.org/r/934277 (https://phabricator.wikimedia.org/T340635) [09:53:29] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:53:36] (03CR) 10Jbond: [C: 03+2] puppetserver: init private repo [puppet] - 10https://gerrit.wikimedia.org/r/934277 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [09:53:53] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [09:57:02] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:57:18] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for wikikube-staging masters - jiji@cumin1001" [09:58:01] (03CR) 10Arturo Borrero Gonzalez: .gitignore: ignore nano swp file (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/934270 (owner: 10Arturo Borrero Gonzalez) [09:58:03] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:58:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for wikikube-staging masters - jiji@cumin1001" [09:58:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:58:13] (03Abandoned) 10Arturo Borrero Gonzalez: .gitignore: ignore nano swp file [alerts] - 10https://gerrit.wikimedia.org/r/934270 (owner: 10Arturo Borrero Gonzalez) [09:59:56] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [10:00:03] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [10:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1000). [10:00:05] claime: May I have your attention please! MediaWiki infrastucture (UTC mid-day). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1000) [10:00:42] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (10SLyngshede-WMF) [10:00:44] Let's go [10:00:46] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [10:01:11] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [10:01:15] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: User offboarding - https://phabricator.wikimedia.org/T335476 (10SLyngshede-WMF) [10:02:04] 10SRE, 10Bitu, 10Infrastructure-Foundations: Validate managers for permission approval - https://phabricator.wikimedia.org/T335484 (10SLyngshede-WMF) [10:02:08] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [10:02:14] !log Redirect www.mediawiki.org to mw-on-k8s - T337490 [10:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:19] T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 [10:02:22] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [10:02:48] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) p:05Triage→03Medium [10:03:04] (03CR) 10Jbond: [C: 03+2] puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/933639 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:03:15] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [10:03:31] !log Running puppet on cp-text trafficservers - T337490 [10:03:35] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:07] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10SLyngshede-WMF) [10:05:09] (03PS1) 10Jbond: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/934281 [10:05:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/934281 (owner: 10Jbond) [10:05:34] (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro) [10:06:00] !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1002.eqiad.wmnet [10:06:01] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [10:07:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [10:08:19] (03CR) 10Majavah: [C: 04-1] replica_cnf_api: refactor to use multiple backends (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro) [10:08:22] (03PS1) 10Jbond: puppetmaster: Correct config path [puppet] - 10https://gerrit.wikimedia.org/r/934282 (https://phabricator.wikimedia.org/T340635) [10:08:38] !log jiji@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:08:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) @akosiaris Server Ran in cpu stress test for in total 5 days with no errors prior to running firmware was updated. At this t... 
[10:08:48] !log jiji@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubestagemaster1002.eqiad.wmnet [10:08:57] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement a staging setup for the IDM - https://phabricator.wikimedia.org/T320795 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [10:08:59] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [10:09:28] (03CR) 10Jbond: [C: 03+2] puppetmaster: Correct config path [puppet] - 10https://gerrit.wikimedia.org/r/934282 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:09:45] !log www.mediawiki.org now hosted on mw-on-k8s - T337490 [10:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:49] !log puppetserver1001 added back to puppet-merge [10:09:49] T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 [10:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:53] And ExtensionDistributor works :p [10:10:05] claime: did you test after purging its cache? 
[10:10:05] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:10:06] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookwork - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF) [10:10:11] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:10:33] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:10:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:10:44] taavi: ExtensionDistributor's cache? [10:10:49] Give me a sec [10:11:13] https://phabricator.wikimedia.org/T340483#8965645 to purge the root Special:ExtensionDistributor page cache, ftr [10:11:27] yep, on it, ty <3 [10:11:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42122/console" [puppet] - 10https://gerrit.wikimedia.org/r/934282 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:11:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) a:05Jclark-ctr→03akosiaris >>! In T339340#8975564, @Jclark-ctr wrote: > @akosiaris Server Ran in cpu stress test for in tot... 
[10:12:04] 10SRE, 10Bitu, 10Infrastructure-Foundations: Update IDM servers to Bookworm - https://phabricator.wikimedia.org/T340722 (10SLyngshede-WMF) [10:12:20] taavi: confirmed working after cache purge [10:12:28] awesome, thank you [10:12:32] 10SRE, 10Bitu, 10Infrastructure-Foundations: Update IDM servers to Bookworm - https://phabricator.wikimedia.org/T340722 (10SLyngshede-WMF) p:05Triage→03Medium [10:12:33] wheee [10:13:51] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) [10:14:09] (03PS1) 10Jbond: Revert "puppetserver: add ssh known_hosts entries for new puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/934014 (https://phabricator.wikimedia.org/T340635) [10:14:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [10:15:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [10:15:16] (03PS2) 10Clément Goubert: mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) [10:15:22] (03CR) 10Jbond: [C: 03+2] Revert "puppetserver: add ssh known_hosts entries for new puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/934014 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:15:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [10:15:33] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 33.95 ms [10:15:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:15:43] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms [10:17:45] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [10:17:54] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [10:17:58] !log Redirect office.wikimedia.org to mw-on-k8s - T337490 [10:18:00] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [10:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:02] T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 [10:18:32] (03PS1) 10Hnowlan: trafficserver: add lua script for gateway routing [puppet] - 10https://gerrit.wikimedia.org/r/934015 (https://phabricator.wikimedia.org/T324678) [10:18:52] !log Running puppet on cp-text trafficservers - T337490 [10:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:27] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: add lua script for gateway routing [puppet] - 10https://gerrit.wikimedia.org/r/934015 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [10:19:43] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS buster [10:19:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host parse1002.eqiad.wmnet with OS buster [10:19:56] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Delete records created by accident - 
jiji@cumin1001" [10:20:00] claime: ^^ puppet might be disabled on some cp nodes.. [10:20:06] fabfur: how's the update going? [10:20:15] vgutierrez: ack [10:20:36] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Delete records created by accident - jiji@cumin1001" [10:20:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:20:58] failed on cp1079.eqiad.wmnet investigating if puppet is disabled [10:21:04] !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1002.eqiad.wmnet [10:21:05] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [10:22:05] vgutierrez: Didn't have errors on my first run, I'm running with your suggested query 'A:cp-text and P{P:trafficserver::backend}', 15 hosts batch [10:22:26] claime: ack [10:23:08] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001" [10:23:47] (03PS1) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) [10:23:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001" [10:23:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:23:54] !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster1002.eqiad.wmnet on all recursors [10:23:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1002.eqiad.wmnet on all recursors [10:24:23] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: 
created new VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001" [10:25:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001" [10:25:16] (03CR) 10CI reject: [V: 04-1] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:25:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [10:25:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster1002.eqiad.wmnet with OS bullseye [10:25:50] !log office.wikimedia.org now hosted on mw-on-k8s - T337490 [10:25:53] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster1002.eqiad.wmnet with OS bullseye [10:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:57] T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 [10:28:19] PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:28:37] (03PS2) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) [10:29:29] (03PS1) 10Btullis: datahub: Retain the setup jobs after execution to help with debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934286 (https://phabricator.wikimedia.org/T336286) [10:30:11] (03CR) 10CI reject: [V: 04-1] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:30:29] RECOVERY - Host 
kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 32.03 ms [10:30:33] PROBLEM - etcd service on kubestagetcd2001 is CRITICAL: CRITICAL - Expecting active but unit etcd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:31:21] (03CR) 10Btullis: [C: 03+2] datahub: Retain the setup jobs after execution to help with debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934286 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [10:31:29] (03PS3) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) [10:31:33] RECOVERY - etcd service on kubestagetcd2001 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:31:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [10:31:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [10:32:05] (03Merged) 10jenkins-bot: datahub: Retain the setup jobs after execution to help with debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934286 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [10:32:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:31] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [10:32:52] !log Redirect vrt-wiki.wikimedia.org to mw-on-k8s - T340549 [10:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:56] T340549: Migrate group1 to Kubernetes - 
https://phabricator.wikimedia.org/T340549 [10:32:57] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) (owner: 10Clément Goubert) [10:34:19] !log Running puppet on cp-text trafficservers - T340549 [10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 23 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42126/console" [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:35:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet [10:35:35] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [10:37:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:30] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [10:39:01] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:40:53] !log vrt-wiki.wikimedia.org now hosted on mw-on-k8s - T340549 [10:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:57] T340549: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 [10:41:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet [10:43:59] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via 
Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:44:36] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:44:48] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) 05In progress→03Resolved [10:46:31] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) [10:46:43] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:46:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) 05Open→03In progress [10:47:41] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet [10:48:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [10:48:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet [10:48:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet [10:49:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet [10:50:19] (03CR) 10Hnowlan: [C: 03+2] trafficserver: add lua script for gateway routing [puppet] - 10https://gerrit.wikimedia.org/r/934015 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [10:51:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:52:21] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:53:50] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol: introduce cloud-private support for memcached [puppet] - 10https://gerrit.wikimedia.org/r/934288 (https://phabricator.wikimedia.org/T340488) [10:56:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:57:25] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:57:50] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:58:06] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:58:22] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:59:45] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:59:46] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10SLyngshede-WMF) p:05Triage→03Low [10:59:51] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:59:56] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "update email" functionality - https://phabricator.wikimedia.org/T340637 (10SLyngshede-WMF) p:05Triage→03Medium [11:00:08] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "source" field on user objects - 
https://phabricator.wikimedia.org/T340717 (10SLyngshede-WMF) p:05Triage→03Low [11:00:17] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:00:36] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookwork - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF) p:05Triage→03Medium [11:01:01] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) 05Open→03In progress p:05Triage→03Medium [11:01:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:01:14] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppet-merge: add new puppetserveres to puppet merge - https://phabricator.wikimedia.org/T340635 (10jbond) 05Open→03Resolved a:03jbond puppet-merge and the private repo post-commit hooks are bot... 
[11:01:22] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "update email" functionality - https://phabricator.wikimedia.org/T340637 (10SLyngshede-WMF) 05Open→03In progress [11:01:24] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [11:02:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [11:02:36] !log installing Java 8 security updates [11:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:55] !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1001" [11:04:10] (03PS1) 10Jbond: pki::multiroot: update the client auth file to include new puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/934291 (https://phabricator.wikimedia.org/T340557) [11:05:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42127/console" [puppet] - 10https://gerrit.wikimedia.org/r/934291 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [11:06:14] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: pick up Java 8 sec updates - jmm@cumin2002 [11:08:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::multiroot: update the client auth file to include new puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/934291 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [11:08:50] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:09:16] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:09:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:10:06] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:10:56] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [11:11:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet [11:12:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet [11:13:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (3) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:18:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:19:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet [11:19:42] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] CommonSettings-labs: Remove unconditional $wgKartographerNearby flag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func) [11:20:53] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubestagemaster1002.eqiad.wmnet with OS bullseye [11:20:53] !log jiji@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubestagemaster1002.eqiad.wmnet [11:20:59] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubestagemaster1002.eqiad.wmnet with OS bullseye executed with errors: - kubestagemaster1002... 
[11:21:04] (03PS1) 10Hnowlan: Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/934020 [11:21:26] (03CR) 10Vgutierrez: [C: 03+1] Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/934020 (owner: 10Hnowlan) [11:21:33] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudservices2005-dev.wikimedia.org [11:21:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudservices2005-dev.wikimedia.org [11:21:49] PROBLEM - Host kubestagetcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:21:59] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:42] (03PS2) 10Func: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) [11:22:49] (03CR) 10Func: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func) [11:24:04] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/933675 [11:25:01] (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/934020 (owner: 10Hnowlan) [11:25:33] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [11:25:35] RECOVERY - Host kubestagetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.54 ms [11:28:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet [11:28:07] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/934302 [11:28:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti2016.codfw.wmnet [11:28:18] (03CR) 10Clément Goubert: [C: 03+1] Disable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [11:28:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet [11:31:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet [11:34:52] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/934288/42129/" [puppet] - 10https://gerrit.wikimedia.org/r/934288 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez) [11:39:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [11:41:09] (03PS1) 10Matthias Mullie: Only send 1 suggestion per section [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 [11:42:10] (03PS1) 10Jbond: pki::client: use the wmf-ca-certificats bundle for ca auth [puppet] - 10https://gerrit.wikimedia.org/r/934307 (https://phabricator.wikimedia.org/T340557) [11:44:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42130/console" [puppet] - 10https://gerrit.wikimedia.org/r/934307 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [11:44:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::client: use the wmf-ca-certificats bundle for ca auth [puppet] - 10https://gerrit.wikimedia.org/r/934307 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond) [11:46:53] (03PS1) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [11:47:17] (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 
(https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [11:47:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet [11:47:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet [11:47:57] (03PS1) 10Arturo Borrero Gonzalez: profile::memcached::instance: allow to specify srange [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488) [11:48:19] (03PS2) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [11:48:32] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) 05In progress→03Resolved a:03jbond This has now been corrected systems on the new puppet infrastruct... [11:48:36] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:48:45] (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [11:49:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [11:50:42] (03PS3) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [11:51:17] (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [11:52:02] (03PS4) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 
(https://phabricator.wikimedia.org/T325756) [11:52:35] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC is a NOOP for core resources https://puppet-compiler.wmflabs.org/output/934310/42131/" [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez) [11:52:47] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/934302 (owner: 10Muehlenhoff) [11:52:49] (03PS1) 10QChris: Add .gitreview [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934311 [11:52:51] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934311 (owner: 10QChris) [11:53:32] (03PS1) 10Jbond: puppetserver: move default db back to 443 [puppet] - 10https://gerrit.wikimedia.org/r/934312 [11:53:47] (03CR) 10Jbond: [C: 03+2] puppetserver: move default db back to 443 [puppet] - 10https://gerrit.wikimedia.org/r/934312 (owner: 10Jbond) [11:53:58] (03PS1) 10Hnowlan: rest-gateway: tighten proton group capture criteria [deployment-charts] - 10https://gerrit.wikimedia.org/r/934313 (https://phabricator.wikimedia.org/T334611) [11:54:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [11:54:07] (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [11:57:31] (03CR) 10CI reject: [V: 04-1] Only send 1 suggestion per section [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie) [11:59:24] (03PS5) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [11:59:56] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet::agent: set manage_puppet_ca_file false [puppet] - 10https://gerrit.wikimedia.org/r/933968 (owner: 10Jbond) [12:00:06] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [12:00:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [12:01:33] (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [12:03:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet [12:03:59] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [12:04:19] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [12:08:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [12:10:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:11:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove urldownloader role from old buster servers [puppet] - 10https://gerrit.wikimedia.org/r/933904 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [12:13:26] (03PS1) 10Jbond: puppedb1003: migrate to new infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811) [12:15:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:15:40] (03PS2) 10Jbond: puppedb1003: migrate to new infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811) [12:15:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [12:15:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet [12:16:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet [12:16:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42133/console" [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:17:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppedb1003: migrate to new infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:18:53] 10SRE, 10ops-eqiad: Decom cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T338459 (10Jclark-ctr) 05Open→03Resolved Disconnected and removed from Rack updated netbox [12:19:21] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Jclark-ctr) [12:21:13] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (10Jclark-ctr) [12:21:20] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (10Jclark-ctr) 05Open→03Resolved [12:22:01] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 
10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Jclark-ctr) 05Open→03Resolved [12:22:06] 10SRE, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Jclark-ctr) [12:23:36] (03CR) 10Matthias Mullie: "recheck" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie) [12:24:37] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10Jclark-ctr) [12:24:49] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10Jclark-ctr) 05Open→03Resolved [12:28:11] (03PS3) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [12:28:13] (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:31:36] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:32:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [12:33:38] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Creat cookbook to migrate serveres from the puppetmnasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond) [12:34:39] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: pick up Java 8 sec updates - jmm@cumin2002 
[12:34:43] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:36:13] (03PS2) 10D3r1ck01: proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) [12:38:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [12:39:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet [12:39:21] (03PS3) 10D3r1ck01: proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) [12:39:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) @Dwisehaupt i verified this server they are connected to port 24 on both switches [12:39:47] jouncebot: nowandnext [12:39:48] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [12:39:48] In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300) [12:39:48] In 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300) [12:40:04] (03PS1) 10Jbond: puppetdb::bookworm: pouplate hiera config [puppet] - 10https://gerrit.wikimedia.org/r/934323 (https://phabricator.wikimedia.org/T338811) [12:40:24] (03CR) 10Jgiannelos: [C: 03+1] proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) (owner: 10D3r1ck01) [12:40:33] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.59 ms [12:40:33] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms [12:41:33] (03CR) 10Jbond: 
[C: 03+2] puppetdb::bookworm: pouplate hiera config [puppet] - 10https://gerrit.wikimedia.org/r/934323 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:42:10] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [12:42:23] (03PS1) 10Elukey: ml-services: update ores-legacy's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934324 [12:42:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [12:43:13] (03PS1) 10Jbond: puppetdb: on reflection lets just use the puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/934325 (https://phabricator.wikimedia.org/T338811) [12:43:16] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [12:43:39] (03CR) 10Elukey: [C: 03+2] ml-services: update ores-legacy's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934324 (owner: 10Elukey) [12:44:00] (03CR) 10Jbond: [C: 03+2] puppetdb: on reflection lets just use the puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/934325 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:45:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet [12:46:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:46:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:47:58] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:48:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[12:48:06] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [12:49:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [12:50:43] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: pick up Java 8 sec updates - jmm@cumin2002 [12:53:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [12:53:45] (03CR) 10Jforrester: "Aha, thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func) [12:54:15] (03CR) 10Jforrester: [C: 03+2] CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func) [12:54:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42134/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:55:07] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1001" [12:55:07] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS buster [12:55:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host parse1002.eqiad.wmnet with OS buster comp... 
[12:56:09] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=parse1002.eqiad.wmnet [12:57:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [12:57:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet [12:58:05] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=parse1002.eqiad.wmnet [12:59:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) 05Open→03Resolved I 've just re-imaged the server and set it back in conftool as active. Made sure to scap pull too. I am g... [13:00:05] xSavitar: May I have your attention please! Mobileapps/RESTBase/Wikifeeds. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300). [13:00:05] Func, matthiasmullie, and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:07] o/ [13:00:10] !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [13:00:12] o/ [13:00:13] o/ [13:00:27] !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:00:39] o/ [13:00:50] (03Merged) 10jenkins-bot: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func) [13:00:57] o/ [13:01:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet [13:01:09] xSavitar: hmm I don't see any patches from you listed? [13:01:12] (03CR) 10D3r1ck01: [C: 03+2] proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) (owner: 10D3r1ck01) [13:01:18] James_F: are you deploying already? [13:01:42] xSavitar: oh right you have a separate window at the same time. sorry for the confusion [13:01:55] taavi there is a RB deploy window going on currently - https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_June_29 [13:02:10] taavi, no worries! [13:02:13] taavi: I just pushed out a beta-cluster-only patch, no deploy. [13:02:25] (03Merged) 10jenkins-bot: proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) (owner: 10D3r1ck01) [13:02:27] * duesen wibbles [13:02:28] aha [13:02:39] (And CI took 10 minutes not 30 seconds.) [13:02:47] :/ [13:03:03] duesen: iirc your patches need some monitoring afterwards, so ok if I let you self-deploy after the other patches are done? 
[13:03:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:03:32] (03CR) 10Majavah: [C: 03+2] "deploying" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie) [13:04:05] !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [13:04:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) (owner: 10Func) [13:04:47] (03PS2) 10Arturo Borrero Gonzalez: profile::memcached::instance: allow to specify srange [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488) [13:04:57] (03Merged) 10jenkins-bot: Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) (owner: 10Func) [13:05:16] taavi: ok. Ping me. [13:05:30] will do! 
[13:05:41] !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:05:43] !log taavi@deploy1002 Started scap: Backport for [[gerrit:933671|Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace (T340697)]] [13:05:48] T340697: Remove $wgNamespacesWithSubpages overrides for the MediaWiki namespace in production - https://phabricator.wikimedia.org/T340697 [13:06:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [13:06:05] !log derick@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [13:07:15] !log derick@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:07:31] !log taavi@deploy1002 taavi and func: Backport for [[gerrit:933671|Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace (T340697)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:07:45] testing... [13:07:55] !log derick@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:08:25] PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:43] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/934310/42135/" [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez) [13:08:51] taavi: looks good [13:09:07] Func: thanks, the logs look clean too so syncing [13:09:30] (03Abandoned) 10Arturo Borrero Gonzalez: cloudcontrol: introduce cloud-private support for memcached [puppet] - 10https://gerrit.wikimedia.org/r/934288 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez) [13:10:15] !log derick@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:10:46] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:11:07] 
RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [13:12:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [13:12:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet [13:13:02] (03PS1) 10Fabfur: varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) [13:14:48] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:933671|Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace (T340697)]] (duration: 09m 05s) [13:14:53] T340697: Remove $wgNamespacesWithSubpages overrides for the MediaWiki namespace in production - https://phabricator.wikimedia.org/T340697 [13:14:57] Func: and done! [13:15:04] thanks [13:15:05] matthiasmullie: yours is up next, just waiting for the CI on that [13:15:19] taavi: thanks; mine can skip mwdebug testing - there's nothing to test, it only affects a maint script [13:15:37] ack [13:16:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet [13:16:50] (03CR) 10Vgutierrez: "you should remove the associated 13-tls-redirect.vtc test as well" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:18:37] taavi: hum, I noticed something went wrong, subpage counts become [INVALID] when I am not on mwdebug hosts: https://zh.wikibooks.org/w/index.php?title=MediaWiki:Anontalkpagetext&action=info&uselang=en [13:19:21] (03Merged) 10jenkins-bot: Only send 1 suggestion per section [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie) [13:19:59] Func: where do you see that? 
even when logged in (so edge caches should not be a problem) it says 'Number of subpages of this page: 6 (0 redirects; 6 non-redirects)' [13:20:04] (03PS2) 10Fabfur: varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) [13:20:19] (03CR) 10Fabfur: varnish: Remove http/https redirection (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:20:28] !log taavi@deploy1002 Started scap: Backport for [[gerrit:934021|Only send 1 suggestion per section]] [13:20:49] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:21:07] (03PS1) 10Reedy: Fix trying to get a PageRecord for a non-existent page [extensions/VisualEditor] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934023 (https://phabricator.wikimedia.org/T340568) [13:21:44] (03CR) 10Elukey: [V: 03+1 C: 04-1] "Precautionary -1 since the script seems to lead to different results in my tests:" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:22:06] !log taavi@deploy1002 mlitn and taavi: Backport for [[gerrit:934021|Only send 1 suggestion per section]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:22:13] syncing [13:22:17] taavi: ^^ want me to +2 it as it's gonna take ages to merge [13:23:01] (03PS1) 10Jgiannelos: mobileapps: bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934330 [13:23:05] duesen: do you see a reasonable risk of having to revert your patch? [13:23:48] your turn is after the currently-syncing patch, btw [13:24:08] taavi: unlikely, we already tested the same change for enwiki+dewiki+frwiki. 
If we do have to revert, we will probably not find out for a couple of hours [13:24:22] perfect, thanks [13:24:23] taavi: weird, I saw `Number of subpages of this page [INVALID] ([INVALID] redirects; [INVALID] non-redirects)` how can I know which server is serving my request? [13:24:29] Reedy: go for it [13:24:35] ok [13:24:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [13:24:40] (03CR) 10Reedy: [C: 03+2] Fix trying to get a PageRecord for a non-existent page [extensions/VisualEditor] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934023 (https://phabricator.wikimedia.org/T340568) (owner: 10Reedy) [13:24:58] Func: `X-Cache` and `Server` response headers [13:25:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:25:36] (03PS2) 10Daniel Kinzler: Disable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) [13:25:40] (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:25:49] waiting for it to merge [13:26:01] claime, effie: deploying now [13:26:15] ack [13:26:28] my deploy is still running.. 
[13:26:58] (03CR) 10AikoChou: [C: 03+2] changeprop: update page_change_kind for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:27:32] (03Merged) 10jenkins-bot: Disable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler) [13:27:37] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:934021|Only send 1 suggestion per section]] (duration: 07m 08s) [13:27:41] aaand I'm done. duesen: floor is yours [13:27:59] (03Merged) 10jenkins-bot: changeprop: update page_change_kind for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:28:02] !log daniel@deploy1002 Started scap: Backport for [[gerrit:933453|Disable PC writes for parsoid endpoints (T339867)]] [13:28:06] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [13:28:15] taavi: thanks! [13:28:28] taavi: sorry, I misread your message earlier. I thought you had told me to go ahead. But you were talking to Reedy... [13:29:32] !log daniel@deploy1002 daniel: Backport for [[gerrit:933453|Disable PC writes for parsoid endpoints (T339867)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:30:03] yeah, no worries. scap's locking system would have stopped you from syncing at the same time I think [13:30:42] taavi: Did you see something like `Invalid parameter for message "{msgkey}": {param}` or `Invalid list type for message` in the `Bug58676` log channel? I searched the core and found this may be relevant.
[13:30:59] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42138/console" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:31:07] (03CR) 10Vgutierrez: [C: 04-1] "varnish upload tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:31:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [13:31:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet [13:31:54] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934330 (owner: 10Jgiannelos) [13:32:06] (03PS6) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [13:32:13] !log failover ganeti master in codfw to ganeti2020 [13:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:36] (03Merged) 10jenkins-bot: mobileapps: bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934330 (owner: 10Jgiannelos) [13:33:20] Func: I didn't see those when testing, but after the merge there have been ~100 on zhwikibooks [13:33:54] yeah, mwdebug hosts served me correctly [13:35:03] (03PS1) 10Btullis: Bump the version of the datahub image in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/934332 (https://phabricator.wikimedia.org/T329514) [13:35:10] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:933453|Disable PC writes for parsoid endpoints (T339867)]] (duration: 07m 07s) [13:35:13] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:35:15] (03CR) 10JHathaway: Enforce using a node regex 
without the wmnet tld (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [13:35:17] T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 [13:35:49] PROBLEM - ganeti-wconfd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:35:51] (03CR) 10Ottomata: [C: 03+1] page-content-change: fix error sink stream name. [deployment-charts] - 10https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [13:35:56] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:36:17] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [13:36:25] (03CR) 10Gmodena: [C: 03+2] page-content-change: fix error sink stream name. [deployment-charts] - 10https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [13:36:33] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:36:55] (03CR) 10Btullis: [C: 03+2] Bump the version of the datahub image in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/934332 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:37:24] (03Merged) 10jenkins-bot: page-content-change: fix error sink stream name. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [13:37:46] (03Merged) 10jenkins-bot: Bump the version of the datahub image in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/934332 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:38:12] !log installing bind9 security updates (tools/libs only) [13:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:17] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:38:21] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:39:20] claime, effie: ok, parsoid pc cache writes are now disabled for parsoid endpoints, we are fully relying on the background jobs to do the parsing. [13:39:35] Let's see how this goes for a couple of days [13:39:57] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:39:58] duesen: I'm preparing the patches and reimage to add more jobrunners to be prudent [13:40:05] duesen: so what are the parsoid* servers now left with? [13:40:23] claime: thank you! [13:40:51] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:40:52] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:40:55] Reedy: when your patch merges, will you deploy it or do you want me to do it? [13:40:59] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:41:10] effie: at the moment, they still parse, they just don't cache. Eventually, we will turn off pre-generation in restbase, at that point, the parsoid cluster will no longer be hit [13:41:32] The current experiment just makes sure that when we do that, the jobrunners don't fall over.
[13:41:47] duesen: we need to decide to which channel we will sync :) [13:42:01] we should move to -serviceops tbh [13:42:10] sure sure [13:42:41] joined. [13:42:44] taavi: Any chance you could do it please? I've gotta go out for an appointment soon... MatmaRex should be around to test it [13:42:56] sure, not a problem [13:43:01] (03CR) 10Btullis: [C: 03+2] Upgrade the analytics airflow instance to 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933087 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [13:43:03] (hi) [13:43:09] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:43:17] cheers [13:44:02] taavi: i'm done [13:44:05] thx [13:44:40] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:44:50] taavi: It seems the InfoAction::pageCounts() method's cache didn't take the config change into account; it is now accessing unset array keys. Should we revert my patch for now? [13:44:54] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:44:58] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:45:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10JArguello-WMF) [13:45:44] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10JArguello-WMF) [13:46:04] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10JArguello-WMF) [13:46:33] (03PS7) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [13:46:50] Func: that sounds like
the safest option to me [13:46:58] (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [13:47:14] (03Merged) 10jenkins-bot: Fix trying to get a PageRecord for a non-existent page [extensions/VisualEditor] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934023 (https://phabricator.wikimedia.org/T340568) (owner: 10Reedy) [13:47:33] (03PS1) 10Majavah: Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934024 [13:47:47] (03CR) 10Majavah: [C: 03+2] Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934024 (owner: 10Majavah) [13:47:58] aand the VE change merged. I'll sync both at the same time [13:48:56] (03Merged) 10jenkins-bot: Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934024 (owner: 10Majavah) [13:49:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934024 (owner: 10Majavah) [13:49:27] !log taavi@deploy1002 Started scap: Backport for [[gerrit:934023|Fix trying to get a PageRecord for a non-existent page (T340568)]], [[gerrit:934024|Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"]] [13:49:32] T340568: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent an existing page: Property Simplify - https://phabricator.wikimedia.org/T340568 [13:50:20] (03PS8) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) [13:50:58] !log taavi@deploy1002 taavi and reedy: Backport for [[gerrit:934023|Fix trying to get a PageRecord for a 
non-existent page (T340568)]], [[gerrit:934024|Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:51:10] MatmaRex: please test [13:51:19] and Func, if you can test the revert [13:51:59] looking [13:52:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:53:03] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [13:53:35] tavvi: confirmed restored to no subpage count cell [13:54:14] oh sorry [13:55:27] taavi: somehow i can't reproduce the bug this is supposed to fix (when not on mwdebug), but i can say that at least nothing gets worse [13:55:38] ok [13:55:57] RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1216) taken on 2023-06-29 12:10:16 (571 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:56:31] oh, never mind, i can reproduce it [13:56:43] and the patch does fix it [13:56:44] ^ s4 should recover soon too, as I rerun them [13:56:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [13:57:08] you have to be creating a new article, add an external URL, and get a CAPTCHA. 
previously you'd get an error instead of the CAPTCHA [14:00:00] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:00:32] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:00:59] (03PS1) 10JMeybohm: envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934335 (https://phabricator.wikimedia.org/T300324) [14:01:28] (03PS1) 10Elukey: role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/934336 [14:01:29] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:934023|Fix trying to get a PageRecord for a non-existent page (T340568)]], [[gerrit:934024|Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"]] (duration: 12m 01s) [14:01:33] T340568: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent an existing page: Property Simplify - https://phabricator.wikimedia.org/T340568 [14:01:46] ok done [14:02:02] !log UTC afternoon backports done [14:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:12] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:02:49] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:03:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [14:03:37] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:04:03] (03CR) 10Ilias Sarantopoulos: [C: 03+1] role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/934336 
(owner: 10Elukey) [14:04:06] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:04:09] !log imported envoyproxy 1.26.1 to component/envoy-future in buster-wikimedia - T300324 [14:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:13] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [14:04:36] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/934337 (https://phabricator.wikimedia.org/T329366) (owner: 10Clément Goubert) [14:07:06] !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2002.codfw.wmnet [14:07:07] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:07] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [14:10:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [14:10:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [14:10:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:10:50] !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster2002.codfw.wmnet on all recursors [14:10:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2002.codfw.wmnet on all recursors [14:11:18] !log jiji@cumin1001 START - Cookbook 
sre.dns.netbox [14:12:26] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: tighten proton group capture criteria [deployment-charts] - 10https://gerrit.wikimedia.org/r/934313 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:12:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:13:05] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:13:15] (03Merged) 10jenkins-bot: rest-gateway: tighten proton group capture criteria [deployment-charts] - 10https://gerrit.wikimedia.org/r/934313 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:14:17] (03CR) 10Vgutierrez: [C: 03+1] role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/934336 (owner: 10Elukey) [14:16:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [14:16:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [14:16:53] (03CR) 10Elukey: [C: 03+2] role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/934336 (owner: 10Elukey) [14:17:27] (03PS1) 10Jcrespo: mariadb: Reenable db1216 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934338 (https://phabricator.wikimedia.org/T340610) [14:17:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:15] (03Abandoned) 10MVernon: swift: roll object_expirer into cluster_info [puppet] - 10https://gerrit.wikimedia.org/r/933478 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [14:18:29] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [14:18:54] (03CR) 10MVernon: "[this is the version puppet office hours thought was better, but please do review :)]" [puppet] - 10https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [14:19:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [14:19:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:15] !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster2002.codfw.wmnet on all recursors [14:19:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2002.codfw.wmnet on all recursors [14:19:25] !log jiji@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubestagemaster2002.codfw.wmnet [14:20:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool: Add more servers to jobrunner cluster [puppet] - 10https://gerrit.wikimedia.org/r/934337 (https://phabricator.wikimedia.org/T329366) (owner: 10Clément Goubert) [14:20:55] !log Depooling mw148[2-6].eqiad.wmnet from api_appserver to 
move them to jobrunners - T329366 [14:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:59] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [14:21:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: pick up Java 8 sec updates - jmm@cumin2002 [14:21:40] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw148[2-6].eqiad.wmnet [14:22:06] (03PS1) 10Btullis: Disable the datahub upgrade job [deployment-charts] - 10https://gerrit.wikimedia.org/r/934339 (https://phabricator.wikimedia.org/T329514) [14:22:22] (03CR) 10Clément Goubert: [C: 03+2] conftool: Add more servers to jobrunner cluster [puppet] - 10https://gerrit.wikimedia.org/r/934337 (https://phabricator.wikimedia.org/T329366) (owner: 10Clément Goubert) [14:25:36] (03CR) 10Btullis: [C: 03+2] Disable the datahub upgrade job [deployment-charts] - 10https://gerrit.wikimedia.org/r/934339 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:26:21] (03Merged) 10jenkins-bot: Disable the datahub upgrade job [deployment-charts] - 10https://gerrit.wikimedia.org/r/934339 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:28:15] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:30:13] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:31:14] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1482.eqiad.wmnet with OS buster [14:31:19] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1483.eqiad.wmnet with OS buster [14:31:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1484.eqiad.wmnet with OS buster [14:31:30] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1484.eqiad.wmnet with OS buster [14:31:33] !log cgoubert@cumin1001 START - Cookbook 
sre.hosts.reimage for host mw1485.eqiad.wmnet with OS buster [14:31:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1486.eqiad.wmnet with OS buster [14:31:47] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1484.eqiad.wmnet with OS buster [14:37:54] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934335 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:40:58] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable db1216 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934338 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [14:41:39] (03PS1) 10JMeybohm: envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934340 (https://phabricator.wikimedia.org/T300324) [14:41:58] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934340 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:44:03] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage [14:44:10] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage [14:44:24] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage [14:44:26] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage [14:44:31] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage [14:45:08] (03PS1) 10Jcrespo: mariadb: Reenable db1145 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934341 
(https://phabricator.wikimedia.org/T340610) [14:45:25] (03CR) 10Jcrespo: [C: 04-1] "Not ready." [puppet] - 10https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [14:45:31] (03CR) 10CI reject: [V: 04-1] mariadb: Reenable db1145 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [14:46:43] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage [14:46:59] !log published image docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324 [14:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:03] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [14:49:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage [14:51:25] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage [14:51:36] RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1150) taken on 2023-06-29 10:52:47 (1851 GiB, -2.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:52:16] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:52:35] 10SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm) [14:52:37] 10SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340753 (10Urbanecm) 
[14:53:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage [14:54:26] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage [14:55:27] 10SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm) [14:55:29] 10SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340753 (10Urbanecm) [14:58:53] (03CR) 10Dzahn: [C: 03+2] logspam-watch: Add a fox emoji [puppet] - 10https://gerrit.wikimedia.org/r/921050 (owner: 10Samtar) [14:59:56] (03CR) 10Dzahn: "Eoghan isn't here at the moment to deploy this." [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [15:00:05] Daimona: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Create new tables for the CampaignEvents extension deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1500). [15:04:07] (03CR) 10Dzahn: "I would like to keep using just actual host names and not be forced to use regexes even for a single host." [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:05:09] (03CR) 10Dzahn: "wait, are you saying the site.pp is now used in cloud??" 
[puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:06:26] !log Creating new DB tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T340000 [15:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:31] T340000: Create the tables for participant questions in prod - https://phabricator.wikimedia.org/T340000 [15:07:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:09:06] (03PS1) 10Urbanecm: admin: Add SSH key for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/934366 (https://phabricator.wikimedia.org/T340752) [15:11:15] (03CR) 10Dzahn: "let's merge then?" [puppet] - 10https://gerrit.wikimedia.org/r/902738 (owner: 10Meno25) [15:11:35] (03CR) 10Dzahn: [C: 03+2] miscweb: lower certificate_expiry_days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/932181 (https://phabricator.wikimedia.org/T339862) (owner: 10Jelto) [15:13:06] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T337591 is resolved. so is this not needed anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/923631 (https://phabricator.wikimedia.org/T337591) (owner: 10RhinosF1) [15:13:09] 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm) Key verification: Uploaded the key as `bast3006.wikimedia.org:T340752_key.txt`. Happy to also confirm in a different way if needed. 
[15:14:32] (03Abandoned) 10RhinosF1: admin: rename neilpquinn-wmf to nshahquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/923631 (https://phabricator.wikimedia.org/T337591) (owner: 10RhinosF1) [15:14:54] mutante: ty for reminder [15:15:03] yw [15:16:14] !log installing Java 8 security updates on sessionstore/codfw [15:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:28] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1482.eqiad.wmnet with OS buster [15:16:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm) [15:18:02] (03CR) 10Dzahn: "This change is open since 2020 and has not received a single comment from reviewers. what can we do to fix this problem?" [puppet] - 10https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: 10Dereckson) [15:19:04] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1483.eqiad.wmnet with OS buster [15:19:15] (03CR) 10Dzahn: "is this still needed? What was it for btw?" [puppet] - 10https://gerrit.wikimedia.org/r/895722 (owner: 10Jbond) [15:19:39] (03CR) 10Ahmon Dancy: [C: 03+1] contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [15:19:51] (03CR) 10Dzahn: "How can we get cluster redirects reviewed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [15:21:19] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1485.eqiad.wmnet with OS buster [15:23:43] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff) [15:23:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 4 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10MoritzMuehlenhoff) [15:23:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1484.eqiad.wmnet with OS buster [15:24:14] 10SRE, 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, and 2 others: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10MoritzMuehlenhoff) 05Open→03Resolved This has been implemented with the new sre.ganeti.drain-node cookbook, which I've use... [15:24:59] !log cgoubert@cumin1001 conftool action : set/weight=10; selector: name=mw148[2-6].eqiad.wmnet [15:25:08] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: name=mw148[2-6].eqiad.wmnet [15:26:33] (03PS4) 10Dzahn: durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:27:19] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1001" [15:27:24] (03CR) 10Dzahn: [C: 03+1] "it's been a while. and only comments. good to go!?" 
[puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:28:48] (03PS2) 10Dzahn: wikidough: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:29:05] (03CR) 10Dzahn: [C: 03+1] "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:29:09] (03CR) 10Ssingh: "This is on me for delaying it, thanks for the reminder. I will merge." [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:29:43] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw148[2-6].eqiad.wmnet,cluster=jobrunner [15:30:59] !log Pooled mw148[2-6].eqiad.wmnet as jobrunners - T329366 [15:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:05] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [15:32:39] (03CR) 10Dzahn: "needs manual rebase." [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [15:33:44] (03PS2) 10Dzahn: extdist: Remove pre-bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [15:34:32] (03CR) 10Dzahn: "rebased..
which shows most of this was already done" [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [15:34:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1482.eqiad.wmnet [15:34:42] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1482.eqiad.wmnet [15:34:43] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1483.eqiad.wmnet [15:34:43] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1483.eqiad.wmnet [15:34:45] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1484.eqiad.wmnet [15:34:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1484.eqiad.wmnet [15:34:47] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1485.eqiad.wmnet [15:34:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1485.eqiad.wmnet [15:34:48] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1486.eqiad.wmnet [15:34:49] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1486.eqiad.wmnet [15:35:03] (03PS1) 10Elukey: ml-services: update the ores-legacy Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934370 [15:35:31] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/842884/42144/" [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm) [15:35:50] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/863294/42145/durum1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:35:53] (03CR) 10Ssingh: [C: 03+2] durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:37:11] (03CR) 10Ssingh: [C: 03+2] "Thanks moritz for 
the patch and dzahn for the reminder!" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:37:19] (03CR) 10Dzahn: [C: 04-1] "looks like it can be abandoned. old and there is only node /^cloudnet100[5-6]\.eqiad\.wmnet$/ now" [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [15:37:40] (03PS2) 10Effie Mouzeli: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm) [15:39:01] (03CR) 10Elukey: [C: 03+2] ml-services: update the ores-legacy Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934370 (owner: 10Elukey) [15:39:26] (03PS1) 10Muehlenhoff: profile::java: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/934371 [15:39:28] (03CR) 10Arturo Borrero Gonzalez: cloudnet1003: decom host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [15:39:34] (03Abandoned) 10Arturo Borrero Gonzalez: cloudnet1003: decom host [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [15:41:58] (03CR) 10Dzahn: [C: 03+2] etherpad: Add a link to CoC in the defaultPadText [puppet] - 10https://gerrit.wikimedia.org/r/827512 (https://phabricator.wikimedia.org/T136744) (owner: 10Alexandros Kosiaris) [15:42:47] (03CR) 10Dzahn: "thanks! 
just going through old open puppet patches to clean up a bit" [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [15:43:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, note inline" [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm) [15:46:12] (03CR) 10Dzahn: "@Legoktm How do you feel about this in 2023" [puppet] - 10https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) (owner: 10Thcipriani) [15:46:51] (03CR) 10Dzahn: "are we still doing this or is it meanwhile "move to alertmanager" anyways?" [puppet] - 10https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:47:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:48:23] (03CR) 10Dzahn: [C: 03+1] profile::java: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/934371 (owner: 10Muehlenhoff) [15:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:48:56] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:49:02] (03CR) 10Dzahn: [C: 03+1] codesearch: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826864 (owner: 10Muehlenhoff) [15:49:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:49:16] (03CR) 10Filippo Giunchedi: "Can we do prometheus checks please? 
🙃" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [15:49:42] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [15:49:45] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [15:50:04] (03PS3) 10Effie Mouzeli: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm) [15:54:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm) [15:55:05] (03CR) 10Effie Mouzeli: [C: 03+2] Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm) [15:59:31] !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2002.codfw.wmnet [15:59:33] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [15:59:36] dancy: 👋 the change lgtm, are you going to want a puppet run anywhere? [16:00:04] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:06] yes.. lemme find out which hosts... [16:00:13] 👍 [16:00:43] so.. the WMCS gitlab-runners... [16:01:37] runner-1029.gitlab-runners.eqiad1.wikimedia.cloud and friends. 
[16:02:10] looks like that's runner-1021 through runner-1030 [16:03:29] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [16:03:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [16:03:47] !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:04:02] !log klausman@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:04:21] !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:04:41] !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:10:03] !log systemctl restart bird.service on doh2002 [16:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [16:12:55] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [16:13:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [16:14:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10JArguello-WMF) [16:15:03] 2620:0:860:2:208:80:153:38 is doh2002, restarted bird [16:15:30] (recovered) [16:15:39] (03Abandoned) 10Dzahn: reload apache after config 
change [puppet] - 10https://gerrit.wikimedia.org/r/263745 (owner: 10JanZerebecki) [16:16:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [16:16:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:16:11] !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster2002.codfw.wmnet on all recursors [16:16:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2002.codfw.wmnet on all recursors [16:16:41] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [16:17:01] (03CR) 10Bking: [C: 03+2] query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [16:17:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001" [16:18:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster2002.codfw.wmnet with OS bullseye [16:18:19] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye [16:18:41] !log releases1003 - re-enabling puppet after recent webserver debugging [16:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 
(10jbond) [16:20:17] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) 05Resolved→03Open Unfortunately i closed this too soon. things work fine on the puppetserver but now... [16:20:48] (03PS1) 10Ssingh: Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 [16:21:49] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:22:14] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:23:32] (03CR) 10Dzahn: [C: 03+1] "thanks! there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/863295/1" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:27:01] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:27:54] (03CR) 10Ssingh: "Adding CI for this repo in I8a37327241230b3af4c19cefd3900fee52c5dabf" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [16:28:26] (03PS1) 10Ilias Sarantopoulos: ml-services: remove nsfw model [deployment-charts] - 10https://gerrit.wikimedia.org/r/934380 (https://phabricator.wikimedia.org/T331416) [16:30:31] (03CR) 10Dzahn: "Sorry, you should mostly ignore my comment then. I have no context about dev environment or how that works. Is there like a master / track" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:32:31] (03CR) 10Ssingh: "Thanks for the reminder! This will require restarting dnsdist as the conf file will change, so I will do that on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:32:34] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:32:36] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/863295/42148/doh1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [20:13:30] (03CR) 10BryanDavis: [C: 03+1] apt: Ensure sources.list is updated before apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [20:13:30] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [20:15:20] (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/T340788" [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [20:18:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [20:22:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:23:43] 10SRE, 10Add-Link, 10Growth-Team, 10GrowthExperiments-NewcomerTasks, 10serviceops: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Marostegui) Thank you!! 
[20:24:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Dzahn) @Superpes15 @SLyngshede-WMF @MatthewVernon If i read the ticket right then access to NDA is done and access to deployment is postpone... [20:28:58] 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) [20:35:01] (03PS4) 10JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) [20:37:19] (03PS3) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 [20:37:41] 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) Yes, the overlap in people is small, and at first it does seem to make sense to merge them. But the groups have prett... 
[20:37:46] (03CR) 10CI reject: [V: 04-1] Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [20:39:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Dzahn) a:03Arnoldokoth +1 - uploaded key owned by urbanecm, matches gerrit [20:39:34] (03CR) 10Eevans: [C: 03+1] swift: roll object_expirer into cluster_info (remove profile) [puppet] - 10https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [20:39:48] (03PS4) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 [20:40:38] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) [20:40:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Dzahn) 05Open→03In progress a:03Arnoldokoth [20:41:39] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) 05Open→03In progress a:03CCoxwell-WMF Access for @gengh is handled by T340614. For @CCoxwell-WMF we stil... [20:41:40] (03CR) 10JHathaway: "@dcaro, @jbond, @mutante, I think this is ready for review. I don't think it is a perfect solution, but I think it is worth trying." 
[puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:42:25] (03CR) 10JHathaway: Enforce using a node regex without the wmnet tld (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [20:43:08] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:45:46] 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10RhinosF1) If deploy-service is used for k8s deploys, surely everyone in 'deployment' needs it with MediaWiki moving to k8s.... [20:46:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) Since we are not sure how much longer it will take for T329360 and because this ticket would now sit in "stalled" and be checked by a different perso... [20:50:02] 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) basically "deployment" is "mediawiki scap deployers" and deploy-service is "any service k8 deployers" and started as "... 
[20:50:45] (03CR) 10Hashar: "On Bullseye the issue comes from cloud-init `/etc/cloud/templates/sources.list.debian.tmpl` file which has:" [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [20:53:55] (03CR) 10JHathaway: "Rob I would love your review as well" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:54:08] 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10Dzahn) Hi, this ticket seems resolved. Is it? [20:54:56] 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10Dzahn) a:05BTullis→03cjming Clare, does "it" work and we can close this? [20:55:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Dzahn) a:03MatthewVernon [20:56:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) a:03ItamarWMDE [20:59:24] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10Dzahn) This ticket seems technically resolved. What, if any, would be the next step here? Is it still in discussion? [21:00:23] 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10taavi) >>! In T340165#8978137, @Dzahn wrote: > basically "deployment" is "mediawiki scap deployers" and deploy-service is "an... 
[21:06:34] (03PS2) 10Samtar: IS: Phonos, reorder and enable for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934391 (https://phabricator.wikimedia.org/T336763) [21:07:12] jouncebot: nowandnext [21:07:12] No deployments scheduled for the next 8 hour(s) and 52 minute(s) [21:07:12] In 8 hour(s) and 52 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230630T0600) [21:08:51] Going to deploy a config change, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/934391 [21:09:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934391 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [21:09:25] 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10taavi) > Should we drop this group? From my count[0] there are only three users in that group and not deployment, so it's jus... 
[21:09:42] (SystemdUnitFailed) firing: (28) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:53] (03PS1) 10Majavah: Drop deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) [21:09:56] (03Merged) 10jenkins-bot: IS: Phonos, reorder and enable for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934391 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [21:10:12] !log samtar@deploy1002 Started scap: Backport for [[gerrit:934391|IS: Phonos, reorder and enable for mediawikiwiki (T336763)]] [21:10:17] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [21:10:22] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) per ` role/common/deployment_server/kubernetes.yaml` ` profile::admin::groups: - deployment... [21:11:40] !log samtar@deploy1002 samtar: Backport for [[gerrit:934391|IS: Phonos, reorder and enable for mediawikiwiki (T336763)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:11:53] * TheresNoTime testing [21:13:10] * TheresNoTime syncing [21:13:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10Dzahn) The only thing that changed since 2015 is the numbers are now higher :) ` ~/repos/puppet$ git grep "ensure *=> *\([a-z]\+\)" | wc -l 3027... 
[21:14:42] (SystemdUnitFailed) firing: (28) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:31] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2020 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:17:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:18:38] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:934391|IS: Phonos, reorder and enable for mediawikiwiki (T336763)]] (duration: 08m 26s) [21:18:43] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [21:19:08] * TheresNoTime done [21:22:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:23:03] PROBLEM - WDQS SPARQL on wdqs2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:23:09] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:23:29] PROBLEM - Query Service HTTP Port on wdqs2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:23:41] PROBLEM - Check systemd state on wdqs2020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:45] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:24:13] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2020 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:24:36] hmm [21:25:05] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [21:26:19] ryankemper looks like the cookbook must be removing downtimes? 
I guess we should add a flag for that maybe [21:28:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [21:28:40] same for 2020 and 2021? [21:30:46] (03CR) 10Dzahn: "This should be reviewed by serviceops team (Giuseppe/Alex team)" [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [21:37:40] mutante: we data transferred from 2022 to 2020 so the cookbook would have removed both those downtimes. 2021 was getting reimaged by inflatador [21:37:58] inflatador: yeah, a --no-remove-downtime flag would be nice [21:45:59] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Jdlrobson) Fine with me to be removed from that group. [21:48:19] 10SRE-OnFire, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10JArguello-WMF) [21:50:37] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Engineering, 10Event-Platform Value Stream, and 2 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10JArguello-WMF) [21:51:45] 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10cjming) hi @Dzahn - yes, all is good - thanks! [21:52:13] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops-radar: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10JArguello-WMF) [21:52:37] ryankemper: ah! 
gotcha, thanks [21:53:43] 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10Dzahn) 05Open→03Resolved great :) [21:58:30] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:00:02] 10SRE, 10Data Pipelines, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10JArguello-WMF) [22:03:08] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:03:10] 10SRE, 10Data Pipelines, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10JArguello-WMF) [22:14:07] (03CR) 10Dzahn: "I feel like this should maybe be a post to an SRE mailing list." 
[puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [22:57:36] 10SRE, 10Data-Engineering, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10JArguello-WMF) [23:15:53] (03PS1) 10RLazarus: opentelemetry-collector: Switch off unused default receivers and ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/934420 (https://phabricator.wikimedia.org/T320564) [23:28:18] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:30] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:28:38] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:28:54] PROBLEM - WDQS SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:29:26] PROBLEM - Query Service HTTP Port on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [23:29:26] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:29:26] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook