[00:03:54] bots seem down
[00:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:12] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:46] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:52] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:21] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (tstarling) Logs show that the issue started at 2023-03-07 14:23:35 and it fixed itself shortly after 202...
[05:29:55] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (tstarling) SAL entry associated with the end of the incident: ` 17:16 mvernon@cumin1001: END (PASS) - C...
[05:43:47] (CR) Marostegui: [C: +1] Always write parsoid output to parser cache. [mediawiki-config] - https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: Daniel Kinzler)
[05:47:36] (PS9) KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[05:53:41] (PS1) Marostegui: mariadb: Migrate db1106 to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/899191 (https://phabricator.wikimedia.org/T322294)
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T0600)
[06:03:36] (CR) Marostegui: [C: +2] mariadb: Migrate db1106 to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/899191 (https://phabricator.wikimedia.org/T322294) (owner: Marostegui)
[06:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:02] (PS10) KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[06:12:09] (PS1) Gergő Tisza: LevelingUpManager: Ensure that $suggestions is a TaskSet [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/898869
[06:19:01] (CR) Marostegui: "This requires manual drop from the database" [puppet] -
https://gerrit.wikimedia.org/r/898800 (https://phabricator.wikimedia.org/T332018) (owner: Jbond)
[06:19:04] (CR) Marostegui: [C: +2] P:mariadb: drop pki2001 from grants [puppet] - https://gerrit.wikimedia.org/r/898800 (https://phabricator.wikimedia.org/T332018) (owner: Jbond)
[06:20:24] !log Remove pki2001 from m1 grants T332018
[06:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:29] T332018: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018
[06:21:42] ops-codfw, DC-Ops, Infrastructure-Foundations, decommission-hardware, Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (Marostegui) I have removed the grants from m1 database after merging the above patch.
[06:26:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105 (s1,s2) T331874', diff saved to https://phabricator.wikimedia.org/P45870 and previous config saved to /var/cache/conftool/dbconfig/20230315-062643-root.json
[06:26:49] T331874: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874
[06:30:34] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:34] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:39:26] (CR) Samwilson: [C: +1] InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) (owner: Samtar)
[06:45:25] (PS1) Jameel Kaisar: Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214
[06:47:30] (PS2) Jameel Kaisar: Create and deploy per-CDN-site DNS
domains [dns] - https://gerrit.wikimedia.org/r/899214
[06:47:45] (PS1) Marostegui: db1105: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/899215 (https://phabricator.wikimedia.org/T331874)
[06:49:05] (CR) Marostegui: [C: +2] db1105: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/899215 (https://phabricator.wikimedia.org/T331874) (owner: Marostegui)
[07:00:04] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T0700).
[07:00:04] tgr: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:24] o/
[07:01:30] I can deploy
[07:02:01] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 222.3k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[07:02:57] (PS2) Gergő Tisza: [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358)
[07:03:18] (CR) Gergő Tisza: [C: +2] LevelingUpManager: Ensure that $suggestions is a TaskSet [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/898869 (owner: Gergő Tisza)
[07:03:45] (CR) Gergő Tisza: [C: +2] [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358) (owner: Gergő Tisza)
[07:04:11] (CR) Gergő Tisza: [C: +2] [beta] GrowthExperiments: Short leveling up
notification delay for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358) (owner: Gergő Tisza)
[07:04:57] (Merged) jenkins-bot: [beta] GrowthExperiments: Short leveling up notification delay for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/898246 (https://phabricator.wikimedia.org/T330358) (owner: Gergő Tisza)
[07:14:18] (PS3) Jameel Kaisar: Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025)
[07:15:02] (CR) Ayounsi: "Nice! some aesthetics comments then lgtm if it works as expected." [puppet] - https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[07:17:01] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 200k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[07:21:22] (Merged) jenkins-bot: LevelingUpManager: Ensure that $suggestions is a TaskSet [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/898869 (owner: Gergő Tisza)
[07:24:04] (PS1) Marostegui: db1117: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/899399 (https://phabricator.wikimedia.org/T322294)
[07:24:52] (CR) Marostegui: [C: +2] db1117: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/899399 (https://phabricator.wikimedia.org/T322294) (owner: Marostegui)
[07:26:46] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:56] !log tgr@deploy2002 Started scap: Backport for [[gerrit:898869|LevelingUpManager:
Ensure that $suggestions is a TaskSet]]
[07:30:37] !log tgr@deploy2002 tgr: Backport for [[gerrit:898869|LevelingUpManager: Ensure that $suggestions is a TaskSet]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[07:36:04] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:36:51] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:898869|LevelingUpManager: Ensure that $suggestions is a TaskSet]] (duration: 07m 54s)
[07:39:58] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be2067.codfw.wmnet
[07:40:15] !log UTC morning deploys done
[07:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:20] (PS1) Marostegui: m5: Replace the standby host [puppet] - https://gerrit.wikimedia.org/r/899402 (https://phabricator.wikimedia.org/T331877)
[07:40:50] (CR) Marostegui: [C: +2] m5: Replace the standby host [puppet] - https://gerrit.wikimedia.org/r/899402 (https://phabricator.wikimedia.org/T331877) (owner: Marostegui)
[07:52:07] (CR) Muehlenhoff: [C: -1] "The Python 2 packages in Bullseye are entirely unsupported and only included to build packages, not run them (specifically for Chromium/Qt" [puppet] - https://gerrit.wikimedia.org/r/898982 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[07:59:39] SRE-tools, DC-Ops, Infrastructure-Foundations, Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (MoritzMuehlenhoff) Agreed, I think we can simply resolve task.
[08:01:44] (PS1) Majavah: P:wmcs::nfs: maintain_dbusers: fix monitoring on inactive hosts [puppet] - https://gerrit.wikimedia.org/r/899476
[08:04:19] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40126/console" [puppet] - https://gerrit.wikimedia.org/r/899476 (owner: Majavah)
[08:07:44] SRE, DBA, Striker, Toolhub, and 3 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (Marostegui) I have tested the future master (db1106) which runs 10.6 on the proxies. It gets detected just fine.
[08:11:28] (PS1) Marostegui: misc.my.cnf: Add thread_pool_size [puppet] - https://gerrit.wikimedia.org/r/899479
[08:13:21] (CR) Marostegui: [C: +2] misc.my.cnf: Add thread_pool_size [puppet] - https://gerrit.wikimedia.org/r/899479 (owner: Marostegui)
[08:14:44] (PS1) Marostegui: Revert "m5: Replace the standby host" [puppet] - https://gerrit.wikimedia.org/r/898881
[08:15:03] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo
[08:15:23] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on A:cp-upload_ulsfo
[08:15:52] (CR) Marostegui: [C: +2] Revert "m5: Replace the standby host" [puppet] - https://gerrit.wikimedia.org/r/898881 (owner: Marostegui)
[08:16:10] (CR) Muehlenhoff: install_server: use second pair of disks for /srv/gitlab-backup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[08:18:18] PROBLEM - puppet last run on kubetcd2005 is CRITICAL: CRITICAL: Puppet has been disabled for 605016 seconds, message: T329717 - jayme, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:18:30] PROBLEM - puppet last run on kubetcd2004 is CRITICAL:
CRITICAL: Puppet has been disabled for 605028 seconds, message: T329717 - jayme, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:18:44] SRE, DBA, Striker, Toolhub, and 2 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (Marostegui) I also pushed this applying it to misc: https://gerrit.wikimedia.org/r/c/operations/puppet/+/899479
[08:19:12] PROBLEM - puppet last run on kubetcd2006 is CRITICAL: CRITICAL: Puppet has been disabled for 605070 seconds, message: T329717 - jayme, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:19:34] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:49] (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/899481
[08:20:00] (CR) CI reject: [V: -1] buster updates [puppet] - https://gerrit.wikimedia.org/r/899481 (owner: Muehlenhoff)
[08:21:28] SRE, serviceops, Patch-For-Review, Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (MoritzMuehlenhoff)
[08:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:22:02] (PS1) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Fix apt-get cmd [cookbooks] - https://gerrit.wikimedia.org/r/899482
[08:23:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:24:52] SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (MoritzMuehlenhoff) >>! In T331647#8691230, @Ottomata wrote: > Hm, that group (as well as analytics-research-admins) gives some sudo rights to a system user (analytics-platform-eng) that does have analytics-...
[08:25:07] (PS8) Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[08:25:34] (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[08:26:12] (PS2) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/899481
[08:28:16] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/899481 (owner: Muehlenhoff)
[08:28:53] (CR) Filippo Giunchedi: [C: +2] netops: split routinator from ping offload [alerts] - https://gerrit.wikimedia.org/r/898776 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:28:56] (CR) Filippo Giunchedi: [C: +2] search-platform: deploy alerts to specific Prometheus instances [alerts] - https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:28:59] (CR) Filippo Giunchedi: [C: +2] perf: deploy to 'ext' instance [alerts] - https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:29:02] (CR) CI reject: [V: -1] netops: split routinator from ping offload [alerts] - https://gerrit.wikimedia.org/r/898776 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:30:50] (CR) CI reject: [V: -1] search-platform: deploy alerts to
specific Prometheus instances [alerts] - https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:32:25] (PS2) Filippo Giunchedi: perf: deploy to 'ext' instance [alerts] - https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182)
[08:32:27] (PS3) Filippo Giunchedi: search-platform: deploy alerts to specific Prometheus instances [alerts] - https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182)
[08:32:29] (PS2) Filippo Giunchedi: netops: split routinator from ping offload [alerts] - https://gerrit.wikimedia.org/r/898776 (https://phabricator.wikimedia.org/T309182)
[08:32:30] (Emergency syslog message) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[08:32:48] (CR) CI reject: [V: -1] perf: deploy to 'ext' instance [alerts] - https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:34:38] ops-eqiad, serviceops: Broken PSU on mw1435 - https://phabricator.wikimedia.org/T332117 (MoritzMuehlenhoff)
[08:34:49] interesting, ^ failed because CI couldn't clone from contint2001
[08:34:51] ops-eqiad, serviceops: Broken PSU on mw1435 - https://phabricator.wikimedia.org/T332117 (MoritzMuehlenhoff) p:Triage→Medium
[08:34:58] https://integration.wikimedia.org/ci/job/alerts-pipeline-test/824/console
[08:35:44] seems temporary though, other jobs succeeded
[08:35:51] (CR) Filippo Giunchedi: "recheck" [alerts] - https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:37:30] (Emergency syslog message) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[08:37:31] (Merged) jenkins-bot: perf: deploy to 'ext'
instance [alerts] - https://gerrit.wikimedia.org/r/898754 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:38:07] (Merged) jenkins-bot: search-platform: deploy alerts to specific Prometheus instances [alerts] - https://gerrit.wikimedia.org/r/898765 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:38:26] ops-codfw, serviceops: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (MoritzMuehlenhoff)
[08:38:29] ops-codfw, serviceops: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (MoritzMuehlenhoff) p:Triage→Medium
[08:38:45] (Merged) jenkins-bot: netops: split routinator from ping offload [alerts] - https://gerrit.wikimedia.org/r/898776 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:39:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (karapayneWMDE) approved / signed off by me
[08:40:07] !log hashar@deploy2002 Started deploy [integration/docroot@5abe9c6]: Link Groovy doc of PipelineLib - T222199
[08:40:13] T222199: Post generated docs for pipelinelib - https://phabricator.wikimedia.org/T222199
[08:40:26] !log hashar@deploy2002 Finished deploy [integration/docroot@5abe9c6]: Link Groovy doc of PipelineLib - T222199 (duration: 00m 19s)
[08:41:16] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (MoritzMuehlenhoff)
[08:41:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (MoritzMuehlenhoff) p:Triage→Medium
[08:42:21] yeah that test is never run cause in the jobs we run phpcs first which fails
[08:42:29] so the jobs are shortcircuited
[08:42:40] which should make Zuul to report earlier
[08:42:50] but the Selenium tests are NOT
shortcircuited
[08:43:01] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (MoritzMuehlenhoff) @odimitrijevic, @Ottomata This needs your approval for analytics-privatedata-users access @dr0ptp4kt This needs your manager approval
[08:43:20] so still have to wait for them to complete before having a report that the various jobs that failed all failed due to phpcs (which we thus run several time)
[08:43:27] GRR
[08:43:30] wrong channel
[08:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:12] SRE, Infrastructure-Foundations, netops: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) Declined→Open > We do not support forwarding status in ipfix message. > However, you may use ‘report-zero-oif-gw-on-discard’ in which Jflow can be forced to repo...
[08:55:45] (PS1) Filippo Giunchedi: search-platform: deploy blazegraph/cirrus/pipelines alerts to eqiad/codfw only [alerts] - https://gerrit.wikimedia.org/r/899503 (https://phabricator.wikimedia.org/T309182)
[08:55:47] (CR) Jelto: "thanks for the review! I answered in-line."
[puppet] - https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[08:56:17] (CR) Volans: [C: +1] "LGTM, FYI I'll be replacing these apt-get calls shortly with the new spicerack apt module that is now merged but not yet released:" [cookbooks] - https://gerrit.wikimedia.org/r/899482 (owner: Vgutierrez)
[08:56:19] (PS1) Ayounsi: Add report-zero-oif-gw-on-discard for netflow [homer/public] - https://gerrit.wikimedia.org/r/899504 (https://phabricator.wikimedia.org/T331707)
[08:56:44] (CR) DCausse: [C: +1] search-platform: deploy blazegraph/cirrus/pipelines alerts to eqiad/codfw only [alerts] - https://gerrit.wikimedia.org/r/899503 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:57:06] (CR) Vgutierrez: [C: +2] sre.cdn.roll-upgrade-haproxy: Fix apt-get cmd (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/899482 (owner: Vgutierrez)
[08:59:30] RECOVERY - puppet last run on kubetcd2005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:01:00] (PS14) Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[09:01:30] (CR) CI reject: [V: -1] osd: Add osd on new ceph cluster [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: Nicolas Fraison)
[09:02:13] (PS1) Filippo Giunchedi: perf: fix webperf metric names [alerts] - https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182)
[09:02:16] (CR) Volans: [C: +1] "Looks good to me."
[dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar)
[09:02:31] (PS11) Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[09:03:12] (CR) Filippo Giunchedi: [C: +2] search-platform: deploy blazegraph/cirrus/pipelines alerts to eqiad/codfw only [alerts] - https://gerrit.wikimedia.org/r/899503 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:03:57] (CR) JMeybohm: [C: +1] Refactor and centralize BGPpeer config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[09:04:12] (PS1) Marostegui: instances.yaml: Remove db1106 [puppet] - https://gerrit.wikimedia.org/r/899509 (https://phabricator.wikimedia.org/T331875)
[09:04:37] (Merged) jenkins-bot: search-platform: deploy blazegraph/cirrus/pipelines alerts to eqiad/codfw only [alerts] - https://gerrit.wikimedia.org/r/899503 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:04:51] (CR) Marostegui: [C: +2] instances.yaml: Remove db1106 [puppet] - https://gerrit.wikimedia.org/r/899509 (https://phabricator.wikimedia.org/T331875) (owner: Marostegui)
[09:05:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1106 from dbctl T331875', diff saved to https://phabricator.wikimedia.org/P45872 and previous config saved to /var/cache/conftool/dbconfig/20230315-090515-root.json
[09:05:21] T331875: Move db1106 to m5 - https://phabricator.wikimedia.org/T331875
[09:05:34] RECOVERY - puppet last run on kubetcd2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:06:12] RECOVERY - puppet last run on kubetcd2006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:06:28] (CR) Muehlenhoff: [C: +1] install_server: use second pair of disks for /srv/gitlab-backup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[09:07:34] (CR) JMeybohm: [C: +2] Remove default-network-policy-conf [deployment-charts] - https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) (owner: JMeybohm)
[09:10:17] (CR) Filippo Giunchedi: [C: +2] perf: fix webperf metric names [alerts] - https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:11:19] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (MoritzMuehlenhoff) p:Triage→Medium
[09:11:33] SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (MoritzMuehlenhoff) p:Triage→Medium
[09:11:37] (Merged) jenkins-bot: perf: fix webperf metric names [alerts] - https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:12:03] (PS1) JMeybohm: cert-manager: Update cert-manager to 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/899515 (https://phabricator.wikimedia.org/T325292)
[09:13:02] (CR) Giuseppe Lavagetto: [C: +1] contint: manage dsh target from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893483 (owner: Hashar)
[09:13:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (MoritzMuehlenhoff) @odimitrijevic, @Ottomata: This needs your approval for analytics-privatedata-users
[09:13:09] (Merged) jenkins-bot: Remove default-network-policy-conf [deployment-charts] - https://gerrit.wikimedia.org/r/893019
(https://phabricator.wikimedia.org/T275035) (owner: JMeybohm)
[09:14:32] (CR) CI reject: [V: -1] cert-manager: Update cert-manager to 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/899515 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[09:15:48] (PS1) Ayounsi: nfacctd: export next-hop IP and outbound interface [puppet] - https://gerrit.wikimedia.org/r/899516 (https://phabricator.wikimedia.org/T331707)
[09:16:03] (PS2) JMeybohm: cert-manager: Update cert-manager to 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/899515 (https://phabricator.wikimedia.org/T325292)
[09:22:01] !log installing gnutls28 security updates
[09:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:40] (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: add pint for thanos-rule [puppet] - https://gerrit.wikimedia.org/r/898701 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:23:36] (PS1) Slyngshede: SSH Keymanagement, allow user to manage ssh keys.
[software/bitu] - https://gerrit.wikimedia.org/r/899519
[09:23:47] (CR) JMeybohm: [C: +2] cert-manager: Update cert-manager to 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/899515 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[09:26:44] !log rolling restart of FPM/Apache to pick up gnutls28 security updates
[09:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:25] SRE, Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (MoritzMuehlenhoff)
[09:29:53] (Merged) jenkins-bot: cert-manager: Update cert-manager to 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/899515 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[09:30:42] SRE, Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (MoritzMuehlenhoff) >>! In T331706#8684669, @Legoktm wrote: > The packages were initially backports of the bullseye versions, but we have a bunch of random patches on top. On T286217#85729...
[09:31:31] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) Next steps here are: # Check with data-engineering ( @BTullis ?) if it's ok to add those 3 new keys (and what changes are needed in druid/turnilo)...
[09:33:56] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (dr0ptp4kt) Approved.
[09:34:19] (PS1) Hashar: build: add local typos check [mediawiki-config] - https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121)
[09:36:14] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (MoritzMuehlenhoff) I tried to backport FNM 1.2.4, it does some tricky things with Boost and so far I couldn't force cmake to accept Bullseye's Boost libs. Given that Bookworm is close (unstable...
[09:36:38] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo
[09:37:04] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi)
[09:37:12] (CR) Hashar: build: add local typos check (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: Hashar)
[09:38:54] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ayounsi) That's fine for me!
[09:39:28] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo
[09:42:06] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (MoritzMuehlenhoff) >>! In T330884#8696816, @ayounsi wrote: > That's fine for me! Is it as easy as a re-image? It'll take a bit to sort out the setup, my proposal would be to add an additional...
[09:45:48] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [09:46:04] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [09:46:05] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/blubberoid: apply [09:46:17] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [09:46:18] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [09:46:53] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [09:46:54] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [09:49:25] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [09:49:27] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [09:50:15] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [09:50:16] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [09:50:35] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [09:50:36] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [09:50:49] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [09:50:51] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:51:00] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:51:01] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:51:45] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:51:46] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [09:52:12] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [09:52:13] !log jayme@deploy2002 helmfile 
[staging] START helmfile.d/services/eventgate-analytics-external: apply [09:52:24] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [09:52:25] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [09:52:56] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [09:52:57] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [09:53:05] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [09:53:06] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [09:53:14] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [09:53:15] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [09:53:30] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [09:53:31] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [09:53:34] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [09:53:35] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:54:10] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:54:12] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [09:54:21] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [09:54:22] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [09:54:33] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [09:54:34] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:54:49] !log jayme@deploy2002 helmfile [staging] DONE 
helmfile.d/services/linkrecommendation: apply [09:54:50] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [09:55:04] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [09:55:05] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:55:18] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:55:19] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:55:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo [09:55:34] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:55:35] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [09:55:49] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [09:55:50] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [09:55:59] (CR) Jbond: "lgtm but wonder if we should also remove install_from_component" [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh) [09:56:05] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [09:56:06] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [09:56:29] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [09:56:30] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [09:56:46] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [09:56:48] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [09:56:50] (PS4) Clément Goubert: Assign mediawiki roles to mw2420-mw2451 [puppet] - 
https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) [09:56:55] (CR) Clément Goubert: Assign mediawiki roles to mw2420-mw2451 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: Clément Goubert) [09:57:04] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [09:57:05] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [09:57:15] (CR) CI reject: [V: -1] Assign mediawiki roles to mw2420-mw2451 [puppet] - https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: Clément Goubert) [09:57:19] SRE, Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (MoritzMuehlenhoff) >>! In T320794#8692786, @bd808 wrote: >> - First name (sn) (optional) >> - Given name (givenName) (optional) > > What are these values expect...
[09:57:33] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [09:57:34] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [09:58:00] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [09:58:01] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/similar-users: apply [09:58:15] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/similar-users: apply [09:58:16] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [09:58:34] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [09:58:36] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [09:59:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1000) [10:01:52] SRE, DBA, Striker, Toolhub, and 2 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (Marostegui) [10:02:57] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (BTullis) Seems like a great idea to me. I don't have any concerns about the addition of the three new keys, or the boolean if it ends up being a computed va...
[10:03:07] SRE, DBA, Striker, Toolhub, and 2 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (Marostegui) [10:06:10] (PS1) Filippo Giunchedi: thanos: exclude promql/rate pint check [puppet] - https://gerrit.wikimedia.org/r/899525 (https://phabricator.wikimedia.org/T309182) [10:08:02] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40128/console" [puppet] - https://gerrit.wikimedia.org/r/899525 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi) [10:08:46] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:08:48] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [10:09:36] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [10:09:37] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:10:14] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:10:15] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [10:10:23] (PS5) Clément Goubert: Assign mediawiki roles to mw2420-mw2451 [puppet] - https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) [10:10:24] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:13:23] (PS15) Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [10:15:34] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:15:46] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:16:39] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:16:44] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:18:16] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:18:30] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:19:56] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:20:42] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:21:27] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:21:49] (CR) JMeybohm: [C: +1] device-analytics: add missing mesh port [deployment-charts] - https://gerrit.wikimedia.org/r/898820 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [10:22:03] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:22:22] (CR) Hnowlan: [C: +2] device-analytics: add missing mesh port [deployment-charts] - https://gerrit.wikimedia.org/r/898820 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [10:22:34] !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:23:17] !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:23:54] (PS2) Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - https://gerrit.wikimedia.org/r/899519 [10:24:29] !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:25:00] !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:26:37] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:27:06] (Merged) jenkins-bot: device-analytics: add missing mesh port [deployment-charts] - https://gerrit.wikimedia.org/r/898820 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [10:28:31] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:28:34] (CR) Muehlenhoff: [C: +1] "Not familiar with the pint details, but looks good to me" [puppet] - https://gerrit.wikimedia.org/r/899525 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi) [10:29:38] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:30:31] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:33:01] (CR) Vgutierrez: trafficserver: move testwikidata to kubernetes (3 comments) [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [10:34:00] (PS2) Jbond: service:catalogue: Add pki as an active active service [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [10:34:35] (PS5) Clément Goubert: trafficserver: move testwikidata to kubernetes [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) [10:34:50] (CertManagerCertExpirySoon) firing: Certificate istio-system/device-analytics in is about to expire (k8s@eqiad) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=eqiad&var-cluster=k8s&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertExpirySoon [10:35:10] CertManagerCertExpirySoon is me, taking care of it [10:36:58] SRE, SRE-swift-storage, Commons, Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (Hsarrazin) Hello, this problem causes difficulties on Wikisource, due to the impossibility to access page image while proofreading...
[10:37:24] (CR) Clément Goubert: "All fixed in next patch" [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [10:37:35] (PS6) Clément Goubert: trafficserver: move testwikidata to kubernetes [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) [10:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:39:26] (CertManagerCertNotReady) firing: Certificate istio-system/device-analytics is not in a ready state (k8s@eqiad) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=eqiad&var-cluster=k8s&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [10:39:50] (CertManagerCertExpirySoon) firing: (2) Certificate istio-system/device-analytics in is about to expire (k8s@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertExpirySoon [10:41:02] (PS3) Jbond: service:catalogue: Add pki as an active active service [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [10:43:49] (PS1) David Caro: maintain_dbusers: replace icinga check with prometheus one [puppet] - https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) [10:43:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:44:08] (PS1) Jbond: pki.discovery.wmnet: convert to active/active 
discovery record [dns] - https://gerrit.wikimedia.org/r/899533 (https://phabricator.wikimedia.org/T331523) [10:45:28] (CR) CI reject: [V: -1] pki.discovery.wmnet: convert to active/active discovery record [dns] - https://gerrit.wikimedia.org/r/899533 (https://phabricator.wikimedia.org/T331523) (owner: Jbond) [10:46:35] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40130/console" [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) (owner: Jbond) [10:47:00] (PS4) Jbond: service:catalogue: Add pki as an active active service [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [10:48:08] SRE, Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (SLyngshede-WMF) [10:48:17] SRE, Infrastructure-Foundations: Implement OAuth account validation for linking an account to a wiki account - https://phabricator.wikimedia.org/T320807 (SLyngshede-WMF) In progress→Resolved a:SLyngshede-WMF [10:48:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:08] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40131/console" [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) (owner: Jbond) [10:53:15] (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: exclude promql/rate pint check [puppet] - https://gerrit.wikimedia.org/r/899525 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi) [10:54:26] (CertManagerCertNotReady) resolved: Certificate 
istio-system/device-analytics is not in a ready state (k8s@eqiad) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=eqiad&var-cluster=k8s&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [10:54:50] (CertManagerCertExpirySoon) resolved: (2) Certificate istio-system/device-analytics in is about to expire (k8s@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertExpirySoon [10:57:18] (CR) Vgutierrez: [C: +1] trafficserver: move testwikidata to kubernetes [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [10:59:34] SRE, SRE-swift-storage, Commons, Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (Patafisik) I'm experiencing some problems with files not displayed and broken thumbnails too, here are my screenshots (from [[https://commons.wikimedia.org/w/index.php?title=Commons%3AVilla...
[11:00:32] !log Redirecting test.wikidata.org to mw-on-k8s - T331268/25 [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:37] T331268: Migrate testwikidata to Kubernetes - https://phabricator.wikimedia.org/T331268 [11:01:15] (CR) Clément Goubert: [C: +2] trafficserver: move testwikidata to kubernetes [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [11:04:50] :o [11:05:22] Lucas_WMDE: :) [11:07:02] Lucas_WMDE: It'll move to mw-on-k8s as puppet runs happen on cp hosts [11:07:11] cool [11:07:56] as long as the wikidata team isn’t blamed on issues that come from a server move that we weren’t involved with, that sounds good and exciting to me ;) [11:08:02] You can already test it with the X-Wikimedia-Debug extension by pointing it to k8s-experimental [11:08:52] Lucas_WMDE: We'll take the possible fallout, don't worry. Your team was tagged in the task though [11:08:56] Hi operators, I wish to deploy a proton service patch: https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/889211 on our k8s environment. Is now a good time to do so? [11:09:22] (PS5) Jbond: service:catalogue: Add pki as an active active service [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [11:09:24] (PS1) Jbond: wmflib::service::probe::module_options: simplify function and add tests [puppet] - https://gerrit.wikimedia.org/r/899542 [11:09:38] ah, I missed that tag [11:09:41] thanks :) [11:09:43] xSavitar: Yeah, jayme seems to be done with the certs stuff, I think you're good [11:09:44] (PS6) Jbond: service:catalogue: Add pki as an active active service [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [11:10:05] claime, thanks!
[11:10:19] I'm not, but it won't interfere - go ahead :) [11:10:52] Ah, I saw the alerts resolving and thought you were done wrecking chaos upon cert-manager :P [11:11:10] jayme, okay! [11:12:01] (CR) CI reject: [V: -1] wmflib::service::probe::module_options: simplify function and add tests [puppet] - https://gerrit.wikimedia.org/r/899542 (owner: Jbond) [11:12:53] it kind of auto-resolved actually...seems like it takes some time for network-policy changes to take effect [11:13:24] SRE, MW-on-K8s, Traffic, Wikidata, and 2 others: Migrate testwikidata to Kubernetes - https://phabricator.wikimedia.org/T331268 (Clement_Goubert) In progress→Resolved `test.wikidata.org` will now be progressively moved to mw-on-k8s as puppet runs happen. Feel free to open subtasks for any... [11:13:42] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [11:14:09] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [11:15:05] (PS1) Filippo Giunchedi: search-platform: remove CirrusSearchJobQueueLagTooHigh from 'ops', moved to 'k8s' [alerts] - https://gerrit.wikimedia.org/r/899549 [11:15:56] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena) Hi, >>! In T330693#8696219, @Eevans wrote: Regarding the use of Fl...
[11:16:01] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [11:16:04] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [11:16:08] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [11:16:24] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [11:17:23] (CR) Hnowlan: [C: +2] cassandra: fix device_analytics creation syntax [puppet] - https://gerrit.wikimedia.org/r/898824 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [11:20:26] (CR) Jgiannelos: [C: +2] proton: Deploying 2023-03-14-172615-production to production [deployment-charts] - https://gerrit.wikimedia.org/r/899548 (https://phabricator.wikimedia.org/T331013) (owner: D3r1ck01) [11:20:48] !log imported packages into thirdparty/ceph-quincy [11:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:03] Lucas_WMDE: FYI, I'm seeing a perf regression on the first request ending up on a pod, which is expected, then the generation times are more or less the same as bare metal (for Main_Page) [11:21:25] nice [11:21:26] ~1.2s for first request, then ~200ms [11:21:35] (PS3) JMeybohm: Migrate away from deprecated topology annotations [deployment-charts] - https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) [11:21:38] and e.g. config changes and scap stuff all still work the same way right?
[11:21:48] (since this isn’t the first test wiki on k8s anyways ^^) [11:21:57] scap deploys to mw-on-k8s, so yes :) [11:22:07] good :) [11:23:02] (CR) Filippo Giunchedi: [C: +2] search-platform: remove CirrusSearchJobQueueLagTooHigh from 'ops', moved to 'k8s' [alerts] - https://gerrit.wikimedia.org/r/899549 (owner: Filippo Giunchedi) [11:23:11] If you have a test suite you run on test.wikidata.org or something, please don't hesitate to do so starting in around half an hour so we're sure all cp hosts have the new config [11:26:06] (Merged) jenkins-bot: proton: Deploying 2023-03-14-172615-production to production [deployment-charts] - https://gerrit.wikimedia.org/r/899548 (https://phabricator.wikimedia.org/T331013) (owner: D3r1ck01) [11:26:43] !log derick@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [11:27:32] !log derick@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [11:30:40] !log derick@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [11:31:14] maybe we can run the browser tests against test.wikidata.org, not sure [11:32:03] !log derick@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [11:34:30] (PS1) Muehlenhoff: Remove LDAP access for ahollender [puppet] - https://gerrit.wikimedia.org/r/899551 [11:34:40] !log derick@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [11:35:07] (CR) JMeybohm: [C: +2] Migrate away from deprecated topology annotations [deployment-charts] - https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm) [11:36:20] (CR) Muehlenhoff: [C: +2] Remove LDAP access for ahollender [puppet] - https://gerrit.wikimedia.org/r/899551 (owner: Muehlenhoff) [11:36:26] !log derick@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [11:37:27] (PS2) Jbond: wmflib::service::probe::module_options: simplify function and add tests [puppet] - 
https://gerrit.wikimedia.org/r/899542 [11:39:50] (Merged) jenkins-bot: Migrate away from deprecated topology annotations [deployment-charts] - https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm) [11:43:08] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40132/console" [puppet] - https://gerrit.wikimedia.org/r/899542 (owner: Jbond) [11:46:15] (PS12) Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [11:46:51] (ProbeDown) firing: (29) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:06] RECOVERY - MariaDB Replica IO: matomo on db1108 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:47:26] RECOVERY - MariaDB Replica SQL: matomo on db1108 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:48:12] (PS10) Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [11:49:18] SRE, serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (Clement_Goubert) p:Triage→Medium [11:49:24] SRE, serviceops: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (Clement_Goubert) p:Triage→Medium [11:50:30] (PS1) JMeybohm: admin_ng: Remove chart version pinning [deployment-charts] - https://gerrit.wikimedia.org/r/899563 (https://phabricator.wikimedia.org/T306649) [11:51:23] 
SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) p:Triage→Low a:JAllemandou Thanks! Moving the task over to Joseph [11:51:35] btullis: ^ the alert recovered [11:51:51] (ProbeDown) firing: (34) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:02] SRE, Machine-Learning-Team, serviceops, Language-Team (Language-2023-January-March), Service-deployment-requests: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (Clement_Goubert) p:Triage→Medium [11:53:22] (CR) JMeybohm: admin_ng: Remove chart version pinning (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/899563 (https://phabricator.wikimedia.org/T306649) (owner: JMeybohm) [11:53:46] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Clement_Goubert) [11:54:47] (CR) Jelto: [C: -1] "thanks for the reviews. 
I'll -1 this one until new disk arrive" [puppet] - https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: Jelto) [11:56:14] SRE, SRE-swift-storage, Commons, Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (Ahecht) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Thumbnails_not_generating [11:57:29] (CR) Ayounsi: [C: +1] "lgtm based on my limited k8s knowledge" [deployment-charts] - https://gerrit.wikimedia.org/r/899563 (https://phabricator.wikimedia.org/T306649) (owner: JMeybohm) [11:59:07] (CR) JMeybohm: [C: +1] Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi) [11:59:19] (CR) JMeybohm: [C: +1] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi) [12:02:48] (CR) JMeybohm: [C: +2] "CI diff looks like a downgrade from 0.2.7 to 0.2.6 but that's not actually true. 
I've merged 0.2.7 a couple of minutes ago and never deplo" [deployment-charts] - https://gerrit.wikimedia.org/r/899563 (https://phabricator.wikimedia.org/T306649) (owner: JMeybohm) [12:05:03] (PS11) KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:05:41] SRE, serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (Clement_Goubert) p:Triage→Medium [12:05:57] (CR) CI reject: [V: -1] WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry) [12:07:29] (Merged) jenkins-bot: admin_ng: Remove chart version pinning [deployment-charts] - https://gerrit.wikimedia.org/r/899563 (https://phabricator.wikimedia.org/T306649) (owner: JMeybohm) [12:08:32] SRE, serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (Clement_Goubert) [12:08:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:08:40] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [12:11:05] SRE, Infrastructure-Foundations, serviceops-radar, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert) [12:11:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: m5 master switch T331877 [12:12:01] SRE, DBA, Striker, Toolhub, and 2 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (Marostegui) 
@bd808 I am going to do this now, so tomorrow morning I can revert it. [12:12:04] T331877: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 [12:12:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: m5 master switch T331877 [12:12:22] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) [12:12:50] (03PS1) 10Marostegui: db1106: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/899570 (https://phabricator.wikimedia.org/T331877) [12:13:23] (03CR) 10Marostegui: [C: 03+2] db1106: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/899570 (https://phabricator.wikimedia.org/T331877) (owner: 10Marostegui) [12:13:32] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) [12:15:17] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) [12:15:32] (03PS1) 10Marostegui: mariadb: Promote db1106 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/899571 (https://phabricator.wikimedia.org/T331877) [12:15:47] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10Volans) To recap from an IRC chat, we need to define where should the automatic SAL log that spicerack emits on START/END of cook... 
[12:16:40] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1106 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/899571 (https://phabricator.wikimedia.org/T331877) (owner: 10Marostegui) [12:17:03] 10SRE, 10DBA, 10Striker, 10Toolhub, and 3 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (10Marostegui) [12:17:16] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) [12:17:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:17:24] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:17:56] 10SRE, 10DBA, 10Striker, 10Toolhub, and 3 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (10Marostegui) [12:18:05] !log Failover m5 from db1176 to db1106 - T331877 [12:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:10] T331877: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 [12:20:01] 10SRE, 10DBA, 10Striker, 10Toolhub, and 3 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (10Marostegui) [12:20:25] 10SRE, 10DBA, 10Striker, 10Toolhub, and 3 others: Switchover m5 master (db1176 -> db1106) - https://phabricator.wikimedia.org/T331877 (10Marostegui) 05Open→03Resolved This was done, RO time was around 15 seconds. 
[12:20:54] (03PS12) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:21:50] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) [12:22:06] (03PS1) 10Marostegui: db1176: Migrate it to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/899575 (https://phabricator.wikimedia.org/T322294) [12:22:27] (03PS7) 10Jbond: service:catalogue: Add pki as an active active service [puppet] - 10https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [12:22:32] (03CR) 10Marostegui: [C: 03+2] db1176: Migrate it to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/899575 (https://phabricator.wikimedia.org/T322294) (owner: 10Marostegui) [12:24:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [12:24:30] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [12:24:50] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Migrate testwikidata to Kubernetes - https://phabricator.wikimedia.org/T331268 (10Lucas_Werkmeister_WMDE) FWIW, I tried running our browser test suite against Wikidata. ` $ nvm use 14 $ node --version v14.19.1 $ MW_SERVER=https://test.wikidata.org M... 
[12:27:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:55] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:27:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40133/console" [puppet] - 10https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) (owner: 10Jbond) [12:27:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:28:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:28:28] (03CR) 10Cathal Mooney: "LGTM! Eqiad will be a little trickier given multiple racks/vlans but this should be fine for codfw I think." 
[puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:28:37] 10SRE-tools, 10Infrastructure-Foundations: sync firmware between cumin hosts - https://phabricator.wikimedia.org/T332158 (10jbond) 05Open→03In progress p:05Triage→03Medium [12:32:39] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump workers, reduce cpu, increase haproxy queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/898728 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [12:32:59] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/899569/40134/" [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:34:28] (03PS1) 10Marostegui: db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/899581 [12:35:14] (03CR) 10Marostegui: [C: 03+2] db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/899581 (owner: 10Marostegui) [12:37:24] (03Merged) 10jenkins-bot: thumbor: bump workers, reduce cpu, increase haproxy queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/898728 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [12:43:46] (03PS1) 10Hnowlan: service: move device-analytics to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/899607 (https://phabricator.wikimedia.org/T320967) [12:45:14] (03PS1) 10Hnowlan: service: move device-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/899608 (https://phabricator.wikimedia.org/T320967) [12:45:16] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) >>! In T330495#8693088, @MoritzMuehlenhoff wrote: > Most of the installer logic has been adapted for Bookworm, but there's one puzzling issue impacting the retrieval... 
[12:48:01] (03PS1) 10Cathal Mooney: Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T332080) [12:48:38] (03CR) 10CI reject: [V: 04-1] Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [12:52:20] (03PS2) 10Cathal Mooney: Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) [12:53:09] (03PS1) 10Jaime Nuche: docker-gc: remove image from repository [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899611 [12:57:10] (03PS1) 10Hnowlan: thumbor: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/899613 [12:57:29] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol: introduce cloudlb support [puppet] - 10https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1300). [13:00:05] MatmaRex and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:34] (unable today) [13:02:25] (or can in maybe 15m) [13:02:34] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/899613 (owner: 10Hnowlan) [13:02:55] jouncebot: now [13:02:55] For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1300) [13:02:59] hi [13:03:07] (03CR) 10Herron: [C: 03+2] profile::kafka::broker::monitoring: remove under replicated icinga check [puppet] - 10https://gerrit.wikimedia.org/r/898793 (https://phabricator.wikimedia.org/T309010) (owner: 10Herron) [13:03:23] I also can’t deploy I’m afraid [13:03:24] i have next ~10 minutes only unfortunately :-( [13:03:35] sorry i'm late, hope you didn't cancel yet :) [13:03:46] MatmaRex: we're short of deployers so far [13:03:54] TheresNoTime said they can deploy in ~15 minutes [13:04:05] i guess i'll wait for daniel and hope he can ship it [13:04:11] or, or TheresNoTime [13:04:17] thanks. i can wait :) [13:04:44] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) [13:04:51] I can deploy [13:05:08] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [13:05:20] MatmaRex: can I do all three of yours at the same time or do they need some other ordering? [13:05:45] yes. 
no ordering needed [13:05:55] thanks taavi [13:06:15] (03PS2) 10Majavah: Disable visual enhancements on newsectionlink pages initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897912 (https://phabricator.wikimedia.org/T331635) (owner: 10Esanders) [13:06:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [13:06:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [13:06:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897912 (https://phabricator.wikimedia.org/T331635) (owner: 10Esanders) [13:07:25] for the record, the third one is a no-op (config for upcoming feature), i can test the other two [13:07:29] (03Merged) 10jenkins-bot: thumbor: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/899613 (owner: 10Hnowlan) [13:07:58] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [13:08:04] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:08:11] (03Merged) 10jenkins-bot: Enable new Vector (2022) "Add topic" button at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898843 (https://phabricator.wikimedia.org/T331313) (owner: 10Bartosz Dziewoński) [13:08:17] (03Merged) 10jenkins-bot: Enable DiscussionTools usability improvements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898844 (https://phabricator.wikimedia.org/T329407) (owner: 10Bartosz Dziewoński) [13:08:19] (03Merged) 10jenkins-bot: Disable visual enhancements on newsectionlink pages initially [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/897912 (https://phabricator.wikimedia.org/T331635) (owner: 10Esanders) [13:08:54] !log taavi@deploy2002 Started scap: Backport for [[gerrit:898843|Enable new Vector (2022) "Add topic" button at cswiki, huwiki (T331313)]], [[gerrit:898844|Enable DiscussionTools usability improvements at cswiki, huwiki (T329407)]], [[gerrit:897912|Disable visual enhancements on newsectionlink pages initially (T331635)]] [13:09:02] T329407: [Config] Offer Usability Improvements as default-on features at partner wikis (desktop) - https://phabricator.wikimedia.org/T329407 [13:09:03] T331313: [Config Change] Enable Vector (2022) "Add topic" button at partner wikis - https://phabricator.wikimedia.org/T331313 [13:09:03] T331635: Enable topic containers and other visual enhancements on pages using __NEWSECTIONLINK__ - https://phabricator.wikimedia.org/T331635 [13:10:31] !log taavi@deploy2002 matmarex and taavi and esanders: Backport for [[gerrit:898843|Enable new Vector (2022) "Add topic" button at cswiki, huwiki (T331313)]], [[gerrit:898844|Enable DiscussionTools usability improvements at cswiki, huwiki (T329407)]], [[gerrit:897912|Disable visual enhancements on newsectionlink pages initially (T331635)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebu [13:10:31] g1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:10:34] 10SRE-tools, 10Infrastructure-Foundations: sync firmware between cumin hosts - https://phabricator.wikimedia.org/T332158 (10Volans) One of the simplest option could be to scp/rsync the single files right after downloading them using the keyholder ssh key for cumin. [13:10:44] MatmaRex: please test [13:10:49] looking [13:12:01] taavi: looks good [13:12:07] thanks, syncing [13:12:10] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) (owner: 10Cathal Mooney) [13:12:28] (03CR) 10Jforrester: [C: 03+1] "Aha, I should do this for the CI images too just in case." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899585 (https://phabricator.wikimedia.org/T330270) (owner: 10Clément Goubert) [13:12:28] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1003.eqiad.wmnet with OS bullseye [13:15:01] (03PS3) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 [13:16:51] (ProbeDown) resolved: (34) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:56] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:898843|Enable new Vector (2022) "Add topic" button at cswiki, huwiki (T331313)]], [[gerrit:898844|Enable DiscussionTools usability improvements at cswiki, huwiki (T329407)]], [[gerrit:897912|Disable visual enhancements on newsectionlink pages initially (T331635)]] (duration: 09m 01s) [13:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:00] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:18:03] T329407: [Config] Offer Usability Improvements as default-on features at partner wikis (desktop) - https://phabricator.wikimedia.org/T329407 [13:18:04] T331313: [Config Change] Enable Vector (2022) "Add topic" button at partner wikis - https://phabricator.wikimedia.org/T331313 [13:18:04] 
T331635: Enable topic containers and other visual enhancements on pages using __NEWSECTIONLINK__ - https://phabricator.wikimedia.org/T331635 [13:18:27] MatmaRex: all done! [13:18:40] thanks! [13:19:50] ah thanks taavi [13:20:39] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:21:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, comment inline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899585 (https://phabricator.wikimedia.org/T330270) (owner: 10Clément Goubert) [13:21:18] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:22:15] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:22:58] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:49] (03CR) 10Ssingh: [V: 03+1] dnsrecursor: drop support for buster and pdns-recursor < 4.6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: 10Ssingh) [13:24:58] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:25:01] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:25:02] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:25:17] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:25:18] !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:25:59] !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:26:00] !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[13:27:13] (03CR) 10Filippo Giunchedi: [C: 03+1] service:catalogue: Add pki as an active active service [puppet] - 10https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) (owner: 10Jbond) [13:27:50] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: host reimage [13:27:55] !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:27:57] !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:28:07] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:28:14] !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:28:16] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:30:19] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:30:20] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:32:14] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: host reimage [13:32:30] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:33:18] (03CR) 10Filippo Giunchedi: "Thank you for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/899542 (owner: 10Jbond) [13:43:39] Something Swift/Thumbor-related seems to be wonky. [13:43:47] This fails: https://upload.wikimedia.org/wikipedia/commons/thumb/5/59/Venus_and_Adonis%2C_Lucrece%2C_and_the_Minor_Poems_%281927%29.djvu/page9-1536px-Venus_and_Adonis%2C_Lucrece%2C_and_the_Minor_Poems_%281927%29.djvu.jpg [13:44:00] This works: https://upload.wikimedia.org/wikipedia/commons/thumb/5/59/Venus_and_Adonis%2C_Lucrece%2C_and_the_Minor_Poems_%281927%29.djvu/page9-1537px-Venus_and_Adonis%2C_Lucrece%2C_and_the_Minor_Poems_%281927%29.djvu.jpg [13:44:19] Note 1px difference in requested thumb size. [13:44:53] The one that fails calls it a 404 error. 
Which is kinda weird for something that should be dynamically generated. [13:45:59] (seen for the first time today, but it's been a while since I last uploaded something comparable) [13:46:05] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40135/console" [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [13:48:10] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) >>! In T209149#7335003, @Tgr wrote: >>>! In T209149#7333699, @kostajh wr... [13:49:50] !log installing graphite-web security updates [13:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:13] (03PS2) 10David Caro: maintain_dbusers: remove icinga alert, we'll use the default one [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) [13:50:29] !log reprepro -C component/pdns-recursor include bullseye-wikimedia pdns-recursor_4.6.2-1+wmf11u1_amd64.changes: T321309 [13:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:34] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [13:51:05] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1003.eqiad.wmnet with OS bullseye [13:51:51] (03PS3) 10David Caro: maintain_dbusers: remove icinga alert, we'll use the default one [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) [13:52:15] (03PS2) 10Ssingh: dnsrecursor: drop support for buster and pdns-recursor < 4.6 [puppet] - 10https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) [13:52:33] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: 
sync firmware store when downloading files [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) [13:52:53] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40137/console" [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [13:53:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] service:catalogue: Add pki as an active active service [puppet] - 10https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) (owner: 10Jbond) [13:53:16] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40136/console" [puppet] - 10https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: 10Ssingh) [13:54:21] (03CR) 10Ssingh: [V: 03+1] "The idea is to transition Wikidough to bullseye and then once that happens, merge this patch and completely remove the pdns-recursor compo" [puppet] - 10https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: 10Ssingh) [13:54:54] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: sync firmware store when downloading files [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) (owner: 10Jbond) [13:55:04] (03PS1) 10Herron: alerting_host: failover icinga and alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) [13:55:25] (03Abandoned) 10Jbond: mod_auth_cas: add logout script for mod_auth_cas [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [13:55:29] (03CR) 10Majavah: [C: 03+1] maintain_dbusers: remove icinga alert, we'll use the default one [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [13:56:24] (03CR) 10David Caro: [V: 
03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40138/console" [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [13:56:41] (03PS2) 10Jbond: pki.discovery.wmnet: convert to active/active discovery record [dns] - 10https://gerrit.wikimedia.org/r/899533 (https://phabricator.wikimedia.org/T331523) [13:57:32] (03CR) 10CI reject: [V: 04-1] pki.discovery.wmnet: convert to active/active discovery record [dns] - 10https://gerrit.wikimedia.org/r/899533 (https://phabricator.wikimedia.org/T331523) (owner: 10Jbond) [13:58:46] (03CR) 10David Caro: [V: 03+1 C: 03+2] maintain_dbusers: remove icinga alert, we'll use the default one [puppet] - 10https://gerrit.wikimedia.org/r/899532 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [13:59:14] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Ottomata) Okay, we can be approvers then. [13:59:31] (03PS1) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) [13:59:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10Ottomata) Approved. [14:00:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ottomata) Approved. I believe this will need kerberos as well. 
[14:00:20] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Marostegui) [14:00:28] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10cmooney) @RobH @ayounsi in terms of the CR to ASW connectivity I think this makes sense? |CR|CR Port|ASW|ASW Port| |------------|-------------|... [14:00:54] !log nodejs security updates on buster [14:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:25] TheresNoTime, urbanecm: ugh, I missed the deploy window! got confused about time zones :( [14:05:38] (03PS4) 10Daniel Kinzler: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) [14:05:45] duesen: welcome to daylight confusion time :) [14:05:55] TheresNoTime, urbanecm: I could self-service now, if that's ok [14:06:21] fine with me, if there's nothing else happening rn [14:06:59] * duesen looks at moritzm [14:09:47] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: add golang [puppet] - 10https://gerrit.wikimedia.org/r/899636 [14:11:12] duesen: sure, go ahead [14:11:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/899636 (owner: 10Elukey) [14:12:04] !log depool dns4002 for reimaging to bullseye: T321309 [14:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:11] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [14:12:16] moritzm: ok, on it [14:12:17] !log [correction] depool _doh4002_ for reimaging to bullseye: T321309 [14:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:12:54] (03PS3) 10Jbond: pki.discovery.wmnet: convert to active/active discovery record [dns] - 10https://gerrit.wikimedia.org/r/899533 (https://phabricator.wikimedia.org/T331523) [14:12:56] (03CR) 10Volans: "LGTM, one nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) (owner: 10Jbond) [14:13:05] (03PS2) 10Elukey: profile::analytics::cluster::packages::statistics: add golang [puppet] - 10https://gerrit.wikimedia.org/r/899636 [14:13:49] (03Merged) 10jenkins-bot: Always write parsoid output to parser cache. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898795 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:13:53] (03CR) 10Elukey: "Moritz: Thanks for the review! I realized that maybe buster-backports are better for stat100x, does it still look good?" [puppet] - 10https://gerrit.wikimedia.org/r/899636 (owner: 10Elukey) [14:14:17] (03CR) 10JMeybohm: [C: 04-1] spark-operator: enable spark operator mutation webhook (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [14:14:18] !log daniel@deploy2002 Started scap: Backport for [[gerrit:898795|Always write parsoid output to parser cache. 
(T320534)]] [14:14:23] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:14:39] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host doh4002.wikimedia.org with OS bullseye [14:14:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host doh4002.wikimedia.org with OS bullseye [14:15:16] please ignore BGP alerts in ulsfo (I am on on-call so will keep an eye out on the actual ones :) [14:15:50] !log daniel@deploy2002 daniel: Backport for [[gerrit:898795|Always write parsoid output to parser cache. (T320534)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:16:59] !log jbond@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=pki [14:17:13] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40139/console" [puppet] - 10https://gerrit.wikimedia.org/r/899636 (owner: 10Elukey) [14:17:23] (03PS1) 10Muehlenhoff: Add ksarabia to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/899638 (https://phabricator.wikimedia.org/T332042) [14:18:05] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:19:03] !log update pki to use discovery record [14:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:07] (03CR) 10Jbond: [C: 03+2] pki.discovery.wmnet: convert to active/active discovery record [dns] - 10https://gerrit.wikimedia.org/r/899533 
(https://phabricator.wikimedia.org/T331523) (owner: 10Jbond) [14:20:30] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet on all recursors [14:20:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet on all recursors [14:20:36] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [14:22:04] !log switch pki to be active active [14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] (JobUnavailable) firing: Reduced availability for job wikidough in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:34] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10andrea.denisse) 05Open→03Resolved [14:22:36] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decomission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10andrea.denisse) [14:22:39] !log jbond@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=pki [14:22:46] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet on all recursors [14:22:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet on all recursors [14:22:55] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:11] ^ expected [14:23:21] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ssingh) [14:23:31] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission 
centrallog1001 - https://phabricator.wikimedia.org/T328803 (10andrea.denisse) [14:24:15] !log daniel@deploy2002 Finished scap: Backport for [[gerrit:898795|Always write parsoid output to parser cache. (T320534)]] (duration: 09m 57s) [14:24:20] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:24:33] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:36] (03PS1) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) [14:25:58] Ok, parser cache writes for parsoid output are now fully enabled (except on commons and wikidata). Cache writes are going up as expected, see https://grafana-rw.wikimedia.org/d/000000106/parser-cache?orgId=1&var-contentModel=wikitext&var-dc=eqiad&var-cache=parsoid&from=now-1h&to=now&viewPanel=14 and https://grafana.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard?orgId=1&refresh=30s&from=now-30m&to=now&viewPanel=11 [14:26:04] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10herron) [14:26:38] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ssingh) [14:27:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4002.wikimedia.org with reason: host reimage [14:27:31] xover: Probably https://phabricator.wikimedia.org/T331138 [14:28:22] (03PS1) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) [14:29:24] (03PS3) 10Jbond: wmflib::service::probe::module_options: simplify function [puppet] - 
10https://gerrit.wikimedia.org/r/899542 [14:29:26] (03PS1) 10Jbond: wmflib::service::probe::module_options: add tests [puppet] - 10https://gerrit.wikimedia.org/r/899643 [14:30:34] (03Abandoned) 10Jbond: wmflib::service::probe::module_options: simplify function [puppet] - 10https://gerrit.wikimedia.org/r/899542 (owner: 10Jbond) [14:30:42] (03CR) 10Jbond: wmflib::service::probe::module_options: simplify function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899542 (owner: 10Jbond) [14:31:01] claime: thanks [14:31:19] (03CR) 10Muehlenhoff: profile::analytics::cluster::packages::statistics: add golang (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899636 (owner: 10Elukey) [14:31:26] (03PS2) 10Clément Goubert: php7.4: Update php7.4 to latest version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899585 (https://phabricator.wikimedia.org/T330270) [14:31:44] (03CR) 10Clément Goubert: php7.4: Update php7.4 to latest version (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899585 (https://phabricator.wikimedia.org/T330270) (owner: 10Clément Goubert) [14:31:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh4002.wikimedia.org with reason: host reimage [14:32:14] (JobUnavailable) resolved: Reduced availability for job wikidough in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:32:21] jouncebot: nowandnext [14:32:21] No deployments scheduled for the next 2 hour(s) and 27 minute(s) [14:32:21] In 2 hour(s) and 27 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1700) [14:32:23] (03CR) 10Muehlenhoff: [C: 03+2] Add ksarabia to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/899638 
(https://phabricator.wikimedia.org/T332042) (owner: 10Muehlenhoff) [14:32:39] (03CR) 10CI reject: [V: 04-1] wmflib::service::probe::module_options: add tests [puppet] - 10https://gerrit.wikimedia.org/r/899643 (owner: 10Jbond) [14:32:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:33:04] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::analytics::cluster::packages::statistics: add golang [puppet] - 10https://gerrit.wikimedia.org/r/899636 (owner: 10Elukey) [14:33:38] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: sync firmware store when downloading files [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) [14:34:23] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:34:30] 10SRE-swift-storage, 10Commons: 404 error for image thumbnail file on Commons - https://phabricator.wikimedia.org/T332019 (10Stepro) the same here: https://de.wikipedia.org/wiki/Tina_Beer{F36913055} [14:35:07] (03CR) 10Andrea Denisse: "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [14:35:13] (03CR) 10Andrea Denisse: [C: 03+1] alerting_host: failover icinga and alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [14:35:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Jclark-ctr) db1207 a5 u22 Port 40 Cableid 2570 db1208 a5 u23 Port 41 Cableid 1880 db1209 a6 u24 Port 36 Cableid 1918 db1210 a6 u25 Port 41 Cableid 1... 
[14:35:45] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10andrea.denisse) [14:35:53] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: sync firmware store when downloading files [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) (owner: 10Jbond) [14:36:31] (03PS2) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) [14:36:33] (03PS2) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) [14:36:46] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:37:24] (03PS1) 10David Caro: cloud: remove replica_cnf htpassword and salt [puppet] - 10https://gerrit.wikimedia.org/r/899646 [14:38:12] (03CR) 10Filippo Giunchedi: [C: 03+1] alerting_host: failover icinga and alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [14:38:20] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] php7.4: Update php7.4 to latest version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899585 (https://phabricator.wikimedia.org/T330270) (owner: 10Clément Goubert) [14:38:27] !log Updating php7.4 production images [14:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: fix golang install [puppet] - 10https://gerrit.wikimedia.org/r/899647 [14:39:51] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [14:39:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 
analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @KSarabia-WMF I've enabled your access, but it will take up to 30 minutes unt... [14:40:20] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40142/console" [puppet] - 10https://gerrit.wikimedia.org/r/899647 (owner: 10Elukey) [14:40:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10MoritzMuehlenhoff) [14:40:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/899647 (owner: 10Elukey) [14:41:38] !log Rebuilding mw-on-k8s images - T330270 [14:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:49] !log cgoubert@deploy2002 Started scap: (no justification provided) [14:41:59] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:42:38] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::analytics::cluster::packages::statistics: fix golang install [puppet] - 10https://gerrit.wikimedia.org/r/899647 (owner: 10Elukey) [14:42:58] (03CR) 10Ahmon Dancy: [C: 03+1] docker-gc: remove image from repository [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899611 (owner: 10Jaime Nuche) [14:43:01] (03CR) 10Raymond Ndibe: [C: 03+1] "I was expecting that the secrets repo creds are going to take precedence. Interesting. 
+1 here since I can't merge puppet yet" [puppet] - 10https://gerrit.wikimedia.org/r/899646 (owner: 10David Caro) [14:45:59] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:46:51] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:47:09] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: use golang-go [puppet] - 10https://gerrit.wikimedia.org/r/899649 [14:47:34] (03PS2) 10Elukey: profile::analytics::cluster::packages::statistics: use golang-go [puppet] - 10https://gerrit.wikimedia.org/r/899649 [14:47:39] (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:47:58] (03CR) 10David Caro: [C: 03+2] cloud: remove replica_cnf htpassword and salt [puppet] - 10https://gerrit.wikimedia.org/r/899646 (owner: 10David Caro) [14:49:04] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:49:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40143/console" [puppet] - 10https://gerrit.wikimedia.org/r/899649 (owner: 10Elukey) [14:52:10] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:53:40] !log Redeploying mw-on-k8s for php7.4 update T330270 [14:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:52] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:53:55] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:53:56] !log cgoubert@deploy2002 helmfile [eqiad] START 
helmfile.d/services/mw-api-ext: apply [14:53:59] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:54:01] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:54:03] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:54:04] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:54:07] !log depool moss-fe1001 as rate of token denial is too high [14:54:08] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:54:09] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:54:09] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:11] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:54:12] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:54:16] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:19] Sorry for the spam [14:54:21] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:54:23] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:54:23] (03PS1) 10Muehlenhoff: Add jgiannelos to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/899651 (https://phabricator.wikimedia.org/T332063) [14:54:24] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:54:28] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:54:28] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:54:30] Should have inhibited SAL log [14:55:30] (03CR) 10Elukey: [V: 03+1 C: 03+2] 
profile::analytics::cluster::packages::statistics: use golang-go [puppet] - 10https://gerrit.wikimedia.org/r/899649 (owner: 10Elukey) [14:57:39] (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:58:49] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:59:08] (03CR) 10FNegri: [C: 03+2] [tbs.harbor] Fix wrong paths for Harbor certs [puppet] - 10https://gerrit.wikimedia.org/r/898773 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [14:59:09] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:00:17] (03CR) 10David Caro: "Just saw this xd" [puppet] - 10https://gerrit.wikimedia.org/r/899476 (owner: 10Majavah) [15:00:17] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:00:38] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10bking) [15:01:51] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:01:58] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) [15:03:48] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Xover) >>! In T331138#8675245, @Joe wrote: > Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) … Yes. Wikipedia p... 
[15:04:36] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:05:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10KSarabia-WMF) @MoritzMuehlenhoff Thank you! [15:07:24] (03PS3) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) [15:07:26] (03PS3) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) [15:07:28] (03PS1) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) [15:08:10] (03CR) 10Muehlenhoff: [C: 03+2] Add jgiannelos to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/899651 (https://phabricator.wikimedia.org/T332063) (owner: 10Muehlenhoff) [15:08:34] (03CR) 10Jbond: [C: 03+1] dnsrecursor: drop support for buster and pdns-recursor < 4.6 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: 10Ssingh) [15:10:27] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:10:50] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10hnowlan) >>! In T331647#8691230, @Ottomata wrote: > Hm, that group (as well as analytics-research-admins) gives some sudo rights to a system user (analytics-platform-eng) that does have analytics-privatedat... 
[15:11:14] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: sync firmware store when downloading files [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) [15:11:23] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:12:07] (03CR) 10Ayounsi: "Awesome! some small comments but overall lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [15:12:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @Jgiannelos I've enabled your access, but it will take up to 30 minutes until the change has propa... [15:19:32] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:20:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40145/console" [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:21:43] (03PS1) 10Muehlenhoff: Add approvers for analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/899653 (https://phabricator.wikimedia.org/T331647) [15:22:00] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40146/console" [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:23:59] (03PS1) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) [15:25:10] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: improve go deployment [puppet] - 10https://gerrit.wikimedia.org/r/899655 [15:25:53] 10SRE, 10SRE-Access-Requests: 
Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) Thanks, I just verified that I have ssh access and followed the steps for kerberos from the email. [15:26:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40147/console" [puppet] - 10https://gerrit.wikimedia.org/r/899655 (owner: 10Elukey) [15:27:35] (03PS2) 10Jbond: wmflib::service::probe::module_options: add tests [puppet] - 10https://gerrit.wikimedia.org/r/899643 [15:29:43] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Xover) >>! In T331138#8676812, @thcipriani wrote: > I checked [[ https://www.mediawiki.org/wiki/Developers/Maintainers | maintainer's page on MediaWiki ]] to fi... [15:29:55] (03CR) 10CI reject: [V: 04-1] wmflib::service::probe::module_options: add tests [puppet] - 10https://gerrit.wikimedia.org/r/899643 (owner: 10Jbond) [15:30:15] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/899658 [15:30:40] (03PS2) 10Elukey: profile::analytics::cluster::packages::statistics: improve go deployment [puppet] - 10https://gerrit.wikimedia.org/r/899655 [15:30:41] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [15:32:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40148/console" [puppet] - 10https://gerrit.wikimedia.org/r/899655 (owner: 10Elukey) [15:33:04] (03PS1) 10Ssingh: auditd: update location of audisp syslog.conf [puppet] - 10https://gerrit.wikimedia.org/r/899659 (https://phabricator.wikimedia.org/T321309) [15:33:29] !log mvernon@cumin1001 END (PASS) - Cookbook 
sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [15:34:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::analytics::cluster::packages::statistics: improve go deployment [puppet] - 10https://gerrit.wikimedia.org/r/899655 (owner: 10Elukey) [15:34:40] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40149/console" [puppet] - 10https://gerrit.wikimedia.org/r/899659 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:34:57] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1002.eqiad.wmnet with OS bullseye [15:34:58] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team: Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10bd808) Basically wm-bot serves the same relay bot role in the WMCS SAL logging path as logmsgbot does for wiki cluster SAL loggin... [15:36:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] auditd: update location of audisp syslog.conf [puppet] - 10https://gerrit.wikimedia.org/r/899659 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:39:01] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10MatthewVernon) This is a k8s application running on the WMF OpenStack, yes?... 
[15:39:43] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Jclark-ctr) [15:39:46] (03PS4) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) [15:39:48] (03PS4) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) [15:39:50] (03PS2) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) [15:39:54] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/899658 (owner: 10Volans) [15:41:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson @Cmjohnson can you assist with next steps of these? 
[15:42:05] (03PS1) 10Jbond: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) [15:43:00] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) (owner: 10Jbond) [15:43:43] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/899658 (owner: 10Volans) [15:44:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh4002.wikimedia.org with OS bullseye [15:44:42] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host doh4002.wikimedia.org with OS bullseye completed: - doh4002 (**PASS**) - Downtimed on Icinga/A... [15:44:48] (03PS1) 10David Caro: maintain_dbusers: move out of nfs to services [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) [15:44:52] (03CR) 10Atieno: [V: 03+2 C: 03+2] Add ability to specify filters such as sharpening and etc. 
for TIFF format [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T325770) (owner: 10Vlad.shapik) [15:44:56] (03PS1) 10David Caro: maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) [15:45:01] (03PS2) 10Jbond: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) [15:48:02] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:49:24] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: host reimage [15:49:33] (03PS3) 10Jbond: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) [15:52:19] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: host reimage [15:53:02] (03PS1) 10Volans: Upstream release v6.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/899664 [15:53:51] (03Merged) 10jenkins-bot: Add ability to specify filters such as sharpening and etc. 
for TIFF format [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T325770) (owner: 10Vlad.shapik) [15:56:54] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) [15:58:45] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:59:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor1006.eqiad.wmnet [16:01:21] (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) [16:01:29] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor1006.eqiad.wmnet [16:01:39] (03CR) 10JHathaway: [C: 03+1] "I think this is the correct change, and it can easily be reverted, if we missed something, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/898896 (https://phabricator.wikimedia.org/T331676) (owner: 10Dzahn) [16:01:45] (03PS4) 10Jbond: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) [16:02:44] !log restarted thumbor-instances on thumbor1006 [16:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:46] (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) [16:05:44] (03PS5) 10Jbond: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) [16:08:17] (03PS5) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) [16:08:47] (03PS6) 10Jbond: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) [16:12:08] (03CR) 10Volans: [C: 03+2] Upstream release v6.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/899664 (owner: 10Volans) [16:13:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 32): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40156/console" [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) (owner: 10Jbond) [16:15:18] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1002.eqiad.wmnet with OS bullseye [16:16:13] (03Merged) 10jenkins-bot: Upstream release v6.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/899664 (owner: 10Volans) [16:17:26] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync 
[16:17:55] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [16:19:11] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [16:19:37] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [16:21:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: basic bookworm classes [puppet] - 10https://gerrit.wikimedia.org/r/894717 (owner: 10Jbond) [16:25:30] (03PS3) 10Jbond: wmflib::service::probe::module_options: add tests [puppet] - 10https://gerrit.wikimedia.org/r/899643 [16:26:06] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: sync firmware store when downloading files (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) (owner: 10Jbond) [16:26:14] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10doctaxon) further three are missing when searching: https://de.wikipedia.org/w/index.php?fulltext=1&search=%22Auguste%20Bock%22&title=Spezial%3ASuche&ns0=1 Picture links are missing in sou... 
[16:27:36] (03CR) 10Jbond: [C: 03+2] openstack: basic bookworm classes [puppet] - 10https://gerrit.wikimedia.org/r/894717 (owner: 10Jbond) [16:28:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:28:40] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: sync firmware store when downloading files [cookbooks] - 10https://gerrit.wikimedia.org/r/899628 (https://phabricator.wikimedia.org/T332158) (owner: 10Jbond) [16:28:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:30:18] 10SRE-tools, 10Infrastructure-Foundations: sync firmware between cumin hosts - https://phabricator.wikimedia.org/T332158 (10jbond) this has now been added to the upgrade cookbook [16:37:07] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable frontend of link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899673 (https://phabricator.wikimedia.org/T304550) [16:38:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:46:12] (03PS2) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 7th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [16:46:14] (03PS2) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 8th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892364 (https://phabricator.wikimedia.org/T308133) [16:46:16] (03PS2) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892365 (https://phabricator.wikimedia.org/T308134) [16:48:27] (03PS3) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 7th round 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [16:48:29] (03PS3) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 8th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892364 (https://phabricator.wikimedia.org/T308133) [16:48:31] (03PS3) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892365 (https://phabricator.wikimedia.org/T308134) [16:51:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:53:10] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10jeremyb-phone) >>! In T331820#8698824, @doctaxon wrote: > further three are missing when searching: https://de.wikipedia.org/w/index.php?fulltext=1&search=%22Auguste%20Bock%22&title=Spezial... 
[16:55:53] (03CR) 10Sergio Gimeno: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan) [16:56:21] (03PS9) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [16:56:23] (03PS2) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) [16:57:48] (03CR) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1700) [17:04:08] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:05:31] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host doh4001.wikimedia.org with OS bullseye [17:05:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host doh4001.wikimedia.org with OS bullseye [17:08:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:17] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10akosiaris) [17:10:09] ^ expected, please ignore DNS alerts as brett and I are reimaging traffic hosts [17:10:25] I am keeping an eye out for other alerts if any [17:10:53] (as the person on on-call :P) [17:12:44] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:12:46] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh5001.wikimedia.org with OS bullseye [17:12:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh5001.wikimedia.org with OS bullseye [17:13:14] (JobUnavailable) firing: Reduced availability for job wikidough in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:13:17] 10SRE-tools, 10Infrastructure-Foundations: sync firmware between cumin hosts - https://phabricator.wikimedia.org/T332158 (10jbond) 05In progress→03Resolved a:03jbond [17:13:28] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:15:13] (03PS2) 10TsepoThoabala: Deploy action blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533) [17:17:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4001.wikimedia.org with reason: host reimage [17:19:52] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:20:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh4001.wikimedia.org with reason: host reimage [17:20:37] 10SRE, 10Infrastructure-Foundations, 10Traffic: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10ssingh) [17:21:17] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:22:36] 10SRE, 10Infrastructure-Foundations, 10Traffic: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (10ssingh) p:05Triage→03Low [17:23:14] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:25:15] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor2003.eqiad.wmnet [17:25:35] (03CR) 10Cwhite: [C: 03+2] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:26:23] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor2003.eqiad.wmnet [17:27:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor2005.eqiad.wmnet [17:28:59] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor2005.eqiad.wmnet [17:31:19] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:31:57] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aqu-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:20] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor2004.codfw.wmnet [17:32:53] (03CR) 10Dzahn: [C: 03+2] miscweb: 
add monitoring for design.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/898999 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [17:33:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor2004.codfw.wmnet [17:33:39] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:34:31] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:39] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor2006.codfw.wmnet [17:34:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh4001.wikimedia.org with OS bullseye [17:34:45] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host doh4001.wikimedia.org with OS bullseye completed: - doh4001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... [17:35:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:35:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor2006.codfw.wmnet [17:36:18] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (10ssingh) 05Open→03Resolved a:03ssingh We reimaged two hosts to bullseye and didn't notice any auditd failure, so confirming what @MoritzMuehlenhoff said above and marking... 
[17:36:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor1001.wmnet [17:36:39] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor1001.eqiad.wmnet [17:37:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor1001.eqiad.wmnet [17:38:14] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [17:40:27] (03PS1) 10Esanders: Enable remaining DiscussionTools visual enhancements at hu/cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899693 (https://phabricator.wikimedia.org/T329407) [17:42:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [17:43:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor1002.eqiad.wmnet [17:43:14] (JobUnavailable) resolved: (2) Reduced availability for job wikidough in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:43:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor1002.eqiad.wmnet [17:43:59] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor1005.eqiad.wmnet [17:44:44] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor1005.eqiad.wmnet [17:45:18] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor1006.eqiad.wmnet [17:45:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor1005.eqiad.wmnet [17:55:52] 10SRE, 10SRE-swift-storage, 10Commons, 10Thumbor: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 (10hnowlan) It appears that there was a lingering switchover issue related to communication between thumbor and swift. We're still investigating what caused this but it appears that new files... [17:58:23] (03CR) 10Brennen Bearnes: [C: 04-1] "Ported these changes to the scap config templates: https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/4" [puppet] - 10https://gerrit.wikimedia.org/r/896211 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [17:58:40] (03Abandoned) 10Brennen Bearnes: phabricator: update footer links for foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/896211 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [18:00:04] brennen and jeena: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1800). [18:00:04] brennen and jeena: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1800) [18:00:18] o/ [18:01:15] hnowlan: nothing with T331820 should hold train up, correct? 
[18:01:16] T331820: Upstream caches: 404 - https://phabricator.wikimedia.org/T331820 [18:01:38] brennen: no, I don't think so [18:01:48] (PS2) Sergio Gimeno: GrowthExperiments: enable frontend of link recommendation for 6th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/899673 (https://phabricator.wikimedia.org/T304550) [18:01:50] (PS4) Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation for 7,8,9th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [18:02:42] (Abandoned) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 8th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892364 (https://phabricator.wikimedia.org/T308133) (owner: Sergio Gimeno) [18:02:55] (Abandoned) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 9th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892365 (https://phabricator.wikimedia.org/T308134) (owner: Sergio Gimeno) [18:03:18] cool, thx.
[18:03:32] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:03:50] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:04:08] (03PS5) 10Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation for 7,8,9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [18:04:47] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh5001.wikimedia.org with OS bullseye [18:04:53] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh5001.wikimedia.org with OS bullseye completed: - doh5001 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabl... [18:05:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:01] !log 1.40.0-wmf.27 train (T330205): no current blockers, rolling to group1. 
[18:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:07] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [18:07:03] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/899709 (https://phabricator.wikimedia.org/T330205) [18:07:05] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/899709 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot) [18:08:05] (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/899709 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot) [18:09:45] (PS6) Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation for 7,8,9th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [18:10:25] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall) >>! In T321309#8445726, @MoritzMuehlenhoff wrote: > One thing to keep in mind for the LVSes is that Bullseye only includes Python 2 as a build dependency (at the time of the release some crucial packag...
[18:11:09] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Eevans) [18:12:28] (PS7) Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation for 7,8,9th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [18:12:45] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@8685c9e]: newly ported dags, reduce failures in map_subgraph_queries [18:12:50] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@8685c9e]: newly ported dags, reduce failures in map_subgraph_queries (duration: 00m 05s) [18:14:25] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (MoritzMuehlenhoff) >>! In T321309#8699218, @BCornwall wrote: > Implementing, testing, and debugging T200319 is a massive undertaking - One that seems inappropriate for a part of the stack that is planned on repla...
[18:14:48] (03PS1) 10Jaime Nuche: docker::gc: clean up older images from deployment servers using timer [puppet] - 10https://gerrit.wikimedia.org/r/899718 (https://phabricator.wikimedia.org/T329678) [18:15:16] (03CR) 10CI reject: [V: 04-1] docker::gc: clean up older images from deployment servers using timer [puppet] - 10https://gerrit.wikimedia.org/r/899718 (https://phabricator.wikimedia.org/T329678) (owner: 10Jaime Nuche) [18:15:21] (03PS8) 10Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) [18:16:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:46] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.27 refs T330205 [18:18:53] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [18:19:24] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh5002.wikimedia.org with OS bullseye [18:19:30] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh5002.wikimedia.org with OS bullseye [18:19:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:20:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor1006.eqiad.wmnet [18:20:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:23:52] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:16] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:55] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.40.0-wmf.27 refs T330205 (duration: 06m 08s) [18:25:04] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [18:25:42] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh6001.wikimedia.org with OS bullseye [18:25:48] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh6001.wikimedia.org with OS bullseye [18:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:26:45] (JobUnavailable) firing: Reduced availability for job wikidough in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:57] (03PS1) 10FNegri: [tbs.harbor] Clean up admin pwd management [puppet] - 10https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) [18:29:07] (03PS1) 10Jsn.sherman: Log additional click events on Special:MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899725 (https://phabricator.wikimedia.org/T326216) [18:29:25] (03CR) 10CI reject: [V: 04-1] [tbs.harbor] Clean up admin pwd management [puppet] - 10https://gerrit.wikimedia.org/r/899724 
(https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [18:30:14] (03PS1) 10Bartosz Dziewoński: Clean up DiscussionTools config for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899726 [18:30:23] (03PS2) 10FNegri: [tbs.harbor] Clean up admin pwd management [puppet] - 10https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) [18:30:51] (03PS2) 10Bartosz Dziewoński: Enable remaining DiscussionTools visual enhancements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899693 (https://phabricator.wikimedia.org/T329407) (owner: 10Esanders) [18:30:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:31:02] (03PS3) 10Bartosz Dziewoński: Enable remaining DiscussionTools visual enhancements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899693 (https://phabricator.wikimedia.org/T329407) (owner: 10Esanders) [18:31:08] (03PS2) 10Bartosz Dziewoński: Clean up DiscussionTools config for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899726 [18:33:25] (03PS3) 10Bartosz Dziewoński: Enable new Vector (2022) "Add topic" button at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898845 (https://phabricator.wikimedia.org/T331313) [18:33:27] (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools usability improvements at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898846 (https://phabricator.wikimedia.org/T329407) [18:36:45] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:37:01] 
PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:37:07] ^ expected [18:38:34] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:39:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) >>! In T321309#8699246, @MoritzMuehlenhoff wrote: >>>! In T321309#8699218, @BCornwall wrote: >> Implementing, testing, and debugging T200319 is a massive undertaking - One that seems inappropriate for... [18:41:45] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:42:36] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh6001.wikimedia.org with reason: host reimage [18:43:46] (03PS1) 10Dzahn: add new language 'anp' (Angika) [dns] - 10https://gerrit.wikimedia.org/r/899728 (https://phabricator.wikimedia.org/T332115) [18:44:24] (03CR) 10Dzahn: [C: 03+2] add new language 'anp' (Angika) [dns] - 10https://gerrit.wikimedia.org/r/899728 (https://phabricator.wikimedia.org/T332115) (owner: 10Dzahn) [18:46:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh6001.wikimedia.org with reason: host reimage [18:49:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5002.wikimedia.org with reason: host reimage [18:49:24] !log adding new language prefix anp.wikipedia.org - Angika, an Eastern Indo-Aryan language spoken in some parts of the Indian states of Bihar and Jharkhand, as well as in parts of Nepal. 
(T332115) [18:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:29] T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115 [18:49:49] mutante: TIL about this, always fascinating to see such "small" languages being added [18:51:10] sukhe: I like adding the definition. ChatGPT said: According to Ethnologue, there are approximately 720,000 speakers of the Angika language. . Wikipedia says: 740,000 [18:52:21] yeah 700k is significant I guess but not relative to the other spoken languages. fascinating still [18:52:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5002.wikimedia.org with reason: host reimage [18:52:39] but also " 30 to 40 Million (As per recent Indian References)." [18:54:15] sukhe: if you are curious the actual rules what is accepted are https://meta.wikimedia.org/wiki/Language_proposal_policy#Requisites_for_eligibility [18:55:13] TIL :) [18:55:34] plus whatever rules to get into ISO-639 [18:57:45] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:58:45] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:01:45] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:01:45] (JobUnavailable) resolved: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:40] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40158/console" [puppet] - 10https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [19:03:15] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:03:41] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh6001.wikimedia.org with OS bullseye [19:03:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh6001.wikimedia.org with OS bullseye completed: - doh6001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... [19:05:06] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh6002.wikimedia.org with OS bullseye [19:05:12] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh6002.wikimedia.org with OS bullseye [19:07:13] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:08:13] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:08:53] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:10:36] (03CR) 10Esanders: [C: 03+1] Clean up DiscussionTools config for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899726 (owner: 10Bartosz Dziewoński) [19:12:55] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:13:14] (JobUnavailable) firing: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:13:53] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:14:29] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh3001.wikimedia.org with OS bullseye [19:14:30] (03PS2) 10Cathal Mooney: Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) [19:14:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh3001.wikimedia.org with OS bullseye [19:15:00] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh5002.wikimedia.org with OS bullseye [19:15:06] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh5002.wikimedia.org with OS bullseye completed: - doh5002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[19:15:41] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:15:46] (03CR) 10Cathal Mooney: Move cloudsw prefix-list filters from templates to YAML (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:16:26] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh2001.wikimedia.org with OS bullseye [19:16:33] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh2001.wikimedia.org with OS bullseye [19:17:27] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh1001.wikimedia.org with OS bullseye [19:17:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh1001.wikimedia.org with OS bullseye [19:17:47] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:18:11] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:33] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:18:41] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:19:25] (03PS1) 10Cwhite: logstash: add rdbms log spam filter [puppet] - 10https://gerrit.wikimedia.org/r/898915 (https://phabricator.wikimedia.org/T330205) [19:20:41] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:20:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:21:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:21:07] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:21:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:21:34] (03CR) 10Cwhite: [C: 03+2] logstash: add rdbms log spam filter [puppet] - 10https://gerrit.wikimedia.org/r/898915 (https://phabricator.wikimedia.org/T330205) (owner: 10Cwhite) [19:22:41] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh6002.wikimedia.org with reason: host reimage [19:23:14] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:23:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:24:23] (03PS1) 10Majavah: extdist: Add REL1_40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899736 (https://phabricator.wikimedia.org/T329085) [19:24:52] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1013'] [19:25:35] !log brett@cumin2002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh6002.wikimedia.org with reason: host reimage [19:26:34] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh1001.wikimedia.org with reason: host reimage [19:26:53] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh2001.wikimedia.org with reason: host reimage [19:27:12] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014.mgmt.eqiad.wmnet'] [19:28:14] (JobUnavailable) firing: (4) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:28:30] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe1004'] [19:31:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh2001.wikimedia.org with reason: host reimage [19:31:52] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @papaul, I suggest we move the ports from the existing switch to the new... 
[19:32:34] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3001.wikimedia.org with reason: host reimage [19:32:38] !log herron@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1001.eqiad.wmnet with OS bullseye [19:33:14] (JobUnavailable) firing: (4) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:33:42] jouncebot: nowandnext [19:33:43] For the next 0 hour(s) and 26 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T1800) [19:33:43] In 0 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T2000) [19:33:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh1001.wikimedia.org with reason: host reimage [19:34:04] brennen: is the train ongoing or can I quickly deploy something? [19:34:55] 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Prabhat - https://phabricator.wikimedia.org/T332214 (10prabhat) [19:35:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-fe1013'] [19:35:14] taavi: train seems fairly stable at the moment, go ahead. [19:35:46] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1013'] [19:35:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899736 (https://phabricator.wikimedia.org/T329085) (owner: 10Majavah) [19:36:41] taavi: hmm - see https://phabricator.wikimedia.org/T330205#8699489 - don't know if cwhite has anything ongoing there. 
[19:36:58] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudcontrol1007: power supply temperature critical - https://phabricator.wikimedia.org/T331984 (10Jclark-ctr) Reseated power cord [19:37:03] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:37:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3001.wikimedia.org with reason: host reimage [19:37:08] (03Merged) 10jenkins-bot: extdist: Add REL1_40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899736 (https://phabricator.wikimedia.org/T329085) (owner: 10Majavah) [19:37:10] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudcontrol1007: power supply temperature critical - https://phabricator.wikimedia.org/T331984 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:37:34] !log taavi@deploy2002 Started scap: Backport for [[gerrit:899736|extdist: Add REL1_40 (T329085)]] [19:37:39] T329085: Add REL1_40 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T329085 [19:37:54] 10SRE, 10ops-eqiad, 10serviceops: Broken PSU on mw1435 - https://phabricator.wikimedia.org/T332117 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated powercord [19:38:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-fe1014.mgmt.eqiad.wmnet'] [19:39:06] !log taavi@deploy2002 taavi: Backport for [[gerrit:899736|extdist: Add REL1_40 (T329085)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [19:39:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-fe1004'] [19:39:50] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:40:17] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10cmooney) As per discussion on IRC we can re-use the [[ https://www.fs.com/products/36114.html?attribute=400&id=9735 | 40GBase-LR4 ]] QSFP+ optic... [19:40:17] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:40:29] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 183, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:40:29] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:40:59] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:41:23] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:41:43] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh6002.wikimedia.org with OS bullseye [19:41:49] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh6002.wikimedia.org with OS bullseye completed: - doh6002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[19:42:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:43:05] PROBLEM - cinder-volume process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:14] (JobUnavailable) firing: (3) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:44:09] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr Cable was replaced [19:44:44] 10SRE, 10ops-eqiad: Remove second links from cloud servers - https://phabricator.wikimedia.org/T331737 (10Jclark-ctr) a:03Jclark-ctr [19:44:44] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh2001.wikimedia.org with OS bullseye [19:44:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh2001.wikimedia.org with OS bullseye completed: - doh2001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[19:45:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:45:29] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014.mgmt.eqiad.wmnet'] [19:45:47] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh2002.wikimedia.org with OS bullseye [19:45:50] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe1004'] [19:45:54] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh2002.wikimedia.org with OS bullseye [19:46:43] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh1001.wikimedia.org with OS bullseye [19:46:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh1001.wikimedia.org with OS bullseye completed: - doh1001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[19:47:22] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10Jclark-ctr) [19:47:36] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage [19:47:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:48:02] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Removed from rack offline script ran [19:48:14] (JobUnavailable) resolved: (3) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:48:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:48:59] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh1002.wikimedia.org with OS bullseye [19:49:06] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh1002.wikimedia.org with OS bullseye [19:49:38] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:899736|extdist: Add REL1_40 (T329085)]] (duration: 12m 04s) [19:49:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:49:41] * taavi done [19:49:41] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:49:43] T329085: Add REL1_40 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T329085 [19:49:55] Is there someone I can ask to get a security patch deployed quickly? [19:50:05] hi [19:50:08] https://phabricator.wikimedia.org/T331192 [19:50:18] Presents HTML injection [19:50:36] Quite risky as the attack vector is usernames [19:50:59] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage [19:51:25] Dreamy_Jazz: at first don't talk about it on a public channel? :) [19:51:37] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:51:42] hashar: yeah we're moving to a private venue [19:51:59] Good point. Moved to private discussion. [19:52:04] +1 :] [19:52:43] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:53:01] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Confirm cable labels and add to Netbox - https://phabricator.wikimedia.org/T331709 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Confirmed and updated cableid in netbox [19:53:03] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh3001.wikimedia.org with OS bullseye [19:53:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh3001.wikimedia.org with OS bullseye completed: - doh3001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[19:53:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-fe1013'] [19:54:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-fe1014.mgmt.eqiad.wmnet'] [19:54:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-fe1004'] [19:54:15] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:54:15] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host doh3002.wikimedia.org with OS bullseye [19:54:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh3002.wikimedia.org with OS bullseye [19:55:25] (03PS1) 10Cwhite: rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) [19:56:52] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh2002.wikimedia.org with reason: host reimage [19:56:57] (03PS2) 10Cwhite: rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) [19:57:27] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:57:31] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:59:15] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T2000). Please do the needful. [20:00:04] sergi0, tsepoThoabala, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] hi [20:00:20] here... [20:00:31] Hi, I can deploy! [20:00:37] TheresNoTime: hold for a bit please [20:00:41] hi [20:00:43] taavi: ack [20:01:01] have a security fix deployment in progress, sorry [20:01:42] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh1002.wikimedia.org with reason: host reimage [20:02:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh2002.wikimedia.org with reason: host reimage [20:05:09] RECOVERY - IPMI Sensor Status on mw1435 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:05:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh1002.wikimedia.org with reason: host reimage [20:05:19] * taavi syncing [20:06:38] (03CR) 10Herron: [C: 03+1] rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) (owner: 10Cwhite) [20:08:23] (03PS3) 10Cwhite: rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) [20:10:53] TheresNoTime: I'm done, sorry about the delay [20:10:58] no problem! 
:) [20:11:05] !log deploy patch for T331192 [20:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:16] sergi0: going to start with your two patches [20:11:39] taavi: thank you! [20:11:54] TheresNoTime: great, the first one I will test in some wikis, probably not all, second is a noop, will trigger a maintenance script in the next 12-24h [20:12:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899673 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [20:12:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno) [20:12:06] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@b33bb73]: newly ported dags, reduce failures in map_subgraph_queries [20:12:15] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:12:20] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@b33bb73]: newly ported dags, reduce failures in map_subgraph_queries (duration: 00m 14s) [20:12:33] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3002.wikimedia.org with reason: host reimage [20:12:57] (03Merged) 10jenkins-bot: GrowthExperiments: enable frontend of link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899673 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [20:12:58] 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Prabhat - https://phabricator.wikimedia.org/T332214 (10HShaikh) Approving this request for Prabhat. 
He needs this access to get data analysis done for our team [20:13:00] (03Merged) 10jenkins-bot: GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno) [20:13:01] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:13:22] *not all within the backport window, will definitely test them all, but should be a pretty safe FE only change [20:13:23] !log samtar@deploy2002 Started scap: Backport for [[gerrit:899673|GrowthExperiments: enable frontend of link recommendation for 6th round wikis (T304550)]], [[gerrit:892363|GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis (T304551 T308133 T308134)]] [20:13:24] (03PS4) 10Cwhite: rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) [20:13:33] T304550: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 [20:13:33] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [20:13:33] T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 [20:13:34] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [20:14:15] (JobUnavailable) firing: (2) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:14:55] !log samtar@deploy2002 sgimeno and samtar: Backport for [[gerrit:899673|GrowthExperiments: enable frontend of link recommendation for 6th round wikis (T304550)]], [[gerrit:892363|GrowthExperiments: Enable backend of link 
recommendation for 7, 8, 9th round wikis (T304551 T308133 T308134)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:15:12] sergi0: those are live on mwdebug, could you test that first one? [20:15:22] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1001.eqiad.wmnet with OS bullseye [20:15:31] testing.. [20:15:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3002.wikimedia.org with reason: host reimage [20:16:06] (03CR) 10Cwhite: [C: 03+2] rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) (owner: 10Cwhite) [20:16:08] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:17:26] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh2002.wikimedia.org with OS bullseye [20:17:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh2002.wikimedia.org with OS bullseye completed: - doh2002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[20:18:01] TheresNoTime: looks good from my side [20:18:07] syncing :) [20:18:21] (03CR) 10Herron: [C: 03+1] rsyslog: drop rdbms log spam pre-kafka [puppet] - 10https://gerrit.wikimedia.org/r/898916 (https://phabricator.wikimedia.org/T330205) (owner: 10Cwhite) [20:19:03] (03PS3) 10Samtar: Deploy action blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533) (owner: 10TsepoThoabala) [20:19:15] (JobUnavailable) resolved: (2) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:20:06] (03PS1) 10BCornwall: sre.{ganeti,hosts}.reimage: Confirm with hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) [20:20:09] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh1002.wikimedia.org with OS bullseye [20:20:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh1002.wikimedia.org with OS bullseye completed: - doh1002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... 
[20:20:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 183, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:20:49] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:21:05] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:22:53] (03PS2) 10BCornwall: sre.{ganeti,hosts}.reimage: Confirm with hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) [20:23:36] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:899673|GrowthExperiments: enable frontend of link recommendation for 6th round wikis (T304550)]], [[gerrit:892363|GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis (T304551 T308133 T308134)]] (duration: 10m 12s) [20:23:39] sergi0: live :) [20:23:45] T304550: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 [20:23:45] T308134: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 [20:23:46] T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 [20:23:46] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [20:23:50] tsepoThoabala: ready for your patch? [20:23:55] cool, ty! 
[20:24:03] TheresNoTime  I am ready [20:24:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533) (owner: 10TsepoThoabala) [20:25:03] (03Merged) 10jenkins-bot: Deploy action blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533) (owner: 10TsepoThoabala) [20:25:14] (03PS3) 10BCornwall: sre.{ganeti,hosts}.reimage: Confirm with hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) [20:25:24] !log samtar@deploy2002 Started scap: Backport for [[gerrit:896900|Deploy action blocks on itwiki (T330533)]] [20:25:29] T330533: Deploy action blocks on itwiki - https://phabricator.wikimedia.org/T330533 [20:27:00] !log samtar@deploy2002 samtar and tsepothoabala: Backport for [[gerrit:896900|Deploy action blocks on itwiki (T330533)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:27:04] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:04] tsepoThoabala: that's live on mwdebug, can you test? [20:27:14] testing ... 
[20:27:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:30:16] TheresNoTime looks good from my side [20:30:24] Great :) syncing [20:32:50] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:33:15] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doh3002.wikimedia.org with OS bullseye [20:33:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh3002.wikimedia.org with OS bullseye completed: - doh3002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabl... [20:33:22] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:33:28] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 477, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:35:55] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:896900|Deploy action blocks on itwiki (T330533)]] (duration: 10m 30s) [20:36:00] T330533: Deploy action blocks on itwiki - https://phabricator.wikimedia.org/T330533 [20:36:08] tsepoThoabala: that's live :) [20:36:17] ty! [20:36:17] MatmaRex: ready for your patches? 
[20:36:26] sure [20:36:28] (03PS4) 10Samtar: Enable remaining DiscussionTools visual enhancements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899693 (https://phabricator.wikimedia.org/T329407) (owner: 10Esanders) [20:36:29] both can go at the same time [20:36:58] (03PS3) 10Samtar: Clean up DiscussionTools config for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899726 (owner: 10Bartosz Dziewoński) [20:37:02] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aqu-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:24] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:37:32] (03PS1) 10Cwhite: rsyslog: move spam filter into ruleset [puppet] - 10https://gerrit.wikimedia.org/r/898917 (https://phabricator.wikimedia.org/T330205) [20:37:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899693 (https://phabricator.wikimedia.org/T329407) (owner: 10Esanders) [20:37:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899726 (owner: 10Bartosz Dziewoński) [20:38:49] (03CR) 10Herron: [C: 03+1] rsyslog: move spam filter into ruleset [puppet] - 10https://gerrit.wikimedia.org/r/898917 (https://phabricator.wikimedia.org/T330205) (owner: 10Cwhite) [20:39:06] (03Merged) 10jenkins-bot: Enable remaining DiscussionTools visual enhancements at cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899693 (https://phabricator.wikimedia.org/T329407) (owner: 10Esanders) [20:39:11] (03Merged) 10jenkins-bot: Clean up DiscussionTools config for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899726 (owner: 10Bartosz Dziewoński) [20:39:32] !log 
samtar@deploy2002 Started scap: Backport for [[gerrit:899693|Enable remaining DiscussionTools visual enhancements at cswiki, huwiki (T329407)]], [[gerrit:899726|Clean up DiscussionTools config for mediawikiwiki]] [20:39:37] T329407: [Config] Offer Usability Improvements as default-on features at partner wikis (desktop) - https://phabricator.wikimedia.org/T329407 [20:39:50] (03CR) 10Cwhite: [C: 03+2] rsyslog: move spam filter into ruleset [puppet] - 10https://gerrit.wikimedia.org/r/898917 (https://phabricator.wikimedia.org/T330205) (owner: 10Cwhite) [20:41:01] !log samtar@deploy2002 matmarex and samtar and esanders: Backport for [[gerrit:899693|Enable remaining DiscussionTools visual enhancements at cswiki, huwiki (T329407)]], [[gerrit:899726|Clean up DiscussionTools config for mediawikiwiki]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:41:13] MatmaRex: live on mwdebug for testing [20:42:37] TheresNoTime: everything looks good [20:42:42] syncing :) [20:42:47] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Dylsss) See also: T32861 [20:44:40] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:48:19] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:899693|Enable remaining DiscussionTools visual enhancements at cswiki, huwiki (T329407)]], [[gerrit:899726|Clean up DiscussionTools config for mediawikiwiki]] (duration: 08m 46s) [20:48:25] T329407: [Config] Offer Usability Improvements as default-on features at partner wikis (desktop) - https://phabricator.wikimedia.org/T329407 [20:48:29] MatmaRex: aaaand live [20:48:37] thank you TheresNoTime [20:48:58] !log close UTC late backport window [20:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:34] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@10fea1f]: correct arguments to RangeHivePartitionSensor [20:51:50] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@10fea1f]: correct arguments to RangeHivePartitionSensor (duration: 00m 16s) [20:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:54:42] !log starting phabricator window a touch early with a test deploy to phab2002 [20:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:32] !log brennen@deploy2002 Started deploy [phabricator/deployment@9e9b406]: test deploy of current state to phab2002 (T331915) [20:55:37] T331915: Phabricator deployment 2023-03-15 - https://phabricator.wikimedia.org/T331915 [20:56:03] !log brennen@deploy2002 Finished deploy [phabricator/deployment@9e9b406]: test deploy of current state to phab2002 (T331915) (duration: 00m 31s) [21:00:05] brennen and mutante: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T2100). 
[21:01:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:01:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:02:25] 10SRE-swift-storage: Bring ms-fe201[3-4] into service - https://phabricator.wikimedia.org/T331178 (10Eevans) 05Open→03Resolved [21:07:16] 10SRE, 10DNS, 10Traffic: Acquire the enwp.org domain - https://phabricator.wikimedia.org/T32861 (10Aklapper) Continuation in T332220... [21:08:28] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on phab2002.codfw.wmnet,phab1004.eqiad.wmnet with reason: maintenance [21:08:47] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab2002.codfw.wmnet,phab1004.eqiad.wmnet with reason: maintenance [21:13:31] !log phabricator - maintenance window starting - expect possible downtime [21:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:44] !log phab* - upgrading PHP packages [21:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:05] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@c316893]: Deploying analytics dags [airflow-dags@c316893] [21:19:16] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@c316893]: Deploying analytics dags [airflow-dags@c316893] (duration: 00m 11s) [21:25:44] !log brennen@deploy2002 Started deploy [phabricator/deployment@9e9b406]: deploy latest wmf/stable to phab1004 (T331915) [21:25:49] 
T331915: Phabricator deployment 2023-03-15 - https://phabricator.wikimedia.org/T331915 [21:26:36] !log brennen@deploy2002 Finished deploy [phabricator/deployment@9e9b406]: deploy latest wmf/stable to phab1004 (T331915) (duration: 00m 52s) [21:39:40] (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: modernize log msgs [cookbooks] - 10https://gerrit.wikimedia.org/r/899789 (https://phabricator.wikimedia.org/T331303) [21:44:20] (03CR) 10Volans: "I don't mind the change and technically LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [21:45:53] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) >>! In T330693#8697098, @gmodena wrote: >>>! In T330693#8696219, @Eev... [21:46:01] !log brennen@deploy2002 Started deploy [phabricator/deployment@982c225]: follow-up deploy for too large file message (T331915, T155130) [21:46:07] T155130: Unclear error message when uploading a larger attachment: "Exception: No configured storage engine can store this file." 
- https://phabricator.wikimedia.org/T155130 [21:46:08] T331915: Phabricator deployment 2023-03-15 - https://phabricator.wikimedia.org/T331915 [21:46:29] !log brennen@deploy2002 Finished deploy [phabricator/deployment@982c225]: follow-up deploy for too large file message (T331915, T155130) (duration: 00m 28s) [21:46:53] !log brennen@deploy2002 Started deploy [phabricator/deployment@982c225]: follow-up deploy for too large file message (T331915, T155130) [21:47:33] !log brennen@deploy2002 Finished deploy [phabricator/deployment@982c225]: follow-up deploy for too large file message (T331915, T155130) (duration: 00m 40s) [21:49:37] (03CR) 10Bking: [C: 03+1] sre.elasticsearch.rolling-operation: modernize log msgs [cookbooks] - 10https://gerrit.wikimedia.org/r/899789 (https://phabricator.wikimedia.org/T331303) (owner: 10Ryan Kemper) [21:50:04] (03CR) 10Ryan Kemper: [C: 03+2] sre.elasticsearch.rolling-operation: modernize log msgs [cookbooks] - 10https://gerrit.wikimedia.org/r/899789 (https://phabricator.wikimedia.org/T331303) (owner: 10Ryan Kemper) [21:52:38] (03Abandoned) 10Dzahn: phab: Improve error message for too large file uploads [puppet] - 10https://gerrit.wikimedia.org/r/877188 (https://phabricator.wikimedia.org/T155130) (owner: 10Aklapper) [21:59:36] !log end of phabricator update window (T331915) [21:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:42] T331915: Phabricator deployment 2023-03-15 - https://phabricator.wikimedia.org/T331915 [22:07:58] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e17ee96]: max_partition macro now returns str [22:08:13] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e17ee96]: max_partition macro now returns str (duration: 00m 14s) [22:11:18] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10BCornwall) p:05Triage→03Low [22:18:40] 10SRE, 10Traffic, 10Performance-Team (Radar): Adapt all the things to localized 
Special: namespaces - https://phabricator.wikimedia.org/T105434 (10BCornwall) 05Open→03Invalid The ticket is too broad for simple action, and too old. It may be important, in which case please re-open with more specific details. [22:20:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:24:55] jouncebot nowandnext [22:24:55] For the next 0 hour(s) and 35 minute(s): Phabricator update window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230315T2100) [22:24:55] In 7 hour(s) and 35 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0600) [22:24:55] In 7 hour(s) and 35 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0600) [22:25:07] rolling out a revert for one of these patches here. 
[22:28:43] !log brennen@deploy2002 Started deploy [phabricator/deployment@95b4f4b]: revert other assignee (T331915) [22:28:48] T331915: Phabricator deployment 2023-03-15 - https://phabricator.wikimedia.org/T331915 [22:29:11] !log brennen@deploy2002 Finished deploy [phabricator/deployment@95b4f4b]: revert other assignee (T331915) (duration: 00m 28s) [22:29:28] !log brennen@deploy2002 Started deploy [phabricator/deployment@95b4f4b]: revert other assignee (T331915) [22:30:23] !log brennen@deploy2002 Finished deploy [phabricator/deployment@95b4f4b]: revert other assignee (T331915) (duration: 00m 55s) [22:36:46] (03PS1) 10Dzahn: requesttracker: limit http monitoring to IPv4, for now [puppet] - 10https://gerrit.wikimedia.org/r/899807 (https://phabricator.wikimedia.org/T327978) [22:40:20] (03CR) 10Dzahn: [C: 03+2] requesttracker: limit http monitoring to IPv4, for now [puppet] - 10https://gerrit.wikimedia.org/r/899807 (https://phabricator.wikimedia.org/T327978) (owner: 10Dzahn) [22:41:54] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Legoktm) ` >>> len("enwp.org") 8 >>> len("w.wiki") 6 ` Not to mention that in most cases w.wiki will generate shorter URLs than enwp.org. I think we'd be better off focusing improving the w.wiki service and getting... [22:44:22] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Dzahn) +1 to what Legoktm said, it was quite the effort to introduce our very own w.wiki which is already official and shorter. Introducing a second "2nd tier" redirector also operated by us seems like future tech d... 
[22:50:12] (03CR) 10Dzahn: [C: 03+2] "design.wikimedia.org checked on miscweb* appears to have the same issue, refused via IPv6" [puppet] - 10https://gerrit.wikimedia.org/r/899807 (https://phabricator.wikimedia.org/T327978) (owner: 10Dzahn) [22:54:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:55:14] !log Removing 1 file for legal compliance [22:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:55:52] (03PS1) 10Dzahn: follow-up to Icfbf667ff3e629d [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) [22:56:02] (03CR) 10CI reject: [V: 04-1] follow-up to Icfbf667ff3e629d [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:56:33] (03PS2) 10Dzahn: miscweb: limit http monitoring for design.wm.org to IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) [22:56:44] (03CR) 10CI reject: [V: 04-1] miscweb: limit http monitoring for design.wm.org to IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:57:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 251.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [22:58:04] (03PS3) 10Dzahn: miscweb: limit http monitoring 
for design.wm.org to IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) [23:00:14] (03PS4) 10Dzahn: miscweb: limit http monitoring for design.wm.org to IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) [23:00:25] (03CR) 10Dzahn: [C: 03+2] miscweb: limit http monitoring for design.wm.org to IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [23:00:28] (03CR) 10Dzahn: [V: 03+2 C: 03+2] miscweb: limit http monitoring for design.wm.org to IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/899836 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [23:03:17] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Prabhat - https://phabricator.wikimedia.org/T332214 (10Ottomata) Approved! [23:03:40] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10TheresNoTime) Also agreed irt phasing out `enwp.org` in favour of `w.wiki` (be nice if you could do `w.wiki/en/A_page` or something, but that's for another time..) //however// that desire isn't mutually exclusive to... [23:09:09] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10violetwtf) @Legoktm The viability of this as a URL shortener is not relevant to the discussion. I am not proposing that we create a URL shortener, I am proposing that we take one on that already exists so that we ca... [23:11:07] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) There are more details about the errors in the FileOperation channel, for example [[https://l... [23:11:59] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Dzahn) >>! 
In T332220#8700427, @TheresNoTime wrote: > I'd personally much rather **we** decide to break a load of enwp.org links now, than the domain expire one day and be used for something malicious.. While that'... [23:15:09] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Dzahn) > better for them to use it with the safety of a WMF implementation behind it. Sure, that's true. Not arguing with that. > The technical effort is beyond minimal to keep this running, But at the same time... [23:15:18] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10violetwtf) >>! In T332220#8700465, @Dzahn wrote: >"write a bunch of complex rewrite rules to rewrite old URL shortener URLs to new URL shortener URLs and maintain them forever". The rewrite rules in question are: *... [23:19:38] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:21:30] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:22:06] 
(CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 259.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [23:22:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [23:24:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 218k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [23:24:45] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Dzahn) >>! In T332220#8700472, @violetwtf wrote: > Please do not exaggerate the complexity of this task to make a point That is if you want to keep the old URLs around forever and not migrate them. You will also nee... [23:27:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [23:28:23] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10violetwtf) >>! In T332220#8700500, @Dzahn wrote: > That is if you want to keep the old URLs around forever and not migrate them. enwp.org has never been and ideally will never be responsible for things such as Wikipe... [23:29:42] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10bd808) This is not a declaration that this is a good idea, but if I understand the current `enwp.org` behavior, I think this would replace it: ` diff --git i/modules/ncredir/files/nc_redirects.dat w/modules/ncredir/f... 
[23:30:24] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Legoktm) >>! In T332220#8700427, @TheresNoTime wrote: > ...//however// that desire isn't mutually exclusive to having the domain donated to us — I'd personally much rather **we** decide to break a load of enwp.org li... [23:33:07] I did *not* expect that task to be as dramatic (: [23:34:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 200.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [23:37:54] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10violetwtf) >>! In T332220#8700506, @bd808 wrote: > This is not a declaration that this is a good idea, but if I understand the current `enwp.org` behavior, I think this would replace it: Almost. I am editing this d... [23:44:44] making those 2 lines of code changes is definitely not going to be the end of it. that's what I can say [23:48:41] which we had already agreed on previously. then saying right after "we are just deploying 2 lines of code" seems to be downplaying it [23:50:31] I can certainly empathize with the frustration of something which appears to be "just a few rewrite rules" [23:50:53] out of all people you could pick to say "if you have been on IRC long enough" .. picking legoktm.. is pretty intense [23:51:53] RIP my freenode registration date, which was 2007 [23:51:57] That did make me chuckle a little yeah :p though in their defence, I *do* use enwp.org a lot on IRC.. [23:51:57] TheresNoTime: well, yea _appears_ is the point [23:52:14] I'm sorry if I made that all worse mutante. 
I had that comment open for quite a while and did not refresh the task to see all the new discussion before I posted it. :/ [23:53:20] bd808: don't worry, it's fine and I know exactly how that goes with open phab comments [23:53:21] * bd808 misses the bugzilla prompt about that all these years later [23:54:05] mutante: oh for sure, I'm sat somewhere in the middle on that whole task - if touching any of the wm infra has taught me anything, lots of things look simple until you start :D [23:54:35] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10tstarling) Breaking down swift_proxy_server_errors_total by instance, you can see that ms-fe2009 was at... [23:54:59] I mean, really I think the main point is that enwp can be used with handwritten URLs, or by bots auto-linking things, in a concise way that doesn't require any API call to e.g., make a new shortlink [23:55:23] but I think legoktm is right that there's no demonstrated need for WMF to get control of it, unless there's reason to believe it won't be maintained for much longer [23:56:13] there has never been a "one off" thing that really stayed "one off" for 10 years [23:57:16] Nothing sticks around longer than a temporary fix, etc [23:58:32] (though it did get me thinking, `w.wiki/en/WP:AFD` -style links for w.wiki would be nice...) [23:58:54] TheresNoTime: that's a pretty wikipedia centric url scheme [23:59:33] The world revolves around the wikipedia projects. That's why its logo is in the shape of a globe. [23:59:35] I was thinking the `en` could just be the interwiki prefix?