[00:00:47] !log zabe@deploy2002 zabe: Backport for [[gerrit:903803|throttle: Remove expired throttle]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[00:06:38] (PS1) Dzahn: vrts: replace Icinga with Prometheus for SMTP monitoring [puppet] - https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901)
[00:06:43] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:903803|throttle: Remove expired throttle]] (duration: 07m 19s)
[00:09:08] (PS1) Dzahn: phabricator: replace Icinga with Prometheus for SMTP monitoring [puppet] - https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901)
[00:30:46] !log restart pybal on lvs1018 to hopefully resolve flapping BGP session
[00:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:31] SRE, ops-codfw, Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (ssingh) Thanks @Jhancock.wm for the fix! I can confirm the host has been resolved. For posterity: repooling the host.
[00:37:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp2035.codfw.wmnet
[00:37:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2035.codfw.wmnet
[00:42:34] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet,service=cdn
[00:42:34] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet,service=ats-be
[01:17:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:22:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:54:35] (PS1) Nray: Update "United States" static page to facilitate synthetic testing of T331681 [mediawiki-config] - https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681)
[01:55:06] (PS2) Nray: Update "United States" static page to facilitate synthetic testing of T331681 [mediawiki-config] - https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681)
[02:04:03] SRE, Observability-Logging, Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (RLazarus) No, it's a good question! I think like any other incident, it's worth writing a report when there's something we can learn. In this case, I think we stand to get...
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:08] (PS1) Andrew Bogott: trove: move usage_timeout from mysql section to [defaults] [puppet] - https://gerrit.wikimedia.org/r/903840
[03:23:09] (CR) Andrew Bogott: [C: +2] trove: move usage_timeout from mysql section to [defaults] [puppet] - https://gerrit.wikimedia.org/r/903840 (owner: Andrew Bogott)
[04:30:04] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Marostegui) @papaul please do not reimage db1206, that host is already in production. We bought it in advance to test the raid controller as it's a new one. So it's serving traffic.
[04:38:48] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul) @Marostegui it is db1207 and db1208 not db1206.
[04:40:19] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Marostegui) Great, as you mentioned db1206 earlier I got scared :)
[05:11:52] (CR) Samwilson: "I think this can be scheduled for deployment now." [mediawiki-config] - https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: Samwilson)
[05:12:44] (PS2) Samwilson: Remove WikiEditor's Realtime Preview config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515)
[05:22:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T0600)
[06:20:06] SRE, Infrastructure-Foundations, Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (ayounsi) Brett, this project on which Jameel is working for his internship, is to collect latency data from users to all of our DCs. This will help improve our current [[ https://gerrit.wi...
[06:21:12] SRE, Infrastructure-Foundations, Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (ayounsi) a:JameelKaisar
[06:21:53] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108
[06:22:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108
[06:28:55] jouncebot: now
[06:28:55] For the next 0 hour(s) and 31 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T0600)
[06:29:18] I am going to restart Gerrit to update some plugins
[06:31:29] SRE, ChangeProp, Parsoid, RESTBase: HTTP 403 "Rerenders for this article are blacklisted in the config." via restbase for specific Commons pages - https://phabricator.wikimedia.org/T333069 (Legoktm) known bad servers: restbase1031, restbase1032, restbase1033. I poked a bit, is it possible these...
[06:32:37] (CR) Hashar: [C: +2] wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[06:33:13] (CR) Hashar: [C: +2] wm-checks-api: parse PCC full message [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/903643 (owner: Hashar)
[06:33:19] (Merged) jenkins-bot: wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[06:34:10] SRE, ChangeProp, Parsoid, RESTBase: HTTP 403 "Rerenders for this article are blacklisted in the config." via restbase for specific Commons pages - https://phabricator.wikimedia.org/T333069 (Legoktm) p:Triage→Unbreak! Marking as UBN so this can be triaged appropriately. Not having deployme...
[06:34:46] (PS1) KartikMistry: CX3 Build 0.2.0+20230329 [extensions/ContentTranslation] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/903808 (https://phabricator.wikimedia.org/T333128)
[06:35:07] SRE, ChangeProp, Parsoid, RESTBase: Multiple restbase servers have not received any deployments since at least October 2022 - https://phabricator.wikimedia.org/T333069 (Legoktm)
[06:36:47] (PS2) Hashar: wm-checks-api: parse PCC full message [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/903643
[06:37:15] (CR) Hashar: [C: +2] wm-checks-api: parse PCC full message [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/903643 (owner: Hashar)
[06:37:45] (Merged) jenkins-bot: wm-checks-api: parse PCC full message [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/903643 (owner: Hashar)
[06:38:34] !log phedenskog@deploy2002 Started deploy [performance/navtiming@f6c9fa3]: (no justification provided)
[06:38:40] !log phedenskog@deploy2002 Finished deploy [performance/navtiming@f6c9fa3]: (no justification provided) (duration: 00m 05s)
[06:40:30] !log hashar@deploy2002 Started deploy [gerrit/gerrit@e7c1696]: Update Gerrit javascript plugins
[06:40:31] SRE, ChangeProp, Parsoid, RESTBase: Multiple restbase servers have not received any deployments since at least October 2022 - https://phabricator.wikimedia.org/T333069 (Joe) Very simply, those 3 servers are not in the targets file for deployment.
[06:40:36] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@e7c1696]: Update Gerrit javascript plugins (duration: 00m 06s)
[06:42:27] !log gerrit2002: restarted Gerrit replica instance
[06:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:34] !log hashar@deploy2002 Started deploy [gerrit/gerrit@e7c1696]: Update Gerrit javascript plugins
[06:43:44] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@e7c1696]: Update Gerrit javascript plugins (duration: 00m 10s)
[06:47:27] !log Restarted Gerrit
[06:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:44] (CR) Slyngshede: [V: +1 C: +2] P:url_downloader send Squid access logs to Logstash [puppet] - https://gerrit.wikimedia.org/r/903265 (owner: Slyngshede)
[06:59:12] !log oblivian@deploy2002 Started deploy [restbase/deploy@11477d6]: Updating stale nodes, T333069
[06:59:18] T333069: Multiple restbase servers have not received any deployments since at least October 2022 - https://phabricator.wikimedia.org/T333069
[07:00:04] Amir1, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T0700).
[07:00:04] AaronSchulz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:06] (CR) Hashar: doc: upgrade php from 7.3 to 7.4 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[07:02:37] gah, wrong window
[07:03:05] * AaronSchulz updates the deploy page
[07:03:18] <_joe_> AaronSchulz: ahah I was about to ask where was your patch gone :D
[07:06:52] SRE, ChangeProp, Parsoid, RESTBase, Patch-For-Review: Multiple restbase servers have not received any deployments since at least October 2022 - https://phabricator.wikimedia.org/T333069 (Joe) Adding @hnowlan as FYI so that we are careful about this in the future. I would add that probably it...
[07:07:24] ah. Edit conflict :~
[07:07:44] I'll do deployment then..
[07:07:53] !log Update Squid logformat
[07:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:57] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/903808 (https://phabricator.wikimedia.org/T333128) (owner: KartikMistry)
[07:10:14] (CR) Slyngshede: [C: +2] C:idm::jobs absent permission sync. [puppet] - https://gerrit.wikimedia.org/r/903647 (owner: Slyngshede)
[07:21:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:24:10] (Merged) jenkins-bot: CX3 Build 0.2.0+20230329 [extensions/ContentTranslation] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/903808 (https://phabricator.wikimedia.org/T333128) (owner: KartikMistry)
[07:24:59] !log kartik@deploy2002 Started scap: Backport for [[gerrit:903808|CX3 Build 0.2.0+20230329 (T333128 T328533 T317995)]]
[07:25:08] T333128: Links are not editable in Section Translation editor - https://phabricator.wikimedia.org/T333128
[07:25:08] T328533: In progress-translations not displayed for new articles on mobile despite being persisted - https://phabricator.wikimedia.org/T328533
[07:25:08] T317995: Instrument follow-up invite shown after publishing - https://phabricator.wikimedia.org/T317995
[07:26:28] !log kartik@deploy2002 kartik: Backport for [[gerrit:903808|CX3 Build 0.2.0+20230329 (T333128 T328533 T317995)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[07:27:42] SRE, ChangeProp, Parsoid, RESTBase, Patch-For-Review: Multiple restbase servers have not received any deployments since at least October 2022 - https://phabricator.wikimedia.org/T333069 (Joe) Open→Resolved a:Joe All stale nodes have been updated.
[07:27:46] !log installed spicerack v6.4.0 on cumin2002
[07:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:21] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[07:29:46] (PS4) JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943)
[07:29:48] (PS1) JMeybohm: k8s: Remove deprecated typology annotaions [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066)
[07:31:20] !log oblivian@deploy2002 Finished deploy [restbase/deploy@11477d6]: Updating stale nodes, T333069 (duration: 32m 07s)
[07:31:26] T333069: Multiple restbase servers have not received any deployments since at least October 2022 - https://phabricator.wikimedia.org/T333069
[07:32:02] (CR) CI reject: [V: -1] k8s: Remove deprecated typology annotaions [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm)
[07:34:16] (PS2) JMeybohm: k8s: Remove deprecated typology annotaions [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066)
[07:34:18] (PS5) JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943)
[07:34:50] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2003.codfw.wmnet with reason: Stop kafka, dist-upgrade
[07:35:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2003.codfw.wmnet with reason: Stop kafka, dist-upgrade
[07:36:11] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[07:36:19] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (ayounsi)
[07:36:42] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) Open→Resolved a:ayounsi Thanks again everybody!
[07:37:34] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:903808|CX3 Build 0.2.0+20230329 (T333128 T328533 T317995)]] (duration: 12m 35s)
[07:37:43] T333128: Links are not editable in Section Translation editor - https://phabricator.wikimedia.org/T333128
[07:37:43] T328533: In progress-translations not displayed for new articles on mobile despite being persisted - https://phabricator.wikimedia.org/T328533
[07:37:44] T317995: Instrument follow-up invite shown after publishing - https://phabricator.wikimedia.org/T317995
[07:37:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE ipamhandles) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:41:50] (CR) Elukey: k8s: Remove deprecated typology annotaions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm)
[07:42:14] (CR) JMeybohm: [V: +1 C: +2] k8s: Remove 1.16 related code (v2) [puppet] - https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:42:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE ipamhandles) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:43:19] (PS1) Filippo Giunchedi: Revert "alertmanager: delete unused serviceops-collab receivers" [puppet] - https://gerrit.wikimedia.org/r/904069
[07:46:01] (CR) Filippo Giunchedi: [C: +2] Revert "alertmanager: delete unused serviceops-collab receivers" [puppet] - https://gerrit.wikimedia.org/r/904069 (owner: Filippo Giunchedi)
[07:48:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[07:51:48] (CR) Filippo Giunchedi: [C: +1] pybal: Add runbook link to alert [alerts] - https://gerrit.wikimedia.org/r/903777 (https://phabricator.wikimedia.org/T310933) (owner: BCornwall)
[07:52:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:53:13] (PS3) JMeybohm: k8s: Remove deprecated topology annotaions [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066)
[07:53:15] (PS6) JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943)
[07:53:30] (CR) JMeybohm: k8s: Remove deprecated topology annotaions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm)
[07:54:02] (CR) Vgutierrez: [C: +1] "VCL looks good and tests are happy (tested against PS31):" [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[07:54:15] (PS1) Slyngshede: C:idm::jobs ensure correct settings are used. [puppet] - https://gerrit.wikimedia.org/r/904056
[07:54:49] (CR) CI reject: [V: -1] C:idm::jobs ensure correct settings are used. [puppet] - https://gerrit.wikimedia.org/r/904056 (owner: Slyngshede)
[07:55:50] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40385/console" [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm)
[07:56:16] (PS2) Slyngshede: C:idm::jobs ensure correct settings are used. [puppet] - https://gerrit.wikimedia.org/r/904056
[07:56:18] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40387/console" [puppet] - https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[07:56:42] (CR) Hashar: [C: +2] "I have confirmed the error has gone from I8ed3ff5d7712569b74a23936aba43e5039b91b00" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/903643 (owner: Hashar)
[07:57:33] (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:57:51] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40388/console" [puppet] - https://gerrit.wikimedia.org/r/903693 (https://phabricator.wikimedia.org/T332869) (owner: Jelto)
[07:58:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[07:58:19] (CR) Vgutierrez: [C: +1] pybal: Add runbook link to alert [alerts] - https://gerrit.wikimedia.org/r/903777 (https://phabricator.wikimedia.org/T310933) (owner: BCornwall)
[08:01:35] (CR) Jelto: [V: +1 C: +2] aphlict: pass ensure flags to logrotate timer [puppet] - https://gerrit.wikimedia.org/r/903693 (https://phabricator.wikimedia.org/T332869) (owner: Jelto)
[08:03:57] !log installed spicerack v6.4.0 on cumin1001
[08:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:11] (PS3) Slyngshede: C:idm::jobs ensure correct settings are used. [puppet] - https://gerrit.wikimedia.org/r/904056
[08:05:49] (CR) Ayounsi: [C: +2] Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[08:07:05] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40390/console" [puppet] - https://gerrit.wikimedia.org/r/904056 (owner: Slyngshede)
[08:10:15] (CR) David Caro: maintain-dbusers: refactor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[08:12:18] SRE, SRE-Access-Requests: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (abi_) >>! In T333298#8734026, @Ladsgroup wrote: > you'll have access with the new keys in thirty minutes Thanks. I have access now.
[08:13:02] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (MatthewVernon) Hi folks. I am happy for you to do the firmware update first if you think that's the best approach. Please do so at your earliest convenience - a note here when you...
[08:15:30] (CR) Ayounsi: [C: +2] k8s FERM: allow gateway and infra ranges by default [puppet] - https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[08:18:55] (CR) David Caro: maintain-dbusers: run isort and black and use pep563 types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[08:22:07] (CR) Stevemunene: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[08:23:22] (CR) Alexandros Kosiaris: [C: +1] k8s: Remove deprecated topology annotaions [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm)
[08:28:19] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/903654 (https://phabricator.wikimedia.org/T288622) (owner: Herron)
[08:31:04] (CR) Jcrespo: [C: +2] Revert "bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs" [puppet] - https://gerrit.wikimedia.org/r/903201 (owner: Dzahn)
[08:31:12] (CR) Btullis: [C: +1] "Awesome. Please feel free to go ahead elukey :-)" [puppet] - https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[08:37:13] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/903658 (owner: Herron)
[08:38:54] (PS1) Hashar: prometheus: add instance label for Gerrit [puppet] - https://gerrit.wikimedia.org/r/904058
[08:41:33] (CR) Slyngshede: [V: +1 C: +2] C:idm::jobs ensure correct settings are used. [puppet] - https://gerrit.wikimedia.org/r/904056 (owner: Slyngshede)
[08:47:28] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (TheDJ) >>! In T333042#8736491, @Lionel_Scheepmans wrote: > Hi folks. > > I'm in front of a very strange phenomenon probably linked to this bug, and th...
[08:48:13] SRE-swift-storage, MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (TheDJ) Found another DC inconsistency for an upload from the 23rd of March T333042#8737515
[08:48:19] (PS2) Hashar: prometheus: add instance label for Gerrit [puppet] - https://gerrit.wikimedia.org/r/904058
[08:51:14] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (TheDJ) OK, for doc taxon I can also reproduce now with that one specific link that AntiCompositeNumber found: for DC in esams eqiad codfw ulsfo eqsin...
[08:52:56] (CR) Hashar: "I have added the documentation on a deployment runbook https://wikitech.wikimedia.org/w/index.php?title=Gerrit%2FUpgrade&diff=2064688&oldi" [puppet] - https://gerrit.wikimedia.org/r/904058 (owner: Hashar)
[08:54:14] (PS1) Slyngshede: C:idm::jobs Correct settings env variable. [puppet] - https://gerrit.wikimedia.org/r/904059
[08:54:21] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (TheDJ)
[08:54:32] SRE-swift-storage, MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (TheDJ)
[08:54:55] (CR) Slyngshede: [C: +2] C:idm::jobs Correct settings env variable. [puppet] - https://gerrit.wikimedia.org/r/904059 (owner: Slyngshede)
[08:55:30] (CR) Volans: [C: +2] run_cookbook: fix/improve calls to other cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/902449 (owner: Volans)
[08:57:47] (Merged) jenkins-bot: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/902449 (owner: Volans)
[08:58:48] (CR) Filippo Giunchedi: "Thank you for the patch!" [puppet] - https://gerrit.wikimedia.org/r/904058 (owner: Hashar)
[09:00:37] jouncebot: next
[09:00:37] In 0 hour(s) and 59 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1000)
[09:02:10] (PS1) Filippo Giunchedi: Revert "wmnet: move reads to graphite2004" [dns] - https://gerrit.wikimedia.org/r/904073
[09:02:21] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-jumbo1001.eqiad.wmnet with reason: restart kafka, upgrade to PKI
[09:02:32] (CR) Elukey: [V: +1 C: +2] Move kafka-jumbo1001's kafka broker to PKI certs [puppet] - https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[09:02:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-jumbo1001.eqiad.wmnet with reason: restart kafka, upgrade to PKI
[09:02:52] !log move kafka on kafka-jumbo1001 to PKI TLS certs - T296064
[09:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:57] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[09:03:55] (PS6) Clément Goubert: service_catalog: Add mw-api-int k8s service - 1 [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[09:06:02] (PS1) Filippo Giunchedi: Revert "graphite: check graphite2004" [puppet] - https://gerrit.wikimedia.org/r/904074
[09:09:42] (PS1) Clément Goubert: service_catalog: Add mw-api-int k8s service - 2 [puppet] - https://gerrit.wikimedia.org/r/904060 (https://phabricator.wikimedia.org/T333120)
[09:09:44] (PS1) Clément Goubert: service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120)
[09:09:55] (CR) Hashar: prometheus: add instance label for Gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904058 (owner: Hashar)
[09:10:12] (CR) Filippo Giunchedi: [C: +2] Revert "wmnet: move reads to graphite2004" [dns] - https://gerrit.wikimedia.org/r/904073 (owner: Filippo Giunchedi)
[09:10:16] (PS2) Filippo Giunchedi: Revert "wmnet: move reads to graphite2004" [dns] - https://gerrit.wikimedia.org/r/904073
[09:10:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:10:20] (CR) Filippo Giunchedi: [C: +2] Revert "graphite: check graphite2004" [puppet] - https://gerrit.wikimedia.org/r/904074 (owner: Filippo Giunchedi)
[09:12:31] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (MatthewVernon) I think in these cases, removing the incorrect thumbnail will allow it to be recreated on next GET.
[09:14:45] (03CR) 10Btullis: [C: 03+1] Add role_contacts to buster hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi) [09:15:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [09:15:18] (03PS1) 10Filippo Giunchedi: Revert "wmnet: move writes to graphite2004" [dns] - 10https://gerrit.wikimedia.org/r/904075 [09:15:30] (03PS1) 10Filippo Giunchedi: Revert "Failover statsd to graphite2004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904076 [09:15:42] (03PS1) 10Filippo Giunchedi: Revert "statsd: move writes to graphite2004" [puppet] - 10https://gerrit.wikimedia.org/r/904077 [09:19:08] (03CR) 10Jbond: [C: 03+1] Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [09:19:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1007 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:20:20] hmm...looking [09:21:41] (03CR) 10Jaime Nuche: "Thanks a lot!" 
[puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [09:22:03] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "statsd: move writes to graphite2004" [puppet] - 10https://gerrit.wikimedia.org/r/904077 (owner: 10Filippo Giunchedi) [09:22:24] (03PS3) 10Jbond: setup.py: update dnspython requierments to match spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/903734 [09:22:36] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "wmnet: move writes to graphite2004" [dns] - 10https://gerrit.wikimedia.org/r/904075 (owner: 10Filippo Giunchedi) [09:22:40] (03PS2) 10Filippo Giunchedi: Revert "wmnet: move writes to graphite2004" [dns] - 10https://gerrit.wikimedia.org/r/904075 [09:22:42] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/903734 (owner: 10Jbond) [09:23:08] (03CR) 10Jbond: [C: 03+1] icinga: remove widespread puppet agent alerts [puppet] - 10https://gerrit.wikimedia.org/r/903654 (https://phabricator.wikimedia.org/T288622) (owner: 10Herron) [09:24:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by filippo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904076 (owner: 10Filippo Giunchedi) [09:24:52] <_joe_> jouncebot: nowandnext [09:24:52] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [09:24:53] In 0 hour(s) and 35 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1000) [09:25:42] (03Merged) 10jenkins-bot: Revert "Failover statsd to graphite2004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904076 (owner: 10Filippo Giunchedi) [09:25:47] (03CR) 10Jbond: "-1: missed the lookup keys" [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [09:26:06] !log filippo@deploy2002 Started scap: Backport for [[gerrit:904076|Revert "Failover statsd to graphite2004"]] [09:26:09] (03PS1) 10Elukey: 
cumin: add aliases for Redis Misc pairs [puppet] - https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) [09:27:13] (PS2) Elukey: cumin: add aliases for Redis Misc pairs [puppet] - https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) [09:27:35] !log filippo@deploy2002 filippo: Backport for [[gerrit:904076|Revert "Failover statsd to graphite2004"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [09:29:38] (CR) Effie Mouzeli: [C: +1] "Just some nits, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:30:18] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) Just noticed what has been probably in the radar for @cmooney for some time now: [[https://ne...
[09:30:44] (PS14) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051 [09:31:32] (CR) Clément Goubert: service_catalog: Add mw-api-int k8s service - 1 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:31:51] (PS7) Clément Goubert: service_catalog: Add mw-api-int k8s service - 1 [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) [09:32:03] (PS2) Clément Goubert: service_catalog: Add mw-api-int k8s service - 2 [puppet] - https://gerrit.wikimedia.org/r/904060 (https://phabricator.wikimedia.org/T333120) [09:32:11] (PS2) Clément Goubert: service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) [09:32:44] (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond) [09:32:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:33:40] !log filippo@deploy2002 Finished scap: Backport for [[gerrit:904076|Revert "Failover statsd to graphite2004"]] (duration: 07m 34s) [09:34:02] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40391/console" [puppet] - https://gerrit.wikimedia.org/r/904060 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:34:50] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40392/console" [puppet] -
https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:34:58] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1007 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:35:17] (PS3) Clément Goubert: service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) [09:36:25] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40393/console" [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:37:43] (PS1) JMeybohm: Increase typha replicas in ml-serve and dse [deployment-charts] - https://gerrit.wikimedia.org/r/904064 (https://phabricator.wikimedia.org/T292077) [09:37:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:38:04] (PS4) Clément Goubert: service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) [09:38:45] (CR) Filippo Giunchedi: prometheus: add instance label for Gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904058 (owner: Hashar) [09:39:07] RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:17] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40394/console" [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:40:11] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey) [09:40:20] (CR) Ayounsi: [V: +2 C: +2] Validators: add symlink to netbox-extra [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/889963 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi) [09:40:25] (CR) Hnowlan: [C: +1] cumin: add aliases for Redis Misc pairs [puppet] - https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: Elukey) [09:43:12] (PS1) Clément Goubert: mw-api-int: add geo and metafo records [dns] - https://gerrit.wikimedia.org/r/904065 (https://phabricator.wikimedia.org/T333120) [09:43:43] (PS2) Clément Goubert: mw-api-int: add geo and metafo records [dns] - https://gerrit.wikimedia.org/r/904065 (https://phabricator.wikimedia.org/T333120) [09:44:13] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi) [09:44:36] (CR) Volans: [C: +1] "LGTM, thx" [cookbooks] - https://gerrit.wikimedia.org/r/903734 (owner: Jbond) [09:45:09] PROBLEM - prometheus-esams.wikimedia.org requires authentication on prometheus3001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:45:30] SRE, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (elukey) [09:45:40] (PS1) Clément Goubert: service_catalog: Remove unnecessary anchors [puppet] -
https://gerrit.wikimedia.org/r/904086 [09:47:25] denisse: the prometheus esams alert is you I take it ? please don't forget to !log [09:48:03] (CR) Volans: "Is there any way we could query for those without hardcoding them?" [puppet] - https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: Elukey) [09:48:13] Hi godog, yes. It's me. I'll add to the logs. [09:48:55] denisse: cool! cheers [09:49:16] (CR) Elukey: "Thanks! Left a comment for DSE, once the patch is updated I'll take care of the rollout :)" [deployment-charts] - https://gerrit.wikimedia.org/r/904064 (https://phabricator.wikimedia.org/T292077) (owner: JMeybohm) [09:50:01] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:50:39] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:50:44] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:50:52] (CR) Elukey: cumin: add aliases for Redis Misc pairs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: Elukey) [09:52:07] (CR) Elukey: cumin: add aliases for Redis Misc pairs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: Elukey) [09:53:26] (CR) Jbond: [C: +2] setup.py: update dnspython requierments to match spicerack [cookbooks] - https://gerrit.wikimedia.org/r/903734 (owner: Jbond) [09:54:14] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:54:25] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code netbox to
netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:55:22] (PS1) Jelto: aphlict: force unmask of logrotate service with refreshonly false [puppet] - https://gerrit.wikimedia.org/r/904087 (https://phabricator.wikimedia.org/T332869) [09:55:41] (Merged) jenkins-bot: setup.py: update dnspython requierments to match spicerack [cookbooks] - https://gerrit.wikimedia.org/r/903734 (owner: Jbond) [09:55:53] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:56:25] (PS3) Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) [09:56:36] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40395/console" [puppet] - https://gerrit.wikimedia.org/r/904087 (https://phabricator.wikimedia.org/T332869) (owner: Jelto) [09:57:03] !log Adding mw-api-int to service_catalog in service_setup - T333120 [09:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:08] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [09:57:13] (CR) Clément Goubert: [C: +2] service_catalog: Add mw-api-int k8s service - 1 [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [09:57:29] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:57:46] (PS15) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] -
https://gerrit.wikimedia.org/r/888051 [09:58:04] !log updating prometheus3001 to bullseye [09:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] !log running puppet on O:kubernetes::worker and O:lvs::balancer - T333120 [09:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: a good reason - ayounsi@cumin1001 [09:59:25] (CR) Filippo Giunchedi: Remove EventGate Icinga checks that have been moved to alertmanager (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: Cathal Mooney) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1000) [10:00:06] (CR) EoghanGaffney: [C: +1] aphlict: force unmask of logrotate service with refreshonly false [puppet] - https://gerrit.wikimedia.org/r/904087 (https://phabricator.wikimedia.org/T332869) (owner: Jelto) [10:00:33] RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:19] (CR) Jelto: [V: +1 C: +2] aphlict: force unmask of logrotate service with refreshonly false [puppet] - https://gerrit.wikimedia.org/r/904087 (https://phabricator.wikimedia.org/T332869) (owner: Jelto) [10:01:29] RECOVERY - prometheus-esams.wikimedia.org requires authentication on prometheus3001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 1.332 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:02:22] !log hnowlan@deploy2002 Started deploy [restbase/deploy@c265f3f]: Add ckbwiktionary, anpwiki T332093 T332379 [10:02:29] T332379: Post-creation work for anpwiki - https://phabricator.wikimedia.org/T332379
[10:02:29] T332093: Post-creation work for ckbwiktionary - https://phabricator.wikimedia.org/T332093 [10:03:41] (PS3) Clément Goubert: service_catalog: Add mw-api-int k8s service - 2 [puppet] - https://gerrit.wikimedia.org/r/904060 (https://phabricator.wikimedia.org/T333120) [10:03:55] (PS5) Clément Goubert: service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) [10:08:37] (CR) Hashar: gerrit: set gitiles clone url to http (Gerrit 3.6.2) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: Hashar) [10:10:28] (PS4) Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) [10:10:58] (CR) Effie Mouzeli: [C: +1] P:services_proxy::envoy: Add mw-api-int [puppet] - https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:20:29] (CR) Effie Mouzeli: [C: +1] service_catalog: Add mw-api-int k8s service - 2 [puppet] - https://gerrit.wikimedia.org/r/904060 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:20:56] (CR) Effie Mouzeli: [C: +1] service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:21:52] !log hnowlan@deploy2002 Finished deploy [restbase/deploy@c265f3f]: Add ckbwiktionary, anpwiki T332093 T332379 (duration: 19m 30s) [10:21:59] T332379: Post-creation work for anpwiki - https://phabricator.wikimedia.org/T332379 [10:21:59] T332093: Post-creation work for ckbwiktionary - https://phabricator.wikimedia.org/T332093 [10:22:01] (CR) Effie Mouzeli: [C: +1] "Thank you very much for this!"
[puppet] - https://gerrit.wikimedia.org/r/904086 (owner: Clément Goubert) [10:26:49] (CR) Effie Mouzeli: [C: +1] "one nit, otherwise LGTM" [dns] - https://gerrit.wikimedia.org/r/904065 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:52] possibly stupid question, but is there a reason why group0 isn’t on wmf.2 yet (except testwikis)? I can’t see one in T330208 [10:30:53] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [10:31:16] (I see two scaps moving testwikis to wmf.2, a few hours apart – perhaps the second one was supposed to be group0 but accidentally not? just guessing though) [10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:09] !log Switching mw-api-int to lvs_setup - T333120 [10:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:15] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [10:37:27] (CR) Clément Goubert: [C: +2] service_catalog: Add mw-api-int k8s service - 2 [puppet] - https://gerrit.wikimedia.org/r/904060 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:38:48] (CR) Jbond: [C: +2] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond) [10:39:59] (Abandoned) Hashar: prometheus: add instance label
for Gerrit [puppet] - https://gerrit.wikimedia.org/r/904058 (owner: Hashar) [10:40:26] (PS1) Volans: netbox: add git safe.directory to netbox's src [puppet] - https://gerrit.wikimedia.org/r/904096 [10:40:40] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Jclark-ctr) [10:40:47] (Merged) jenkins-bot: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond) [10:40:49] SRE, ops-eqiad, DC-Ops: Q3:Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (Jclark-ctr) Open→Resolved Pdu's have been configured [10:41:03] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T333120) [10:42:14] (CR) Ayounsi: [C: +1] netbox: add git safe.directory to netbox's src [puppet] - https://gerrit.wikimedia.org/r/904096 (owner: Volans) [10:42:34] (CR) Volans: [C: +2] netbox: add git safe.directory to netbox's src [puppet] - https://gerrit.wikimedia.org/r/904096 (owner: Volans) [10:42:51] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.81:4446]) https://wikitech.wikimedia.org/wiki/PyBal [10:42:59] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T333120) [10:43:02] Me &^ [10:43:05] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [10:43:39] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 76 connections established with conf1007.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [10:44:09] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.81:4446])
https://wikitech.wikimedia.org/wiki/PyBal [10:45:38] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (Lucas_Werkmeister_WMDE) > [ ] **Purpose** (Specify which service you need to get access to, e.g. Icinga, Grafana, Superset etc): I don’t think we... [10:46:23] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120) [10:47:37] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:48:09] (PS5) Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [10:49:01] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 77 connections established with conf1007.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [10:49:05] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:49:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120) [10:49:23] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [10:50:02] (PS7) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [10:50:22] !log START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120) [10:50:22] (apparently it didn´t log itself) [10:50:22] Hmm apparently IRC logging isn´t working either [10:50:22] It logs to SAL but irc echo-ing looks broken [10:50:27] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:57] !log Switching mw-api-int to production - T333120 [10:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:02] (CR) Clément Goubert: [C: +2] service_catalog: Add mw-api-int k8s service - 3 [puppet] - https://gerrit.wikimedia.org/r/904061 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:52:30] !log Running puppet on dns-auth - T333120 [10:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:55:07] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mw-api-int-ro [10:55:08] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mw-api-int,name=codfw [10:55:16] (CR) Clément Goubert: [C: +2] mw-api-int: add geo and metafo records [dns] - https://gerrit.wikimedia.org/r/904065 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:55:43] (PS3) Clément Goubert: mw-api-int: add discovery records [dns] - https://gerrit.wikimedia.org/r/904065 (https://phabricator.wikimedia.org/T333120) [10:55:58] (CR) Clément Goubert: mw-api-int: add discovery records (2 comments) [dns] - https://gerrit.wikimedia.org/r/904065 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [10:57:06] (PS5) Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) [10:57:11] !log Running authdns-update [10:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:24] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Request to be added to the
ldap/wmde group - https://phabricator.wikimedia.org/T333157 (Ladsgroup) That checklist is part of the process for giving the rights: https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_a... [10:58:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:30] !log authdns-update successful on all nodes - T333120 [10:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:35] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [11:00:16] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert) `mw-api-int` and `mw-api-int-ro` services now in production, we can proceed with creating the envoy listeners in https://gerrit.wikimedia.org/r/c/operat...
[11:00:33] (PS5) Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) [11:01:13] (PS2) Ladsgroup: admin: Add Oleksandr Tsyba to ldap [puppet] - https://gerrit.wikimedia.org/r/903691 (https://phabricator.wikimedia.org/T333157) [11:01:24] (CR) Ladsgroup: [V: +2 C: +2] admin: Add Oleksandr Tsyba to ldap [puppet] - https://gerrit.wikimedia.org/r/903691 (https://phabricator.wikimedia.org/T333157) (owner: Ladsgroup) [11:01:38] (PS2) EoghanGaffney: Add aphlict role to new vm host [puppet] - https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) [11:04:46] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert) [11:06:48] (CR) Jelto: [C: +1] "lgtm and safer to start with absent aphlict first" [puppet] - https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney) [11:06:55] (PS8) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [11:06:57] (PS6) Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) [11:07:21] (PS2) David Caro: maintain_dbusers: move out of nfs to services [puppet] - https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) [11:07:23] (PS2) David Caro: maintain_dbusers: Remove unused param and adapt to best practices [puppet] - https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) [11:07:33] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testing GraphQL - jbond@cumin2002" [11:07:40] (CR)
EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40399/console" [puppet] - https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney) [11:08:22] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (Ladsgroup) I need to wait for puppet to propagate and then I do the ldap changes in mwmaint. [11:12:28] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testing GraphQL - jbond@cumin2002" [11:12:53] (CR) Giuseppe Lavagetto: [C: +1] P:services_proxy::envoy: Add mw-api-int [puppet] - https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert) [11:18:46] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@c265f3f] (beta): (no justification provided) [11:18:47] !log jgiannelos@deploy2002 deploy aborted: (no justification provided) (duration: 00m 01s) [11:23:43] (PS1) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:24:05] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:25:16] (PS2) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:27:08] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:28:19] (PS3) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:31:46] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40401/console" [puppet] -
https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:33:29] (PS4) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:34:44] (CR) Clément Goubert: [C: +2] service_catalog: Remove unnecessary anchors [puppet] - https://gerrit.wikimedia.org/r/904086 (owner: Clément Goubert) [11:35:08] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40402/console" [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:36:30] (PS1) Clément Goubert: service_catalog: make mw-on-k8s service page [puppet] - https://gerrit.wikimedia.org/r/904146 [11:39:19] (PS2) Clément Goubert: service_catalog: make mw-on-k8s services page [puppet] - https://gerrit.wikimedia.org/r/904146 [11:39:27] (PS5) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:41:22] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40403/console" [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:41:25] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:47:42] (PS1) Ayounsi: Add policy to export prefixes to k8s nodes [homer/public] - https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) [11:47:52] (PS6) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:48:14] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:48:59] SRE, Infrastructure-Foundations, netops: Automate Netbox additions for new spine/leaf L3 networks.
- https://phabricator.wikimedia.org/T333441 (cmooney) p:Triage→Low [11:49:44] SRE, Infrastructure-Foundations, netops: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney) [11:49:50] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) [11:50:22] (PS7) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:51:23] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:51:26] !log btullis@cumin1001 Added views for new wiki: kcgwiki T305280 [11:51:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:51:33] T305280: Prepare and check storage layer for kcgwiki - https://phabricator.wikimedia.org/T305280 [11:52:17] (PS2) JMeybohm: Increase typha replicas in ml-serve and dse [deployment-charts] - https://gerrit.wikimedia.org/r/904064 (https://phabricator.wikimedia.org/T292077) [11:52:21] (CR) CI reject: [V: -1] C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede) [11:52:43] (CR) JMeybohm: Increase typha replicas in ml-serve and dse (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/904064 (https://phabricator.wikimedia.org/T292077) (owner: JMeybohm) [11:53:25] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:53:27] !log btullis@cumin1001 Added views for new wiki: guwwiki T303761 [11:53:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:53:33] T303761: Prepare and check storage layer for guwwiki - https://phabricator.wikimedia.org/T303761 [11:54:06] (PS8) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [11:54:21] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[11:54:24] !log btullis@cumin1001 Added views for new wiki: guwwiktionary T309056 [11:54:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:54:30] T309056: Prepare and check storage layer for guwwiktionary - https://phabricator.wikimedia.org/T309056 [11:55:13] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:55:15] !log btullis@cumin1001 Added views for new wiki: shnwikivoyage T302798 [11:55:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:55:21] T302798: Prepare and check storage layer for shnwikivoyage - https://phabricator.wikimedia.org/T302798 [11:57:04] (CR) JMeybohm: [V: +1 C: +2] k8s: Remove deprecated topology annotaions [puppet] - https://gerrit.wikimedia.org/r/904050 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm) [11:57:41] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [11:59:07] (PS9) Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - https://gerrit.wikimedia.org/r/904102 [12:00:03] Hi, we have a patch to deploy for restbase. Can i do it now outside of our deployment window ?
[12:00:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40405/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [12:05:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40406/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [12:09:42] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Add aphlict role to new vm host [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:11:03] (03PS3) 10David Caro: maintain_dbusers: move out of nfs to services [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) [12:11:05] (03CR) 10David Caro: maintain_dbusers: move out of nfs to services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:11:07] (03PS3) 10David Caro: maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) [12:13:44] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:14:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40408/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [12:17:18] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:18:48] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files 
[cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [12:18:57] (03CR) 10Slyngshede: [V: 03+1] "Adding joe back as reviewer as this would touch the mw* httpd servers." [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [12:19:14] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Lucas_Werkmeister_WMDE) Okay, thanks! It sounded like the task was blocked on that checkmark, maybe that was a misunderstanding. [12:19:39] (03PS7) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) [12:19:54] (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [12:20:11] (03PS4) 10David Caro: maintain_dbusers: move out of nfs to services [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) [12:20:13] (03PS4) 10David Caro: maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) [12:22:03] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40409/console" [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:22:32] !log btullis@cumin1001 Added views for new wiki: gurwiki T327841 [12:22:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [12:22:38] T327841: Prepare and check storage layer for gurwiki - https://phabricator.wikimedia.org/T327841 [12:22:38] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 
10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:27:01] (03PS1) 10David Caro: wmcs::nfs::primary: remove unused mysql_variances hiera [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) [12:28:01] (03PS2) 10David Caro: wmcs::nfs::primary: remove unused mysql_variances hiera [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) [12:28:03] (03CR) 10David Caro: wmcs::nfs::primary: remove unused mysql_variances hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:28:16] (03CR) 10David Caro: wmcs::nfs::primary: remove unused mysql_variances hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:30:25] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:30:27] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:31:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:31:25] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:31:53] (03CR) 10Hashar: "I forgot to post my comment :)" [puppet] - 10https://gerrit.wikimedia.org/r/904058 (owner: 10Hashar) [12:38:05] (03PS9) 10Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [12:38:14] (03PS5) 10David Caro: maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 
(https://phabricator.wikimedia.org/T303663) [12:38:16] (03PS3) 10David Caro: wmcs::nfs::primary: remove unused mysql_variances hiera [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) [12:39:29] (03PS1) 10Ayounsi: Remove labs/cloud-support1-b-codfw [puppet] - 10https://gerrit.wikimedia.org/r/904167 (https://phabricator.wikimedia.org/T327930) [12:39:31] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40410/console" [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [12:40:57] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (10hnowlan) [12:41:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10ayounsi) a:03ayounsi [12:42:24] (03CR) 10Volans: cumin: add aliases for Redis Misc pairs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: 10Elukey) [12:42:34] (03PS1) 10Hnowlan: Thumbor: use emptyDir for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T328033) [12:43:14] (03PS2) 10Hnowlan: Thumbor: use emptyDir for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) [12:43:53] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [12:45:57] (03Abandoned) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 (https://phabricator.wikimedia.org/T313444) (owner: 10David 
Caro) [12:46:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10ayounsi) Removed from Netbox, last step is the above Puppet change ready for reviews. [12:46:36] (03PS2) 10David Caro: cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) [12:46:57] (03CR) 10CI reject: [V: 04-1] cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:48:10] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10netops: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (10ayounsi) [12:49:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10netops: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (10ayounsi) 05Open→03Resolved Closing this task as the short term goals are done, medium terms have their own task. 
[12:49:08] (03CR) 10Clément Goubert: [C: 03+2] P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [12:50:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [12:52:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove labs/cloud-support1-b-codfw [puppet] - 10https://gerrit.wikimedia.org/r/904167 (https://phabricator.wikimedia.org/T327930) (owner: 10Ayounsi) [12:52:21] (03PS1) 10Lucas Werkmeister (WMDE): SpecialRecentChangesLinked: Use SelectQueryBuilder directly [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/904129 (https://phabricator.wikimedia.org/T333339) [12:52:35] (03PS1) 10Lucas Werkmeister (WMDE): SpecialRecentChangesLinked: Use SelectQueryBuilder directly [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/904130 (https://phabricator.wikimedia.org/T333339) [12:53:37] fyi, I might not be available during the beginning of the backport window, but hopefully later at least [12:53:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] "small nitpick, but LGTM otherwise." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [12:54:22] (03PS3) 10David Caro: cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) [12:55:10] !log test enabling lldp on pfw3-codfw [12:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] (03PS3) 10Hashar: wm-zuul-status: fix items having no build [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/902718 (https://phabricator.wikimedia.org/T214068) [12:57:30] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:57:33] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:58:20] (03PS1) 10Jelto: releases: rename new blackbox check for jenkins login page [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) [12:58:24] (03PS10) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [12:59:52] 10SRE, 10Infrastructure-Foundations, 10netops: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (10ayounsi) Enabled it on pfw3-codfw, and removed the exception on fasw-c-codfw and it's working as expected: ` pfw3-codfw# run show lldp neighbors Local Interface Parent Int... [13:00:02] 10SRE, 10Infrastructure-Foundations, 10netops: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (10ayounsi) a:03ayounsi [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1300). Please do the needful. 
[13:00:05] MatmaRex, Arlolra, AaronSchulz, koi, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] (03CR) 10Herron: [C: 03+2] icinga: remove widespread puppet agent alerts [puppet] - 10https://gerrit.wikimedia.org/r/903654 (https://phabricator.wikimedia.org/T288622) (owner: 10Herron) [13:00:14] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40412/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [13:00:43] hi [13:00:50] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40413/console" [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [13:01:28] !log test enabling lldp on mr1-ulsfo [13:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:15] MatmaRex: I’m a bit confused by the task for the change you brought… is this about the Visual/Wikitext on diff pages? 
[13:03:25] (if yes, then the “history” word is my main source of confusion ^^) [13:03:33] (03CR) 10Jelto: [V: 03+1] "Puppet on releases1002 fails with the new blackbox check from I9a6b093efbe406c7a1d76b570e5465d73172a4ed, so I opened this change to remove" [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [13:03:37] the Visual/Wikitext *toggle, sorry [13:03:46] yes [13:03:54] hmm, it's a bit poorly phrased [13:04:00] alright, looks like my other meeting was canceled and I can deploy :) [13:04:13] we call them "historical visual diffs" in the code, as opposed to visual diffs in the visual editor [13:04:48] but then the previous config patches on this task used "history page visual diffs", so i copied the same phrase [13:05:38] I see [13:05:55] and then the other part is just the usual “I didn’t realize this beta feature I’ve been using for years wasn’t the default yet” ;) [13:06:03] (see also, reference previews, I think? ^^) [13:06:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40414/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [13:06:14] heh [13:06:32] oof, so many UploadChunkFile errors in logspam-watch :/ [13:06:46] (03PS2) 10Lucas Werkmeister (WMDE): Enable history page visual diffs on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903780 (https://phabricator.wikimedia.org/T314588) (owner: 10Bartosz Dziewoński) [13:07:22] let’s not split this change and just do everything everywhere all at once ;) [13:07:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40415/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [13:07:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/903780 (https://phabricator.wikimedia.org/T314588) (owner: 10Bartosz Dziewoński) [13:08:23] (03Merged) 10jenkins-bot: Enable history page visual diffs on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903780 (https://phabricator.wikimedia.org/T314588) (owner: 10Bartosz Dziewoński) [13:08:44] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:903780|Enable history page visual diffs on remaining wikis (T314588)]] [13:08:50] T314588: Launch visual diffs on history pages out of beta and provide it to all users - https://phabricator.wikimedia.org/T314588 [13:09:29] (03CR) 10JMeybohm: Thumbor: use emptyDir for /tmp (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [13:09:59] !log dcausse@deploy2002 Started deploy [airflow-dags/search@92e9876]: (no justification provided) [13:10:09] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:903780|Enable history page visual diffs on remaining wikis (T314588)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:10:14] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@92e9876]: (no justification provided) (duration: 00m 14s) [13:10:38] seems to work for me [13:10:45] (in a private window, on a random enwiki diff) [13:11:30] Lucas_WMDE: yep, looks good [13:11:34] ok, thanks [13:11:38] syncing [13:11:57] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:11:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:14:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40416/console" [puppet] - 
10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [13:14:31] (03PS4) 10David Caro: maintain-dbusers: fix systemd service description [puppet] - 10https://gerrit.wikimedia.org/r/895814 (https://phabricator.wikimedia.org/T303663) [13:16:11] arlolra: I’m looking into your change; I can see that the VisualEditor change was merged before the wmf.1 cut (good), but I’m not sure how I can check that the parsoid services change is also already deployed [13:16:15] (03PS5) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [13:16:38] Lucas_WMDE: https://www.mediawiki.org/wiki/Parsoid/Deployments [13:17:08] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:903780|Enable history page visual diffs on remaining wikis (T314588)]] (duration: 08m 23s) [13:17:14] T314588: Launch visual diffs on history pages out of beta and provide it to all users - https://phabricator.wikimedia.org/T314588 [13:17:21] thanks, that looks good [13:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:03] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
And good reminder of all the files to touch when adding new rack subnets :)" [puppet] - 10https://gerrit.wikimedia.org/r/904167 (https://phabricator.wikimedia.org/T327930) (owner: 10Ayounsi) [13:18:05] (03PS3) 10Lucas Werkmeister (WMDE): Enabled native gallery editing in Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) (owner: 10Arlolra) [13:18:32] thanks Lucas_WMDE [13:18:38] np [13:18:48] ok, scaap warns about the 891889 dependency, but that should be fine [13:18:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) (owner: 10Arlolra) [13:18:59] *scap [13:19:10] (“scaap” feels like it means “sheep” in Dutch or something) [13:19:34] (03Merged) 10jenkins-bot: Enabled native gallery editing in Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) (owner: 10Arlolra) [13:19:57] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:889257|Enabled native gallery editing in Parsoid (T329662)]] [13:20:03] T329662: Edited gallery captions are ignored unless gallery's data-mw is dropped - https://phabricator.wikimedia.org/T329662 [13:21:19] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and arlolra: Backport for [[gerrit:889257|Enabled native gallery editing in Parsoid (T329662)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:22:18] arlolra: can you test the change on mwdebug? 
[13:22:24] ok, one sec [13:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:24] sure [13:23:50] (03PS4) 10David Caro: cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) [13:23:52] (03PS1) 10Ayounsi: Remove option to disable vcp_snmp_statistics [homer/public] - 10https://gerrit.wikimedia.org/r/904177 [13:24:42] Lucas_WMDE: you can proceed [13:24:46] ok thanks [13:24:50] syncing [13:26:32] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 127 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:27:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @aborrero cloudcontrol2004-dev is in a public VLAN that is what we didn't relocate it in B1. Bu... 
[13:28:30] (03PS1) 10Ilias Sarantopoulos: ml-services: create FastAPI app ofr ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/904178 (https://phabricator.wikimedia.org/T330414) [13:28:47] (03CR) 10CDanis: add tunnelencabulator (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [13:28:53] (03CR) 10Filippo Giunchedi: prometheus: add instance label for Gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904058 (owner: 10Hashar) [13:28:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:29:17] !log disable puppet on A:lvs to test Python 2 deprecation change: T321309 [13:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:23] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [13:30:17] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:889257|Enabled native gallery editing in Parsoid (T329662)]] (duration: 10m 19s) [13:30:22] T329662: Edited gallery captions are ignored unless gallery's data-mw is dropped - https://phabricator.wikimedia.org/T329662 [13:30:24] (03CR) 10Ssingh: [C: 03+2] Set profile::base::remove_python2_on_bullseye for the LVSes [puppet] - 10https://gerrit.wikimedia.org/r/902666 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff) [13:30:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:30:41] (03PS1) 10JMeybohm: k8s: Remove unused token hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/904179 (https://phabricator.wikimedia.org/T328291) 
[13:30:43] !log enable vcp-snmp-statistics on fasw-c-codfw [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:11] (03CR) 10Lucas Werkmeister (WMDE): "Big change… I looked through it a bit, but I’m mainly relying on Krinkle’s +1 for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [13:31:14] (03PS11) 10Lucas Werkmeister (WMDE): Add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [13:31:32] AaronSchulz: are you around for the deployment? [13:31:49] koi: also around? [13:31:57] yep [13:32:42] hm, namespaceDupes.php gurwiki without --fix says there’s nothing to do [13:34:16] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes.php gurwiki --fix # T332241 – 0 pages to fix (0 resolvable), 0 links to fix (0 resolvable, 0 deleted) [13:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:22] T332241: fix Category namespace on gurwiki - https://phabricator.wikimedia.org/T332241 [13:34:56] I think Dcljr is right and namespaceDupes can’t fix this one :/ [13:35:10] (03PS2) 10Bartosz Dziewoński: Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 (https://phabricator.wikimedia.org/T333448) [13:35:17] (03CR) 10CI reject: [V: 04-1] Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 (https://phabricator.wikimedia.org/T333448) (owner: 10Bartosz Dziewoński) [13:35:18] Lucas_WMDE: which kind of issue are you trying to fix? 
[13:35:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:37] IIUC, the issue is that https://gur.wikipedia.org/wiki/Zure:PrefixIndex?prefix=&namespace=0 lists “Buuri buuri :Budaa” [13:35:45] where “Buuri buuri” would be a namespace name [13:35:55] and the page was created before the namespace was correctly translated [13:36:00] but now the space is messing up the page [13:36:04] would cleanupTitles.php do it? [13:36:11] * Lucas_WMDE doesn’t know that script [13:36:12] looking [13:36:34] according to https://gur.wikipedia.org/w/index.php?title=Buuri_buuri:Budaa&action=info, the database has correct info there [13:36:36] I guess I can try a --dry-run first [13:36:51] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@0d2f12f]: (no justification provided) [13:36:53] “DRY RUN: would rename 734 (0,'Buuri_buuri_:Budaa') to (14,'Budaa')” [13:36:58] and also “DRY RUN: would rename 742 (0,'Buuri_Buuri:_Climate_change') to (14,'Climate_change')” [13:37:02] let’s do that then [13:37:21] !log enable puppet on A:lvs to test Python 2 deprecation change: T321309 [13:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:26] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [13:37:37] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript cleanupTitles.php gurwiki # T332241 (2 of 767 rows updated) [13:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:43] thanks MatmaRex and urbanecm! [13:37:54] that might be it! 
[13:38:02] looks good now [13:38:04] thanks MatmaRex :) [13:38:05] yup, now I see Budaa in https://gur.wikipedia.org/wiki/Zure:PrefixIndex?prefix=&namespace=14 [13:38:05] thanks all of you :) [13:38:05] as i understand it, cleanupTitles.php is a more generic version of the same fix [13:38:09] yay [13:38:20] I’ll put the script output in the task [13:38:20] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:38:23] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:38:29] (still waiting for AaronSchulz for that profiling change btw) [13:38:37] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:38:40] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:39:41] (03PS1) 10Ayounsi: Enable LLDP on management routers [homer/public] - 10https://gerrit.wikimedia.org/r/904180 (https://phabricator.wikimedia.org/T320229) [13:40:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/904129 (https://phabricator.wikimedia.org/T333339) (owner: 10Lucas Werkmeister (WMDE)) [13:40:12] I’ll continue with my own backport in the meantime [13:40:22] wmf.1 first, because the wmf.2 version won’t really be testable [13:40:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service_catalog: make mw-on-k8s services page [puppet] - 10https://gerrit.wikimedia.org/r/904146 (owner: 10Clément Goubert) [13:40:31] but after wmf.1 there should be a visible decrease in logstash messages :) [13:41:01] (03CR) 10Clément Goubert: [C: 03+2] service_catalog: make mw-on-k8s services page [puppet] - 10https://gerrit.wikimedia.org/r/904146 (owner: 10Clément Goubert) [13:41:03] Amir1: 
any opinion on backporting those ->query() regex fixes to REL1_40? I’d say we should do it, since the warning also landed in REL1_40 iiuc [13:41:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2002.codfw.wmnet with reason: stop kafka, dist-upgrade [13:41:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2002.codfw.wmnet with reason: stop kafka, dist-upgrade [13:41:50] (03PS1) 10JMeybohm: k8s: Remove references to unused token hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) [13:42:43] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:42:56] !log run dist-upgrade on kafka-main2002 to upgrade it to bullseye - T332013 [13:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:01] T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 [13:43:04] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (10ayounsi) FYI, it's still needed to disable LLDP on switch interfaces facing the management routers. 
[13:43:28] (03PS3) 10Hnowlan: Thumbor: use emptyDir for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) [13:44:11] (03CR) 10Hnowlan: "I've disabled the limit via helmfile.d while testing this feature" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [13:46:27] (03PS1) 10Filippo Giunchedi: profile: remove edac alerts [puppet] - 10https://gerrit.wikimedia.org/r/904182 (https://phabricator.wikimedia.org/T294564) [13:46:29] (03PS1) 10Andrew Bogott: Trove: increase formatting timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/904183 [13:46:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:49:42] (03CR) 10Andrew Bogott: [C: 03+2] Trove: increase formatting timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/904183 (owner: 10Andrew Bogott) [13:49:51] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:50:20] (03CR) 10Hnowlan: [C: 03+2] Thumbor: use emptyDir for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [13:50:28] (03PS1) 10Filippo Giunchedi: wmnet: prep statsd/graphite records for easier write failover [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) [13:50:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "already +2ing, to give the gate-and-submit a head start" [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/904130 (https://phabricator.wikimedia.org/T333339) (owner: 10Lucas Werkmeister (WMDE)) [13:51:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:21] (03CR) 10CI reject: [V: 04-1] wmnet: prep statsd/graphite records for easier write failover [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [13:54:12] Lucas_WMDE: sounds good to 
me, backport to a not-yet-released branch is cheap [13:54:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:54:20] ok, I can do it later [13:54:21] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (10akosiaris) [13:54:48] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:54:50] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@0d2f12f]: (no justification provided) (duration: 17m 59s) [13:55:05] (03Merged) 10jenkins-bot: Thumbor: use emptyDir for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/904168 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [13:55:19] (03PS1) 10Filippo Giunchedi: profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) [13:55:33] (03PS2) 10Ayounsi: Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) [13:56:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:25] (03Merged) 10jenkins-bot: SpecialRecentChangesLinked: Use SelectQueryBuilder directly [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/904129 (https://phabricator.wikimedia.org/T333339) (owner: 10Lucas Werkmeister (WMDE)) [13:56:48] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:904129|SpecialRecentChangesLinked: Use SelectQueryBuilder directly (T333339)]] [13:56:53] T333339: Warning: SQLPlatform::isWriteQuery fallback to regex (from SpecialRecentChangesLinked) -
https://phabricator.wikimedia.org/T333339 [13:57:06] (03PS2) 10Ayounsi: Remove labs/cloud-support1-b-codfw [puppet] - 10https://gerrit.wikimedia.org/r/904167 (https://phabricator.wikimedia.org/T327930) [13:57:20] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) [13:57:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:58:07] (03CR) 10Filippo Giunchedi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [13:58:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:904129|SpecialRecentChangesLinked: Use SelectQueryBuilder directly (T333339)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:58:12] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:58:28] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:59:20] seems to be working, syncing [13:59:23] (03CR) 10Ayounsi: [C: 03+2] Remove labs/cloud-support1-b-codfw [puppet] - 10https://gerrit.wikimedia.org/r/904167 (https://phabricator.wikimedia.org/T327930) (owner: 10Ayounsi) [14:00:02] !log merge/deploy change in Puppet's modules/network/data/data.yaml - T327930 [14:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:14] jouncebot: now [14:00:14] No deployments scheduled for the next 2 hour(s) and 59 minute(s) [14:00:15] T327930: Is Vlan 2122 cloud-support1-b-codfw required? 
- https://phabricator.wikimedia.org/T327930 [14:00:21] backport window is overrunning a bit, shouldn’t be more than 10 minutes [14:00:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [14:00:53] (03CR) 10Lucas Werkmeister (WMDE): Add per-action component-level profiling in statsd using excimer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [14:03:56] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) [14:04:00] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) 05Open→03Resolved I prefer to have something there to satisfy the gods of bureaucracy but it's not mandatory for wmde/wmf groups for new hires. Anyway,... [14:04:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:04:50] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:904129|SpecialRecentChangesLinked: Use SelectQueryBuilder directly (T333339)]] (duration: 08m 02s) [14:04:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [14:04:56] T333339: Warning: SQLPlatform::isWriteQuery fallback to regex (from SpecialRecentChangesLinked) - https://phabricator.wikimedia.org/T333339 [14:05:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] 
(wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/904130 (https://phabricator.wikimedia.org/T333339) (owner: 10Lucas Werkmeister (WMDE)) [14:05:23] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [14:05:27] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [14:05:37] (03PS1) 10DCausse: rdf-streaming-updater: use newer bootstrap state [deployment-charts] - 10https://gerrit.wikimedia.org/r/904188 (https://phabricator.wikimedia.org/T328675) [14:05:41] (03CR) 10Bking: [C: 03+2] flink-app: add envoy configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/903740 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:06:34] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: use newer bootstrap state [deployment-charts] - 10https://gerrit.wikimedia.org/r/904188 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:06:57] (03Merged) 10jenkins-bot: SpecialRecentChangesLinked: Use SelectQueryBuilder directly [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/904130 (https://phabricator.wikimedia.org/T333339) (owner: 10Lucas Werkmeister (WMDE)) [14:07:14] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:904130|SpecialRecentChangesLinked: Use SelectQueryBuilder directly (T333339)]] [14:07:47] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (10akosiaris) I 've updated a bit the Thumbor dashboard. Aside from some performance changes (e.g. collapsing most rows by default) the main diff is adding 2 v... 
[14:07:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [14:08:16] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:08:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [14:08:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:904130|SpecialRecentChangesLinked: Use SelectQueryBuilder directly (T333339)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:10:16] (03Merged) 10jenkins-bot: flink-app: add envoy configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/903740 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:11:33] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:16] (03Merged) 10jenkins-bot: rdf-streaming-updater: use newer bootstrap state [deployment-charts] - 10https://gerrit.wikimedia.org/r/904188 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:13:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] cumin: add aliases for Redis Misc pairs [puppet] - 10https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: 10Elukey) [14:14:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [14:14:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:904130|SpecialRecentChangesLinked: Use SelectQueryBuilder directly (T333339)]] (duration: 07m 30s) [14:14:50] T333339: Warning: SQLPlatform::isWriteQuery fallback to regex (from SpecialRecentChangesLinked) - https://phabricator.wikimedia.org/T333339 [14:15:25] !log UTC afternoon backport+config window done [14:15:27] 10SRE, 
10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10ayounsi) 05Open→03Resolved For the record, Netbox changes {F36932140} [14:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:44] PROBLEM - Check systemd state on kubernetes1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:46] still curious why group0 isn’t on wmf.2 yet btw :) [14:16:08] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:37] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10aborrero) [14:18:25] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:19:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:19:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:19:45] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:20:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:20:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:21:06] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:31] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Proposal: create a framework to build containerized incident management protects - https://phabricator.wikimedia.org/T265153 (10jbond) [14:24:47] 10SRE, 10DNS, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10Ladsgroup) Is there anything left to do on SRE side? Otherwise we should remove the SRE tag. [14:25:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:45] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:28] (03CR) 10Elukey: [C: 03+2] cumin: add aliases for Redis Misc pairs [puppet] - 10https://gerrit.wikimedia.org/r/904062 (https://phabricator.wikimedia.org/T332598) (owner: 10Elukey) [14:30:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [14:30:38] PROBLEM - Check systemd state on kubernetes2022 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:33:18] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:34:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [14:36:07] (03PS1) 10Ayounsi: Remove custom BGP graceful-shutdown [homer/public] - 10https://gerrit.wikimedia.org/r/904192 (https://phabricator.wikimedia.org/T320230) [14:36:28] RECOVERY - Check systemd state on kubernetes1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:40:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:48] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:15] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install 
db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) @Volans ` 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'. 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successf... [14:42:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond) FYI i have created a new implementation which produces the following data {P45977} > And this change is a good opportunity (while being... [14:42:52] 10SRE, 10Infrastructure-Foundations, 10netops: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (10cmooney) [14:43:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (10ayounsi) Upgrade doc updated: https://wikitech.wikimedia.org/w/index.php?title=Juniper_router_upgrade&diff=2064827&oldid=2016903 Receiver i... [14:44:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:44:40] (03CR) 10Ayounsi: [C: 03+2] "Self merging as no impact expected." 
[homer/public] - 10https://gerrit.wikimedia.org/r/904192 (https://phabricator.wikimedia.org/T320230) (owner: 10Ayounsi) [14:45:18] RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [14:45:33] (03Merged) 10jenkins-bot: Remove custom BGP graceful-shutdown [homer/public] - 10https://gerrit.wikimedia.org/r/904192 (https://phabricator.wikimedia.org/T320230) (owner: 10Ayounsi) [14:46:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) @Jhancock.wm i fixed the Foreign drive issue, you can go ahead and update the firmware on the server. Let me know if you have any questions. [14:47:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:49:00] !log Remove custom BGP graceful-shutdown on all core routers - T320230 [14:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] T320230: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 [14:49:12] Hi, does anyone know, how to figure out if we are already running php8 in production? [14:49:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @Papaul we're gonna reimage this one onto new vlans (will happen to all the public vlan ones i... 
[14:52:29] (03PS1) 10Ayounsi: Rename bgp disable flag to shutdown [homer/public] - 10https://gerrit.wikimedia.org/r/904193 (https://phabricator.wikimedia.org/T320230) [14:53:47] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:53:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:54:33] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [14:57:25] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephmon2005-dev [14:57:29] (03CR) 10Ayounsi: [C: 03+2] "tested locally." [homer/public] - 10https://gerrit.wikimedia.org/r/904193 (https://phabricator.wikimedia.org/T320230) (owner: 10Ayounsi) [14:57:38] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon2005-dev [14:58:05] (03Merged) 10jenkins-bot: Rename bgp disable flag to shutdown [homer/public] - 10https://gerrit.wikimedia.org/r/904193 (https://phabricator.wikimedia.org/T320230) (owner: 10Ayounsi) [14:58:59] !log cmooney@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephmon2005-dev [14:59:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon2005-dev [15:00:45] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@4a7a6cc]: prefix hive properties with spark.hive. [15:00:59] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@4a7a6cc]: prefix hive properties with spark.hive. 
(duration: 00m 13s) [15:03:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] deployment-prep: update prometheus host to prometheus05 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) (owner: 10Samtar) [15:03:39] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:03:42] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:04:10] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:05:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10thcipriani) [15:05:09] (03PS5) 10Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) [15:05:36] (03CR) 10Samtar: deployment-prep: update prometheus host to prometheus05 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) (owner: 10Samtar) [15:05:37] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10thcipriani) [15:05:57] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) We had a conversation about this today. Conclusions: * we will migrate the remaining of cloudvirts to single NIC, so the se... 
[15:06:15] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2001.codfw.wmnet with reason: Stop kafka, dist-upgrade [15:06:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2001.codfw.wmnet with reason: Stop kafka, dist-upgrade [15:07:11] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:09:25] (03PS1) 10Ayounsi: Remove decom prefixes from DNS [dns] - 10https://gerrit.wikimedia.org/r/904198 (https://phabricator.wikimedia.org/T327930) [15:09:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "New instance is SSHable, and has three prometheuses listening according to `sudo lsof -iTCP -sTCP:LISTEN -n -P`, so looks good as far as I" [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) (owner: 10Samtar) [15:10:43] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/904198 (https://phabricator.wikimedia.org/T327930) (owner: 10Ayounsi) [15:10:52] (03CR) 10Ayounsi: [C: 03+2] Remove decom prefixes from DNS [dns] - 10https://gerrit.wikimedia.org/r/904198 (https://phabricator.wikimedia.org/T327930) (owner: 10Ayounsi) [15:11:08] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:18] (03PS2) 10Filippo Giunchedi: wmnet: prep statsd/graphite records for easier write failover [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) [15:13:55] (03CR) 10Jbond: [C: 03+1] profile: remove edac alerts [puppet] - 10https://gerrit.wikimedia.org/r/904182 (https://phabricator.wikimedia.org/T294564) (owner: 10Filippo Giunchedi) [15:14:14] 10SRE, 10Infrastructure-Foundations, 10netops: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (10ayounsi) 05Open→03Resolved a:03ayounsi [15:14:20] (03PS3) 10Jbond: 
sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [15:15:36] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:15:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40417/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:16:25] (03CR) 10Clément Goubert: [C: 03+1] wmnet: prep statsd/graphite records for easier write failover [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [15:16:31] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [15:16:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:16:57] (03CR) 10Clément Goubert: [C: 03+1] profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [15:17:10] (03CR) 10Jbond: [C: 03+1] maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:17:25] (03CR) 10Jbond: [C: 03+1] wmcs::nfs::primary: remove unused mysql_variances hiera [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:17:35] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove edac alerts 
[puppet] - 10https://gerrit.wikimedia.org/r/904182 (https://phabricator.wikimedia.org/T294564) (owner: 10Filippo Giunchedi) [15:19:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:19:39] (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:21:50] (03PS6) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [15:21:57] Traffic is reimaging LVS hosts to bullseye today; BGP alerts in ulsfo expected [15:22:15] keeping an eye out if something else is broken [15:22:19] rather, will be [15:22:49] sukhe: I wonder if that's gonna trigger some ENI confusion attack [15:23:00] (03CR) 10BCornwall: [C: 03+2] pybal: Add runbook link to alert [alerts] - 10https://gerrit.wikimedia.org/r/903777 (https://phabricator.wikimedia.org/T310933) (owner: 10BCornwall) [15:23:13] ENI... PNI! 
[15:23:18] predictable network interface [15:23:44] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:23:45] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:23:53] (03CR) 10Elukey: [C: 03+2] Increase typha replicas in ml-serve and dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/904064 (https://phabricator.wikimedia.org/T292077) (owner: 10JMeybohm) [15:24:00] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:24:00] vgutierrez: yeah possibly! [15:24:02] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:24:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:25:02] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:08] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:25:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:11] ^ expected [15:25:42] PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:25:55] also expected 
[15:26:15] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:27:13] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:27:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bullseye [15:27:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye [15:27:36] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:27:44] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:28:01] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:28:33] (03CR) 10Ayounsi: [C: 03+2] "tested in ulsfo." [homer/public] - 10https://gerrit.wikimedia.org/r/904180 (https://phabricator.wikimedia.org/T320229) (owner: 10Ayounsi) [15:29:02] !log elukey@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:29:07] (03Merged) 10jenkins-bot: Enable LLDP on management routers [homer/public] - 10https://gerrit.wikimedia.org/r/904180 (https://phabricator.wikimedia.org/T320229) (owner: 10Ayounsi) [15:29:11] !log elukey@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[15:35:55] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10elukey) [15:35:57] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40418/console" [puppet] - 10https://gerrit.wikimedia.org/r/899607 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:36:37] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40419/console" [puppet] - 10https://gerrit.wikimedia.org/r/899607 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:37:38] (03PS1) 10Ssingh: hiera: update interface names for lvs4010 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/904200 (https://phabricator.wikimedia.org/T321309) [15:37:46] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40420/console" [puppet] - 10https://gerrit.wikimedia.org/r/899607 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:38:56] (03CR) 10Ssingh: [C: 03+2] hiera: update interface names for lvs4010 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/904200 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:43:16] (03PS1) 10Volans: sre.hosts.provision: handle the case of no NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/904201 [15:43:41] (03PS2) 10Volans: sre.hosts.provision: handle the case of no NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/904201 (https://phabricator.wikimedia.org/T326661) [15:44:13] (03CR) 10Dzahn: "thanks for this! we need to pass the service name parameter though if it doesn't happen to match the $title. 
let me amend" [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [15:44:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [15:46:14] (03CR) 10Jbond: [C: 03+1] "thanks, see inline for comments" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [15:47:12] (03PS2) 10Dzahn: releases: rename new blackbox check for jenkins login page [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [15:47:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/904201 (https://phabricator.wikimedia.org/T326661) (owner: 10Volans) [15:47:27] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [15:47:38] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: handle the case of no NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/904201 (https://phabricator.wikimedia.org/T326661) (owner: 10Volans) [15:48:44] (03CR) 10Dzahn: [C: 03+2] "the checks might be actually duplicate, but let's fix it first" [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [15:49:59] (03PS7) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [15:50:03] (03Merged) 10jenkins-bot: sre.hosts.provision: handle the case of no NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/904201 (https://phabricator.wikimedia.org/T326661) (owner: 10Volans) [15:50:09] (03CR) 10Cathal Mooney: [C: 03+2] Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [15:50:27] !log hnowlan@deploy2002 helmfile [codfw] START 
helmfile.d/services/thumbor: apply [15:50:31] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:51:06] !log btullis@cumin1001 Added views for new wiki: gucwiki T326235 [15:51:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:51:11] T326235: Prepare and check storage layer for gucwiki - https://phabricator.wikimedia.org/T326235 [15:51:32] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [15:52:10] (03Merged) 10jenkins-bot: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [15:52:15] 10SRE, 10Infrastructure-Foundations, 10netops: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (10ayounsi) 05Open→03Resolved LLDP is now enabled on all the SRXs. > FYI, it's still needed to disable LLDP on switch interfaces facing the management routers. To expand on thi... 
[15:54:26] (03CR) 10Dzahn: [C: 03+2] "puppet fixed on releases1002" [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [15:54:35] (03PS2) 10Cathal Mooney: Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) [15:54:39] (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:54:46] (03CR) 10CI reject: [V: 04-1] Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [15:55:07] (03CR) 10Dzahn: gerrit: set gitiles clone url to http (Gerrit 3.6.2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [15:55:09] (03CR) 10Dzahn: [C: 03+2] gerrit: set gitiles clone url to http (Gerrit 3.6.2) [puppet] - 10https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [15:58:22] hmm, mediawiki.org is not on wmf.2 yet? [15:58:49] 10SRE, 10Data-Engineering, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10BTullis) Circling around to this old problem, if indeed it's still a problem. From what I can see, although the hosts have all been refreshed since the last entry on this ticke... 
[15:58:53] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:58:59] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10FJoseph-WMF) Approved [15:59:31] MatmaRex: train is at group0 right now [15:59:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:17] dancy: mediawiki.org is on group0 though [16:00:24] (03PS1) 10Ssingh: pybal: install pybal from component in bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904203 (https://phabricator.wikimedia.org/T321309) [16:00:31] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:00:33] hmm. lemme check things [16:00:39] isn't it? [16:01:27] I was wrong.. Train is only at testwikis. @dduvall: Did something block the train yesterday? [16:01:45] :/ [16:01:48] not intentionally [16:01:52] * dduvall checks [16:02:44] strange. i ran stage-train as usual [16:02:53] stage-train only rolls out to testwikis [16:03:05] (03CR) 10Jelto: releases: rename new blackbox check for jenkins login page (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904173 (https://phabricator.wikimedia.org/T331901) (owner: 10Jelto) [16:03:08] You wanted `scap deploy-promote group0` [16:03:26] er... 
:D [16:03:29] oops [16:03:34] i will fix after our team meeting [16:03:40] 👍🏾 [16:04:06] thanks [16:04:06] MatmaRex: Thanks for the notification [16:05:12] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [16:05:33] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [16:06:33] 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10BCornwall) Thanks for that, @ayounsi! Are you aware of https://gerrit.wikimedia.org/g/operations/software/latency-measurement ? It may or may not be relevant but I wanted to make sure it w... [16:07:16] (03PS2) 10Ssingh: pybal: install pybal from component in bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904203 (https://phabricator.wikimedia.org/T321309) [16:07:18] MatmaRex: yes, thank you. 
sorry, all [16:08:42] (03PS3) 10Cathal Mooney: Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) [16:08:54] (03CR) 10CI reject: [V: 04-1] Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [16:14:46] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10jcrespo) [16:15:12] (03CR) 10Ssingh: [C: 03+2] pybal: install pybal from component in bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:22:25] (03PS1) 10Ottomata: Install flink operator in wikikube staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) [16:24:12] (03PS1) 10Ssingh: pybal: add conditional for Python{2,3} packages [puppet] - 10https://gerrit.wikimedia.org/r/904227 (https://phabricator.wikimedia.org/T321309) [16:25:06] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40425/console" [puppet] - 10https://gerrit.wikimedia.org/r/904227 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:25:53] (03PS2) 10Ottomata: Install flink operator in wikikube staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) [16:26:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] pybal: add conditional for Python{2,3} packages [puppet] - 10https://gerrit.wikimedia.org/r/904227 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:29:58] !log btullis@cumin1001 Added views for new wiki: anpwiki T332458 [16:29:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) 
[16:30:04] T332458: Prepare and check storage layer for anpwiki - https://phabricator.wikimedia.org/T332458 [16:35:36] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40426/console" [puppet] - 10https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [16:37:03] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40427/console" [puppet] - 10https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [16:37:23] (03CR) 10David Caro: maintain_dbusers: move out of nfs to services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [16:37:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bullseye [16:37:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**WARN**) - Downtimed on Icinga/Aler... 
[16:39:41] (03CR) 10David Caro: [C: 03+2] maintain_dbusers: move out of nfs to services [puppet] - 10https://gerrit.wikimedia.org/r/899662 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [16:39:43] (03CR) 10David Caro: [V: 03+1 C: 03+2] maintain_dbusers: Remove unused param and adapt to best practices [puppet] - 10https://gerrit.wikimedia.org/r/899663 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [16:41:05] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: /srv 274074 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [16:41:11] (03CR) 10FNegri: [C: 03+2] [tbs.harbor] Clean up admin pwd management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [16:42:33] (03CR) 10David Caro: [C: 03+2] wmcs::nfs::primary: remove unused mysql_variances hiera [puppet] - 10https://gerrit.wikimedia.org/r/904161 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [16:43:11] (03CR) 10BCornwall: [V: 03+1 C: 03+2] varnish: Change systemd units Requires to BindsTo [puppet] - 10https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [16:43:29] (03CR) 10Dzahn: "I haven't added this or contributed to this repo before, would prefer if you could you do this with serviceops / people who usually merge " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899611 (owner: 10Jaime Nuche) [16:44:08] !log Disable puppet on A:cp to roll out T284555 [16:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:14] T284555: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 [16:46:39] (03PS2) 10Dzahn: alertmanager: create receiver for both 
sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [16:47:34] (03PS3) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [16:48:29] (03PS4) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [16:50:32] 10SRE, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (10JTannerWMF) Thanks for creating this task its a valid request. Our team can't prioritize it right now but its... [16:53:37] (03CR) 10Dzahn: ":( I had searched the puppet repo for this (which wasn't just a grep because of the way how team name and severity are combined).. and che" [puppet] - 10https://gerrit.wikimedia.org/r/904069 (owner: 10Filippo Giunchedi) [16:55:33] (03PS2) 10Herron: grizzly: adapt managed dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) [16:57:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [16:58:21] (03PS1) 10David Caro: wmcs::eqiad1::control: remove passwords from default yaml [puppet] - 10https://gerrit.wikimedia.org/r/904234 [16:58:23] (03PS3) 10Herron: grizzly: adapt managed dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) [16:58:57] (03CR) 10David Caro: [C: 03+2] wmcs::eqiad1::control: remove passwords from default yaml [puppet] - 10https://gerrit.wikimedia.org/r/904234 (owner: 10David Caro) [16:59:12] (03PS1) 10Volans: sre.hosts.provision: fix NIC link 
detection [cookbooks] - 10https://gerrit.wikimedia.org/r/904235 (https://phabricator.wikimedia.org/T326661) [16:59:29] (03PS4) 10Herron: grizzly: adapt managed dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1700) [17:00:53] (03PS1) 10Dzahn: alertmanager: update sre-collab IRC channel name, don't use test channel [puppet] - 10https://gerrit.wikimedia.org/r/904236 [17:02:36] (03PS2) 10Dzahn: alertmanager: update sre-collab IRC channel name, don't use test channel [puppet] - 10https://gerrit.wikimedia.org/r/904236 (https://phabricator.wikimedia.org/T329587) [17:02:49] (03CR) 10Dzahn: "keeping the receivers, just changing the channel name then: https://gerrit.wikimedia.org/r/c/operations/puppet/+/904236" [puppet] - 10https://gerrit.wikimedia.org/r/904069 (owner: 10Filippo Giunchedi) [17:05:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/904235 (https://phabricator.wikimedia.org/T326661) (owner: 10Volans) [17:07:47] (03CR) 10Dzahn: "sorry for the noise if this broke puppet on prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/904236 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:07:51] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) Capturing some detail from the meeting today: This particular incident gave responders [[ https://grafana-rw.wikimedia.org/d/VCK8-FpZz/cwhite-logstash?orgId=1&... 
[17:08:03] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: fix NIC link detection [cookbooks] - 10https://gerrit.wikimedia.org/r/904235 (https://phabricator.wikimedia.org/T326661) (owner: 10Volans) [17:08:29] (03CR) 10Dzahn: "sorry for the noise if this broke puppet on prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/904069 (owner: 10Filippo Giunchedi) [17:10:14] (03Merged) 10jenkins-bot: sre.hosts.provision: fix NIC link detection [cookbooks] - 10https://gerrit.wikimedia.org/r/904235 (https://phabricator.wikimedia.org/T326661) (owner: 10Volans) [17:11:52] !log Re-enable puppet on A:cp - T284555 [17:11:55] (03CR) 10Raymond Ndibe: maintain-dbusers: run isort and black and use pep563 types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [17:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:57] T284555: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 [17:14:38] (03CR) 10Raymond Ndibe: maintain-dbusers: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [17:15:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [17:18:30] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [17:20:03] (03CR) 10BCornwall: [C: 03+2] docker-service-shim: change Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895877 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [17:26:29] (03PS1) 10Ssingh: P:lvs: do not enable legacy vlan names for bullseye [puppet] - 
10https://gerrit.wikimedia.org/r/904237 (https://phabricator.wikimedia.org/T321309) [17:27:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:59] !log Disable puppet on A:cp to roll out another T284555 [17:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:04] T284555: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 [17:29:16] (03CR) 10BCornwall: [V: 03+1 C: 03+2] ats-mtail: Change systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [17:30:05] (03PS1) 10Hashar: Migrate from git fat to git lfs [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) [17:30:35] (03CR) 10CI reject: [V: 04-1] Migrate from git fat to git lfs [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [17:31:08] (03CR) 10Cathal Mooney: "LGTM! 
Covers the majority of elements from the task afaik, but I will leave to Arzhel to give the +1 and make sure it matches what he had " [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [17:31:15] (03PS2) 10Ssingh: P:lvs: do not enable legacy vlan names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904237 (https://phabricator.wikimedia.org/T321309) [17:32:34] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Cmjohnson) This is showing 6 disks failed. Is it possible there is a different problem that is causing the disks to fail? I do not see any errors for the raid controller [17:33:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [17:35:58] (03CR) 10BCornwall: [V: 03+1 C: 03+2] fifo-log-demux: systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [17:36:32] (03PS3) 10Ssingh: P:lvs: do not enable legacy vlan names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904237 (https://phabricator.wikimedia.org/T321309) [17:36:38] (03PS2) 10BCornwall: fifo-log-demux: systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) [17:36:43] (03CR) 10BCornwall: [V: 03+2] fifo-log-demux: systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [17:38:39] (03CR) 10BBlack: [C: 03+1] P:lvs: do not enable legacy vlan names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904237 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:38:53] 
!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [17:39:00] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye [17:39:01] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1013.eqiad.wmnet with OS bullseye [17:39:07] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with... [17:39:23] (03CR) 10Ssingh: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40431/console NOOP on lvs4009." 
[puppet] - 10https://gerrit.wikimedia.org/r/904237 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:39:47] (03CR) 10Ssingh: [C: 03+2] P:lvs: do not enable legacy vlan names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/904237 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:40:16] 10SRE, 10Traffic, 10VPS-project-Codesearch, 10serviceops, 10Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BCornwall) [17:41:49] BGP alerts in ulsfo expected shortly [17:42:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bullseye [17:42:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye [17:42:45] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:43:48] !log Re-enable puppet on A:cp - T284555 [17:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:53] T284555: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 [17:45:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [17:47:45] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:58] 
(03PS1) 10Dduvall: docker_registry_ha: Do not lose original request URI during JWT auth [puppet] - 10https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) [17:52:45] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:57:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1014.eqiad.wmnet with reason: PC maint [17:57:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1014.eqiad.wmnet with reason: PC maint [17:57:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [18:00:04] dduvall and dancy: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1800). Please do the needful. [18:00:04] dduvall and dancy: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T1800) [18:01:07] o/ [18:01:54] o/ [18:02:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [18:03:03] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40433/console" [puppet] - 10https://gerrit.wikimedia.org/r/895885 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [18:04:42] i'll be rolling group0 and group1 for 1.41.0-wmf.2 today. pinging risky patch folks for awareness since there are a few big ones this week: Urbanecm, tgr, kostajh, matej_suchanek, cscott, duesen [18:05:33] ack. 
my risky patch should be backported by now (should I even include such risky patches on the train task in that case?), so hopefully shouldn't change anything more. [18:06:27] urbanecm: hmm, good question. my stance is that it couldn't hurt but it's also good to know that rolling back probably wouldn't be the right course in that case [18:06:45] recourse in the case of failure that is [18:06:56] urbanecm: I like and appreciate the risky patch reports. [18:07:07] i'll continue doing that in that case :) [18:07:15] ty <3 [18:07:55] dduvall: true. the issue with my patch is that it can't be reverted without making an additional action after that revert, so that's why i backported it. [18:08:48] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904243 (https://phabricator.wikimedia.org/T330208) [18:08:50] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904243 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [18:09:33] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904243 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [18:11:52] (03PS1) 10Dzahn: microsites: add monitor for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904244 (https://phabricator.wikimedia.org/T327976) [18:12:15] (03CR) 10Dzahn: [C: 03+2] microsites: add monitor for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904244 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [18:12:27] urbanecm: gotcha! well thanks for chiming in here with details and updates. 
that's really the most important thing overall :) [18:12:41] (03CR) 10CI reject: [V: 04-1] microsites: add monitor for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904244 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [18:12:47] 10SRE, 10Keyholder, 10VPS-project-Codesearch, 10serviceops, 10Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BCornwall) [18:13:23] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 519663512 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:13:39] np! [18:13:39] 10SRE, 10Keyholder, 10VPS-project-Codesearch, 10serviceops, 10Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BCornwall) Removing the Traffic team as our services have been rolled out with the change.... 
[18:13:47] (03PS2) 10Herron: alerting_host: failover icinga and alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) [18:13:52] 10SRE, 10Keyholder, 10VPS-project-Codesearch, 10serviceops, 10Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BCornwall) 05In progress→03Open [18:15:22] 10SRE, 10Keyholder, 10VPS-project-Codesearch, 10serviceops, 10Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BCornwall) a:05BCornwall→03None [18:16:44] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.2 refs T330208 [18:16:52] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [18:16:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [18:19:18] (03PS1) 10Dzahn: microsites: add monitor for transparency/transparency-archive site [puppet] - 10https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) [18:19:46] (03CR) 10CI reject: [V: 04-1] microsites: add monitor for transparency/transparency-archive site [puppet] - 
10https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [18:19:48] (03CR) 10Ahmon Dancy: docker_registry_ha: Do not lose original request URI during JWT auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: 10Dduvall) [18:19:59] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/904247 (https://phabricator.wikimedia.org/T333479) [18:20:18] (03PS2) 10Dzahn: microsites: add monitor for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904244 (https://phabricator.wikimedia.org/T327976) [18:20:51] (03Abandoned) 10Ladsgroup: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/904247 (https://phabricator.wikimedia.org/T333479) (owner: 10Gerrit maintenance bot) [18:21:29] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/904248 (https://phabricator.wikimedia.org/T333480) [18:21:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:22:08] (03PS2) 10Dzahn: microsites: add monitor for transparency/transparency-archive site [puppet] - 10https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) [18:22:10] 10SRE, 10Traffic-Icebox, 10conftool, 10serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10BCornwall) @Joe, thank you for all the work on this ticket! Would you say that this is resolved since the CRs have all... 
[18:22:24] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:22:26] 10SRE, 10Traffic-Icebox, 10conftool, 10serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10BCornwall) 05Open→03Stalled a:03Joe [18:22:30] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:22:30] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:22:35] (03CR) 10CI reject: [V: 04-1] microsites: add monitor for transparency/transparency-archive site [puppet] - 10https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [18:23:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [18:23:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bullseye [18:23:29] (03PS3) 10Dzahn: microsites: add monitor for transparency/transparency-archive site [puppet] - 10https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) [18:23:38] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - 
https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**WARN**) - Downtimed on Icinga/Aler... [18:23:39] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 543544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:23:52] 10SRE, 10Traffic-Icebox: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (10BCornwall) 05Open→03Resolved a:03BCornwall Setting as resolved as the fixes were applied. If there's a want for general disk alerting/k... [18:24:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T333480 [18:24:49] T333480: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T333480 [18:25:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T333480 [18:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1160 with weight 0 T333480', diff saved to https://phabricator.wikimedia.org/P45979 and previous config saved to /var/cache/conftool/dbconfig/20230329-182536-ladsgroup.json [18:26:06] 10SRE, 10PyBal, 10Traffic: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (10BCornwall) [18:27:01] 10SRE, 10PyBal, 10Traffic: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (10BCornwall) 05Stalled→03In progress [18:27:30] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:27:30] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:27:49] hm? [18:27:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [18:28:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1013.eqiad.wmnet with OS bullseye [18:28:09] 10SRE, 10Commons, 10MediaWiki-File-management, 10RESTBase-API, and 3 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (10BCornwall) [18:28:13] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye [18:28:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with... [18:28:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [18:29:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10PyBal, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. 
- https://phabricator.wikimedia.org/T239392 (10BCornwall) [18:29:08] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [18:29:14] (03CR) 10Dzahn: [C: 03+2] "We currently have these in the test channel and I want them to get attention in the new channel:" [puppet] - 10https://gerrit.wikimedia.org/r/904236 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [18:29:35] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [18:29:41] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed w... [18:29:55] alrighty. new errors dashboard looks ok, all errors dashboard looks... normal, varnish looks ok, slow db query dashboard looks ok. 
rolling to group1 [18:31:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:32:04] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904267 (https://phabricator.wikimedia.org/T330208) [18:32:06] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904267 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [18:32:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:32:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:32:51] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904267 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [18:32:52] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Cmjohnson) @wiki_willy ms-fe1013 and thanos-fe1004 both installed but did not set puppet certificates correctly and now they both just fail when I try... [18:35:39] (03PS2) 10Dduvall: docker_registry_ha: Do not lose original request URI during JWT auth [puppet] - 10https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) [18:35:58] (03CR) 10Dduvall: docker_registry_ha: Do not lose original request URI during JWT auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: 10Dduvall) [18:37:53] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Cmjohnson) @wiki_willy all 3 of these servers are well out of warranty (2-3 years). 
analytics1068 is marked failed in netbox [18:38:04] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@d66d6e0]: bump glent to 0.3.3 [18:38:08] (03CR) 10Ahmon Dancy: [C: 03+1] docker_registry_ha: Do not lose original request URI during JWT auth [puppet] - 10https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: 10Dduvall) [18:38:20] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@d66d6e0]: bump glent to 0.3.3 (duration: 00m 16s) [18:38:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:39:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:39:18] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.2 refs T330208 [18:39:26] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [18:43:45] urbanecm: on the surface https://phabricator.wikimedia.org/T333483 doesn't seem related to your risky patch but it has similar characteristics of https://phabricator.wikimedia.org/T330691 which may be in your purview [18:44:09] i'm trying to suss out whether it should be a blocker atm [18:45:06] !log dduvall@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.2 refs T330208 (duration: 05m 48s) [18:45:12] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [18:45:26] oh it happened in a maintenance script. 
maybe it's not that dire [18:46:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [18:46:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:46:53] (03PS1) 10Ssingh: hiera: lvs/interfaces: update interface name for lvs4010 [puppet] - 10https://gerrit.wikimedia.org/r/904268 (https://phabricator.wikimedia.org/T321309) [18:47:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:47:41] (03CR) 10Ssingh: [C: 03+2] hiera: lvs/interfaces: update interface name for lvs4010 [puppet] - 10https://gerrit.wikimedia.org/r/904268 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:48:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:48:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:48:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [18:50:15] (03PS2) 10Ladsgroup: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/904248 (https://phabricator.wikimedia.org/T333480) (owner: 10Gerrit maintenance bot) [18:50:20] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/904248 (https://phabricator.wikimedia.org/T333480) (owner: 10Gerrit maintenance bot) [18:50:58] !log Starting s4 eqiad failover from db1138 to db1160 - T333480 [18:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:03] T333480: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T333480 [18:51:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Promote db1160 to s4 primary T333480', diff saved to https://phabricator.wikimedia.org/P45980 and previous config saved to /var/cache/conftool/dbconfig/20230329-185125-ladsgroup.json [18:52:59] (03PS3) 10Nray: Update "United States" static page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681) [18:53:27] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (10colewhite) This started happening to the logstash-k8s daily index today: https://logstash.wikimedia.o... [18:54:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) After the switch configuration step I get the output below and ` Testing Redfish API connection to db1209 (10.65.1.88) Retrying (Retry(total=2, connect=None, read=None, redirect... 
[18:54:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:54:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1138 T333480', diff saved to https://phabricator.wikimedia.org/P45981 and previous config saved to /var/cache/conftool/dbconfig/20230329-185431-ladsgroup.json [18:55:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [18:55:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [18:55:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [19:05:00] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [19:07:00] (03PS1) 10Dzahn: microsites: add monitoring for WDQS and CDQS UI sites [puppet] - 10https://gerrit.wikimedia.org/r/904271 (https://phabricator.wikimedia.org/T327976) [19:10:24] (03PS1) 10Dzahn: microsites: add monitoring for security.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904273 (https://phabricator.wikimedia.org/T327976) [19:11:14] (03PS1) 10Ssingh: pybal: install prometheus-client from component [puppet] - 10https://gerrit.wikimedia.org/r/904274 (https://phabricator.wikimedia.org/T321309) [19:14:08] !log disable puppet on A:lvs to roll out pybal prometheus-client change [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:03] (03CR) 10Ssingh: [C: 03+2] pybal: install prometheus-client from component [puppet] - 10https://gerrit.wikimedia.org/r/904274 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:15:33] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access 
to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ladsgroup) Blocked on approval by owners of `analytics-privatedata-users`: @Ottomata or @odimitrijevic [19:19:06] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:19:33] (03CR) 10Hashar: "I should probably amend the commit message to be more descriptive." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [19:20:23] !log [enable] puppet on A:lvs to roll out pybal prometheus-client change [19:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:42] !log force puppet agent run on A:lvs to additionally confirm nothing broke [19:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:42] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:28:50] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Priority Backlog 📥): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10hashar) [19:36:40] (03PS2) 10Ladsgroup: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) [19:38:41] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Priority Backlog 📥): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10hashar) Some caution, #scap does not properly support `git-lfs`. On the scap targets the local git repository cache has a remote set to the deploymen... 
[19:48:55] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:21] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [19:50:23] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [19:53:39] (03Abandoned) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [19:56:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [19:59:24] (03PS3) 10Ladsgroup: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) [19:59:26] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10wiki_willy) @Jclark-ctr has a few spares onsite, so we can probably use those as replacements. Thanks, Willy [20:00:04] (03CR) 10CI reject: [V: 04-1] Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T2000). [20:00:04] AaronSchulz and nray: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:36] (if someone else could deploy, that'd be grand) [20:00:41] o/ [20:01:39] (03PS4) 10Ladsgroup: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) [20:01:53] I can deploy [20:02:11] (ty) [20:02:20] * AaronSchulz waits for that reversion [20:02:46] (03PS4) 10Nray: Update "United States" static page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681) [20:03:16] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [20:03:26] AaronSchulz: hm, what revert are you talking about? can I deploy your patch or is it blocked on that/something? [20:03:32] ok, looks like that isn't being deployed now [20:03:39] (03PS5) 10Nray: Update "United States" static page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681) [20:03:46] * AaronSchulz can go now then [20:04:11] hm? 
[20:04:15] taavi: I was looking at https://gerrit.wikimedia.org/r/893552 but nevermind that [20:04:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray) [20:04:48] nray: starting from your patch in the meantime [20:04:54] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [20:04:58] @taavi thank you [20:05:17] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [20:05:23] (03Merged) 10jenkins-bot: Update "United States" static page to facilitate synthetic testing of T331681 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903835 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray) [20:05:36] (03CR) 10Aaron Schulz: Add per-action component-level profiling in statsd using excimer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [20:05:45] !log taavi@deploy2002 Started scap: Backport for [[gerrit:903835|Update "United States" static page to facilitate synthetic testing of T331681 (T331681)]] [20:05:51] (03CR) 10Ahmon Dancy: "I will test in train-dev" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [20:05:51] T331681: Make a proposal for supporting the disabling of multiple features in client preferences - https://phabricator.wikimedia.org/T331681 [20:06:42] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [20:06:58] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host db1211.mgmt.eqiad.wmnet with reboot policy FORCED [20:07:16] !log taavi@deploy2002 nray and taavi: Backport 
for [[gerrit:903835|Update "United States" static page to facilitate synthetic testing of T331681 (T331681)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:07:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [20:07:24] nray: please test if possible [20:07:31] @taavi thank you, will do now [20:07:54] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [20:09:26] BGP alerts in ulsfo expected [20:10:02] @taavi looks good, you can proceed! [20:10:06] thanks, syncing [20:10:07] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1211.mgmt.eqiad.wmnet with reboot policy FORCED [20:10:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bullseye [20:10:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye [20:13:45] (03PS3) 10Jforrester: Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 (https://phabricator.wikimedia.org/T333448) (owner: 10Bartosz Dziewoński) [20:15:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:15:30] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:903835|Update "United States" static page to facilitate synthetic testing of T331681 (T331681)]] (duration: 09m 45s) [20:15:32] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:15:36] T331681: Make a 
proposal for supporting the disabling of multiple features in client preferences - https://phabricator.wikimedia.org/T331681 [20:15:36] that's now live [20:15:45] AaronSchulz: your arclamp patch is up next [20:15:50] ok [20:15:50] @taavi thank you for your help! [20:16:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [20:16:45] (03Merged) 10jenkins-bot: Add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [20:17:08] !log taavi@deploy2002 Started scap: Backport for [[gerrit:893839|Add per-action component-level profiling in statsd using excimer (T225968)]] [20:17:13] T225968: Per component/skin/extension profiling of entry points with Grafana dashboards - https://phabricator.wikimedia.org/T225968 [20:18:33] !log taavi@deploy2002 aaron and taavi: Backport for [[gerrit:893839|Add per-action component-level profiling in statsd using excimer (T225968)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:18:47] AaronSchulz: please test [20:19:07] aye [20:21:05] (03CR) 10Ladsgroup: Revert "Revert "mwscript: Switch to use run.php"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [20:21:07] (03CR) 10Ahmon Dancy: [C: 03+1] Revert "Revert "mwscript: Switch to use run.php"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [20:23:30] don't see anything aversive [20:23:54] ok, I'll sync [20:24:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1073'] [20:24:16] 
!log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1073'] [20:25:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [20:25:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [20:26:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [20:26:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [20:27:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [20:28:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [20:28:13] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [20:28:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [20:29:00] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:893839|Add per-action component-level profiling in statsd using excimer (T225968)]] (duration: 11m 52s) [20:29:06] T225968: Per component/skin/extension profiling of entry points with Grafana dashboards - https://phabricator.wikimedia.org/T225968 [20:29:08] aaand it's live [20:29:56] good, I see the new stats data [20:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:34:58] (KubernetesAPILatency) resolved: High Kubernetes API 
latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:35:22] (MjolnirUpdateFailureRateExceedesThreshold) firing: (2) Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [20:43:06] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:43:22] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:48:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bullseye [20:52:48] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabled... 
[21:05:59] jouncebot: nowandnext [21:05:59] No deployments scheduled for the next 8 hour(s) and 54 minute(s) [21:05:59] In 8 hour(s) and 54 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0600) [21:05:59] In 8 hour(s) and 54 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0600) [21:06:04] awesome [21:15:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs4010.ulsfo.wmnet [21:15:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4010.ulsfo.wmnet [21:17:55] (03PS1) 10Jdlrobson: Disable Vector js/css sharing on pl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) [21:23:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:24:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:24:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:27:11] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [21:28:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:43:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:44:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [21:45:22] (MjolnirUpdateFailureRateExceedesThreshold) resolved: (2) Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - 
https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [21:45:32] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@ada9bb0]: disable auto-versioning of glent uploads [21:45:47] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@ada9bb0]: disable auto-versioning of glent uploads (duration: 00m 14s) [21:45:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Papaul) [21:46:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003'] [21:46:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['gerrit1003'] [21:46:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [21:47:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:48:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:49:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:50:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:50:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003'] [21:50:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['gerrit1003'] [21:53:34] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:54:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1211.mgmt.eqiad.wmnet with reboot policy FORCED [21:58:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [21:58:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [21:59:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1212.mgmt.eqiad.wmnet with reboot policy FORCED [22:00:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:01:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:04:41] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:04:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:04:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1211.mgmt.eqiad.wmnet with reboot policy FORCED [22:06:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1213.mgmt.eqiad.wmnet with reboot policy FORCED [22:06:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:06:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:11:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1212.mgmt.eqiad.wmnet with reboot policy FORCED [22:13:20] !log jclark@cumin1001 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:13:34] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (phaultfinder) [22:13:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1214.mgmt.eqiad.wmnet with reboot policy FORCED [22:13:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:16:13] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:17:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1213.mgmt.eqiad.wmnet with reboot policy FORCED [22:18:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1215.mgmt.eqiad.wmnet with reboot policy FORCED [22:23:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1214.mgmt.eqiad.wmnet with reboot policy FORCED [22:24:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1216.mgmt.eqiad.wmnet with reboot policy FORCED [22:26:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1215.mgmt.eqiad.wmnet with reboot policy FORCED [22:28:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1217.mgmt.eqiad.wmnet with reboot policy FORCED [22:32:58] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100% [22:33:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1216.mgmt.eqiad.wmnet with reboot policy FORCED [22:35:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1218.mgmt.eqiad.wmnet with reboot policy FORCED [22:36:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1217.mgmt.eqiad.wmnet with reboot policy FORCED [22:37:22] !log pt1979@cumin2002 START - Cookbook
sre.hosts.provision for host db1217.mgmt.eqiad.wmnet with reboot policy FORCED [22:38:38] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (phaultfinder) [22:39:42] RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [22:46:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [22:47:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:47:28] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100% [22:50:22] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:22] RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [22:52:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:04] RECOVERY - Check systemd state on phab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1218.mgmt.eqiad.wmnet with reboot policy FORCED [22:59:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1217.mgmt.eqiad.wmnet with reboot policy FORCED [23:01:35] !log
pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1219.mgmt.eqiad.wmnet with reboot policy FORCED [23:01:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1220.mgmt.eqiad.wmnet with reboot policy FORCED [23:08:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1219.mgmt.eqiad.wmnet with reboot policy FORCED [23:09:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1220.mgmt.eqiad.wmnet with reboot policy FORCED [23:10:11] (CR) Cwhite: [C: +1] wmnet: prep statsd/graphite records for easier write failover [dns] - https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) (owner: Filippo Giunchedi) [23:10:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1221.mgmt.eqiad.wmnet with reboot policy FORCED [23:10:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1222.mgmt.eqiad.wmnet with reboot policy FORCED [23:15:42] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:30] (CR) Dzahn: [C: +2] microsites: add monitor for sitemaps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/904244 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:20:49] (CR) Dzahn: [C: +2] microsites: add monitor for transparency/transparency-archive site [puppet] - https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:20:57] (PS4) Dzahn: microsites: add monitor for transparency/transparency-archive site [puppet] - https://gerrit.wikimedia.org/r/904266 (https://phabricator.wikimedia.org/T327976) [23:22:25] (CR) Dzahn: [V: +2] microsites: add monitor for transparency/transparency-archive site [puppet] - https://gerrit.wikimedia.org/r/904266
(https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:22:47] (CR) Dzahn: [C: +2] microsites: add monitoring for WDQS and CDQS UI sites [puppet] - https://gerrit.wikimedia.org/r/904271 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:22:55] (PS2) Dzahn: microsites: add monitoring for WDQS and CDQS UI sites [puppet] - https://gerrit.wikimedia.org/r/904271 (https://phabricator.wikimedia.org/T327976) [23:23:08] (CR) Dzahn: [V: +2] microsites: add monitoring for WDQS and CDQS UI sites [puppet] - https://gerrit.wikimedia.org/r/904271 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:23:22] (CR) Dzahn: [C: +2] microsites: add monitoring for security.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/904273 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:23:28] (PS2) Dzahn: microsites: add monitoring for security.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/904273 (https://phabricator.wikimedia.org/T327976) [23:23:35] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [23:23:38] (CR) Dzahn: [V: +2] microsites: add monitoring for security.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/904273 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn) [23:23:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [23:25:49] (CR) Dzahn: [C: +2] site: add contint2002 to ci::master role [puppet] - https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:27:32] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:48] (CR) Dzahn: [C: +2] site: add contint2002 to ci::master role (2 comments) [puppet] -
https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:40:27] (PS1) Dzahn: site: remove superfluous insetup role for contint2002 [puppet] - https://gerrit.wikimedia.org/r/904370 (https://phabricator.wikimedia.org/T324659) [23:40:32] (ProbeDown) firing: Service miscweb2002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:40:52] ^ I just added this check [23:41:04] the part that surprises me is that it's in this channel. I'll fix both [23:42:54] (CR) Dzahn: [C: +2] "on contint2002 at first puppet run" [puppet] - https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:43:07] (CR) Dzahn: [C: +2] "removed insetup role at https://gerrit.wikimedia.org/r/c/operations/puppet/+/904370" [puppet] - https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:44:29] (CR) Dzahn: [C: +2] site: remove superfluous insetup role for contint2002 [puppet] - https://gerrit.wikimedia.org/r/904370 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:45:32] (ProbeDown) firing: (9) Service contint2002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:45:53] ^ also because contint2002 just got the role...
sigh [23:48:11] !log contint2002 - a2dismod mpm_event (ONCE AGAIN this year old issue when applying roles with apache for the first time) - running puppet - now it can actually install PHP 7.3 and start apache [23:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:45] !log contint2002 - a2dismod mpm_event (ONCE AGAIN this year old issue when applying roles with apache for the first time) - running puppet - now it can actually install PHP 7.3 and start apache T324659 [23:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:50] T324659: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 [23:49:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1222.mgmt.eqiad.wmnet with reboot policy FORCED [23:49:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1221.mgmt.eqiad.wmnet with reboot policy FORCED [23:50:02] (CR) Dzahn: [C: +2] "A manual "a2dismod mpm_event" followed by a puppet run fixes the (common!) issue with "php7.3]/Exec[ensure_present_mod_php7.3]/returns: ch" [puppet] - https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:50:32] (ProbeDown) firing: (9) Service contint2002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:50:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1223.mgmt.eqiad.wmnet with reboot policy FORCED [23:51:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1224.mgmt.eqiad.wmnet with reboot policy FORCED [23:51:36] (CR) Dzahn: [C: +2] "I confirmed zuul, zuul-merger and jenkins are all masked.
This also means we get alerts that integration.wikimedia.org isn't working on i" [puppet] - https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn) [23:52:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on contint2002.wikimedia.org with reason: WIP-known-to-be-debugged-new-host [23:53:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on contint2002.wikimedia.org with reason: WIP-known-to-be-debugged-new-host [23:59:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1223.mgmt.eqiad.wmnet with reboot policy FORCED [23:59:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1224.mgmt.eqiad.wmnet with reboot policy FORCED