[00:11:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[00:14:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[00:17:56] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:42] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:29:59] (PS3) Stang: Fix broken wordmarks in Bengali projects [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124)
[00:30:19] (CR) Stang: Fix broken wordmarks in Bengali projects (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: Stang)
[00:34:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1007.eqiad.wmnet with OS bullseye
[00:34:11] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye co...
[00:34:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1009.eqiad.wmnet with OS bullseye
[00:34:52] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye
[00:43:20] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage
[00:50:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage
[00:53:29] (PS2) Jdlrobson: [WIP] Logos: yaml can be populated by buildLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/843995
[00:53:40] (CR) CI reject: [V: -1] [WIP] Logos: yaml can be populated by buildLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/843995 (owner: Jdlrobson)
[00:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:54:35] (CR) Jdlrobson: "This modifies logos/config.yaml with the new logos we've created.
We either need to support local_wordmark / local_tagline directives in t" [mediawiki-config] - https://gerrit.wikimedia.org/r/843995 (owner: Jdlrobson)
[01:01:36] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:18] (CR) Legoktm: [C: +1] mariadb: new image for mariadb/mysql backups (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: BryanDavis)
[01:04:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1009.eqiad.wmnet with OS bullseye
[01:04:11] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye co...
[01:05:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1011.eqiad.wmnet with OS bullseye
[01:05:06] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye
[01:05:38] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:09:06] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 52 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:09:16] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:09:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 157 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:11:00] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:34] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:14:40] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 14 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:14:52] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 54 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:16:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[01:18:50] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:20:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[01:35:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1011.eqiad.wmnet with OS bullseye
[01:35:17] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye co...
[01:36:17] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) Open→Resolved @btullis these have been fixed, I updated the nic firmware and re-ran the image script.
[01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:51] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T317804 (Cmjohnson) Open→Resolved these are updated with kafka-stretch1001 and 1002
[01:41:35] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:22] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Cmjohnson) @herron for raid setup, are all the disk raid 50? I do not think that the OS will install with that setup? There are 8 750GB SSDs P...
[01:59:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:00:57] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:29] PROBLEM - MegaRAID on db2139 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:03:31] ACKNOWLEDGEMENT - MegaRAID on db2139 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T321147 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:03:36] SRE, ops-codfw: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (ops-monitoring-bot)
[02:06:17] (CR) Aftab: [C: +1] Fix broken wordmarks in Bengali projects [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: Stang)
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:45] SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Papaul) @Jclark-ctr @ayounsi is still looking into how he can prioritize T304677: Possible DHCP improvments
[03:05:52] (CR) Andrew Bogott: [C: +1] "Thanks for splitting this out! Let's merge this part tomorrow when we're both working."
[puppet] - https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[03:28:47] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:33] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:31] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:05] RECOVERY - MegaRAID on db2139 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:38:44] (PS2) KartikMistry: Enable specialcontribute campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306)
[04:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:54:43] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:30:43] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol
2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:33:21] (PS2) KartikMistry: Update cxserver to 2022-10-18-161640-production [deployment-charts] - https://gerrit.wikimedia.org/r/842812 (https://phabricator.wikimedia.org/T317224)
[05:34:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:40:10] (CR) Giuseppe Lavagetto: [C: +1] "recheck" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/826859 (https://phabricator.wikimedia.org/T316347) (owner: JMeybohm)
[05:45:35] * kart_ updating cxserver..
[05:49:06] ah. No. I'll wait for some time. Need to check something before that..
[05:58:31] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:07:23] (PS1) Giuseppe Lavagetto: Add pip to python3-bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/844319
[06:40:36] !log enabled graceful-shutdown on drmrs Arelion BGP
[06:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:57] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:07] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T0700).
[07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] 'morning
[07:00:38] Moin
[07:00:57] kart_: do you want to self-deploy, or should i deploy for you?
[07:00:58] urbanecm: I'll go ahead with my patch..
[07:01:07] 👍
[07:01:52] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306) (owner: KartikMistry)
[07:02:35] (Merged) jenkins-bot: Enable specialcontribute campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306) (owner: KartikMistry)
[07:03:07] !log kartik@deploy1002 Started scap: Backport for [[gerrit:843003|Enable specialcontribute campaign (T319306)]]
[07:03:13] T319306: Adjust visibility for the option to translate in the Persistent Contribution entry point - https://phabricator.wikimedia.org/T319306
[07:03:35] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:843003|Enable specialcontribute campaign (T319306)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:06:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:09:19] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:843003|Enable specialcontribute campaign (T319306)]] (duration: 06m 11s)
[07:09:24] T319306: Adjust visibility for the option to
translate in the Persistent Contribution entry point - https://phabricator.wikimedia.org/T319306
[07:10:11] urbanecm: I'm done.
[07:10:55] Ack!
[07:12:06] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:24:51] (PS2) Urbanecm: Remove GEHomepageImpactModuleEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[07:27:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 101 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:33:12] (PS1) Urbanecm: [growth] Turn mentorship off by default [mediawiki-config] - https://gerrit.wikimedia.org/r/844428 (https://phabricator.wikimedia.org/T321056)
[07:33:36] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[07:34:22] (Merged) jenkins-bot: Remove GEHomepageImpactModuleEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[07:34:43] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:837114|Remove GEHomepageImpactModuleEnabled]]
[07:35:07] !log urbanecm@deploy1002 urbanecm and kharlan: Backport for [[gerrit:837114|Remove GEHomepageImpactModuleEnabled]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:39:11] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:837114|Remove GEHomepageImpactModuleEnabled]] (duration: 04m 27s)
[07:39:34] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/844428
(https://phabricator.wikimedia.org/T321056) (owner: Urbanecm)
[07:40:18] (Merged) jenkins-bot: [growth] Turn mentorship off by default [mediawiki-config] - https://gerrit.wikimedia.org/r/844428 (https://phabricator.wikimedia.org/T321056) (owner: Urbanecm)
[07:40:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:844428|[growth] Turn mentorship off by default (T321056)]]
[07:40:45] T321056: Turn mentorship off on all wikis where the community has not yet setup a mentorship list - https://phabricator.wikimedia.org/T321056
[07:41:03] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:844428|[growth] Turn mentorship off by default (T321056)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:45:54] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:844428|[growth] Turn mentorship off by default (T321056)]] (duration: 05m 14s)
[07:46:04] T321056: Turn mentorship off on all wikis where the community has not yet setup a mentorship list - https://phabricator.wikimedia.org/T321056
[07:49:59] (PS1) Filippo Giunchedi: sre: remove 'host' label from PybalBackendDown [alerts] - https://gerrit.wikimedia.org/r/844429 (https://phabricator.wikimedia.org/T320627)
[07:57:22] (CR) Filippo Giunchedi: [C: +2] sre: remove 'host' label from PybalBackendDown [alerts] - https://gerrit.wikimedia.org/r/844429 (https://phabricator.wikimedia.org/T320627) (owner: Filippo Giunchedi)
[07:57:52] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 89 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:00:04] hashar and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T0800)
[08:04:48] ACKNOWLEDGEMENT - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:04:48] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:04:48] ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:04:48] ACKNOWLEDGEMENT - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:04:48] ACKNOWLEDGEMENT - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23.
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:31] I am running the train
[08:07:15] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/844433 (https://phabricator.wikimedia.org/T320511)
[08:07:17] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/844433 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[08:08:05] (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/844433 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[08:10:52] (PS1) Jelto: gitlab_runner: make allowed_images list configurable in hiera [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730)
[08:12:16] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.6 refs T320511
[08:12:21] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[08:15:54] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.6 refs T320511 (duration: 03m 37s)
[08:16:26] SRE, ops-codfw, DBA: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (Peachey88)
[08:20:15] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37599/console" [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[08:21:27] looks like the train is okish
[08:22:32] (PS2) Jelto: gitlab_runner: make allowed_images list configurable in hiera [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730)
[08:22:49] (CR) JMeybohm: [C: -1] Remove references to deprecated kubeyaml (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/842819
(https://phabricator.wikimedia.org/T316348) (owner: Clément Goubert)
[08:23:23] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37600/console" [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[08:27:59] (PS1) Urbanecm: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082)
[08:30:14] SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Joe) >>! In T291918#7387775, @Joe wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, in the future af...
[08:30:42] (CR) Jelto: [C: -1] "With I112110d2553a41e839f9990c39ac2a872135c588 allowed_images for Trusted Runners and Shared Runners will be separated. I can rebase this " [puppet] - https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: SBassett)
[08:35:05] (Restored) Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: Matthias Mullie)
[08:35:26] (Restored) Matthias Mullie: [SearchVue] Enable extension on beta ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: Matthias Mullie)
[08:35:37] (CR) Kosta Harlan: [C: +1] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082) (owner: Urbanecm)
[08:43:10] SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Clement_Goubert) >>!
In T291918#7387656, @jijiki wrote: > Naming things is hard though, I do not agree with the `kube` prefix, in the future after baremetal mediawiki se...
[08:43:22] hi kostajh, should we include https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/844430 in the version bump as well?
[08:44:21] urbanecm: I was going to suggest it, but it seems like either way is fine
[08:44:44] There’s a chance it wouldn’t work as I intend which would mean more time to create a revert, build a new image, etc
[08:44:45] (CR) Cathal Mooney: [C: +1] "LGTM! Nice to be able to remove stuff :)" [homer/public] - https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:45:12] kostajh: i see. well, at worst i'll try out a revert as well :))
[08:45:30] but if you say there's a chance it doesn't work, perhaps worth doing it in two deployments?
[08:49:24] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ayounsi) I don't fully understand your comment above. I see that the alerts above are gone, but are now replaced with: ` test_enabled_not_connected xe-0/0/17 Interface enab...
[08:50:30] (CR) Ayounsi: [C: +2] Management: remove access/wifi exceptions [homer/public] - https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:51:12] (Merged) jenkins-bot: Management: remove access/wifi exceptions [homer/public] - https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:51:58] RECOVERY - Check systemd state on mw1439 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:35] (PS2) Cparle: Alerts for image suggestions pipeline [alerts] - https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[08:54:18] (Abandoned) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[08:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:55:08] (CR) CI reject: [V: -1] Alerts for image suggestions pipeline [alerts] - https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: Cparle)
[08:55:34] (PS1) Ayounsi: Management: remove NAT include [homer/public] - https://gerrit.wikimedia.org/r/844443 (https://phabricator.wikimedia.org/T320962)
[08:56:21] (CR) Ayounsi: [C: +2] Management: remove NAT include [homer/public] - https://gerrit.wikimedia.org/r/844443 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:56:56] (Merged) jenkins-bot: Management: remove NAT include [homer/public] - https://gerrit.wikimedia.org/r/844443
(https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:59:00] (PS1) Urbanecm: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082)
[08:59:26] urbanecm: yeah two separate deployments might make sense here
[08:59:57] kostajh: done :). can i start now? or do you prefer waiting for later in the day?
[09:01:42] !log remove DHCP server and access zone on mr1-eqiad - T320962
[09:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:47] T320962: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962
[09:01:55] (PS3) Cparle: Alerts for image suggestions pipeline [alerts] - https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[09:03:17] (CR) CI reject: [V: -1] Alerts for image suggestions pipeline [alerts] - https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: Cparle)
[09:06:58] SRE, ops-codfw, Data-Persistence-Backup: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (Ladsgroup) It's a backup source.
[09:09:19] urbanecm: go for it!
[09:09:25] okay!
[09:09:43] (CR) Urbanecm: [C: +2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082) (owner: Urbanecm)
[09:11:06] SRE, Infrastructure-Foundations: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (Aklapper)
[09:12:26] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[09:12:56] (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082) (owner: Urbanecm)
[09:13:22] patch is at deploy1002 already, running `helmfile -e staging -i apply` in the service's dir
[09:13:22] SRE, Infrastructure-Foundations: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (Aklapper) There's lots of (partially heated) past discussion. See stuff like `T289607` (no news for a year), `T250227`, `T6845`, etc etc. Phabricator itself has no captchas involved.
[09:13:32] (following https://wikitech.wikimedia.org/wiki/Add_Link#Deployment_2)
[09:13:34] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[09:13:50] SRE, ops-codfw, Data-Persistence-Backup: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (jcrespo) This is weird, //Rebuild// happens when a new disk is added.
It is now in a good state: Raid status: OK: optimal, 1 logical, 10 physical, WriteBack policy [09:13:53] the only changed thing is the version, continuing [09:14:12] (03PS1) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) [09:14:22] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:14:48] (03CR) 10CI reject: [V: 04-1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [09:14:52] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:15:16] (03PS4) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:16:29] `curl "https://staging.svc.eqiad.wmnet:4005/v1/linkrecommendations/wikipedia/be_x_old/Barack_Obama"` returns `{"message":"Page not found: Barack_Obama"}`, while production timeouts [09:16:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [09:16:48] (03PS2) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) [09:16:51] cswiki's example from https://wikitech.wikimedia.org/wiki/Add_Link#Deployment_2 also works fine [09:16:55] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [09:17:22] (03CR) 10CI reject: [V: 04-1] Add known_uid_mapping support to the production-images for spark 
[puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [09:17:33] `service-checker-swagger staging.svc.eqiad.wmnet https://staging.svc.eqiad.wmnet:4005 -t 2 -s /apispec_1.json` also works fine [09:17:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: Remove nutcracker config values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [09:18:11] kostajh: going with eqiad, unless i need to check anything else on staging? [09:18:35] urbanecm: sounds good [09:18:38] doing [09:18:42] !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [09:18:54] diff seems fine, continuing [09:19:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:48] (03PS5) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:20:03] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10ayounsi) Access security zone, DHCP server, NAT config removed from the routers. New DHCP relay feature enabled instead of the old bootp one. Netbo... [09:20:38] (03PS3) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) [09:20:47] urbanecm: cool. 
sometimes the eqiad one takes a few minutes to roll out [09:20:52] ack [09:21:11] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [09:21:14] (03CR) 10CI reject: [V: 04-1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [09:21:22] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [09:21:41] (03CR) 10David Caro: [C: 03+2] Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: 10Slavina Stefanova) [09:21:49] works fine in production as well, continuing with codfw [09:21:50] (03CR) 10David Caro: [V: 03+2 C: 03+2] Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: 10Slavina Stefanova) [09:21:54] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [09:22:58] (03PS6) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 [09:23:30] (03PS7) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 (https://phabricator.wikimedia.org/T313444) [09:23:36] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [09:23:52] and seems T321082 is now resolved :) [09:23:52] T321082: Requests to be-x-old.wikipedia.org result in HTTP 504 Gateway Timeout - https://phabricator.wikimedia.org/T321082 [09:24:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:32] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) I'm trying to capture this project also in https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProp... [09:24:40] (03PS8) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (https://phabricator.wikimedia.org/T313444) [09:26:08] (03CR) 10David Caro: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/832259 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:27:19] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10taavi) >>! In T314847#8326550, @cmooney wrote: >>> /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the s... 
[09:27:27] (03PS6) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:28:42] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [09:29:05] (03PS1) 10JMeybohm: Provide the cluster_cidr to kube-proxy on masters as well [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) [09:32:22] (03PS7) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:34:08] (03PS4) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) [09:34:36] (03PS1) 10JMeybohm: Provide the cluster_cidr to kube-proxy in wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500) [09:34:38] (03PS1) 10JMeybohm: Provide the cluster_cidr to kube-proxy in wikikube eqiad [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500) [09:40:59] (03CR) 10Clément Goubert: "This change is ready for review." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [09:43:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:43:10] (03PS5) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) [09:44:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37604/console" [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [09:45:11] _joe_: Any news on the maintenance script patch? [09:45:43] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) `mw-main` is probably the least misleading one, yes. I would like `mw-web` more, but it's going to mislead a lot of people into thinking it's just requests to wiki... [09:46:09] <_joe_> hoo: sorry, no, I've been too busy the last couple days :/ [09:46:16] (03CR) 10David Caro: P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [09:46:19] <_joe_> my bad [09:47:16] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to `mw-api-ext`? [09:47:56] No worries, just wanted to make sure its not forgotten [09:48:03] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! 
In T291918#8328103, @Clement_Goubert wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, i... [09:48:05] (03PS2) 10David Caro: P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [09:48:34] (03PS4) 10Giuseppe Lavagetto: maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man) [09:50:10] (03CR) 10CI reject: [V: 04-1] P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [09:53:45] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) >>! In T291918#8328319, @Clement_Goubert wrote: > This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to... 
[09:54:31] (03CR) 10Hnowlan: [C: 03+2] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [09:56:14] (03CR) 10David Caro: P:toolforge: use puppetdb for grid hba data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [09:58:03] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:58:06] (03Merged) 10jenkins-bot: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [09:58:28] (03PS2) 10Hnowlan: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) [10:00:31] (03PS3) 10Majavah: P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) [10:00:49] (03CR) 10Majavah: P:toolforge: use puppetdb for grid hba data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [10:01:31] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) Given both of your answers, I think `mw-web` is actually the better choice, barring calling it `mw-real-users` which is kind of weird. The API calls fro... 
[10:01:56] (03PS1) 10Volans: wmnet: remove subnet used for eqiad's wifi [dns] - 10https://gerrit.wikimedia.org/r/844451 (https://phabricator.wikimedia.org/T320962) [10:02:12] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:02:23] (03CR) 10Clément Goubert: [C: 03+2] mwdebug: Remove nutcracker config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:04:09] (03CR) 10Hnowlan: [C: 03+2] admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:04:11] (03CR) 10Clément Goubert: [C: 03+2] mwdebug: Remove nutcracker config values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:04:29] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (10jcrespo) 05Open→03Resolved a:03jcrespo Resolved- this is a backup host- it is ok to ignore it unless it reappears. [10:04:59] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) >>! In T314847#8328277, @taavi wrote: > Your comment was written in a way that made me understand that everything used in cod... [10:05:18] (03CR) 10Volans: [C: 03+2] "The prefix has been removed from Netbox" [dns] - 10https://gerrit.wikimedia.org/r/844451 (https://phabricator.wikimedia.org/T320962) (owner: 10Volans) [10:05:20] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) >>! 
In T314847#8328277, @taavi wrote: >>>! In T314847#8328272, @aborrero wrote: >> HAproxy uses LVS/ipvsadm for them under t... [10:06:34] (03Merged) 10jenkins-bot: mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:06:36] (03Merged) 10jenkins-bot: mwdebug: Remove nutcracker config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:06:46] (03PS1) 10Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) [10:07:42] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:07:45] (03Merged) 10jenkins-bot: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:08:40] (03PS5) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [10:08:50] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:09:44] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:10:38] (03CR) 10Jelto: "With I112110d2553a41e839f9990c39ac2a872135c588 allowed_images for Trusted Runners and Shared Runners will be separated. 
I can rebase this " [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) (owner: 10Brennen Bearnes) [10:13:06] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:17:33] !log Deploying mediawiki helm chart v0.2.4 on k8s-experimental mwdebug - T321042 [10:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:38] T321042: Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042 [10:17:43] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:18:25] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:18:42] (03PS1) 10Majavah: openstack: remove unused nova::placement manifests [puppet] - 10https://gerrit.wikimedia.org/r/844455 [10:18:56] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:19:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37607/console" [puppet] - 10https://gerrit.wikimedia.org/r/844455 (owner: 10Majavah) [10:20:14] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:20:33] jouncebot: nowandnext [10:20:33] No deployments scheduled for the next 2 hour(s) and 39 minute(s) [10:20:33] In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300) [10:22:06] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:28:08] (03PS1) 10Majavah: P:openstack: expose remaining APIs to the internet 
[puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) [10:28:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 77 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:29:55] (03PS2) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) [10:31:40] (03PS1) 10Jbond: P:lvs::configueration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:31:44] (03PS3) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) [10:32:33] (03CR) 10CI reject: [V: 04-1] P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: 10Majavah) [10:32:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37611/console" [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: 10Majavah) [10:32:46] (03PS2) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:33:47] (03PS4) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) [10:33:49] (03PS1) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) [10:34:08] (03CR) 10CI reject: [V: 04-1] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 
10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [10:34:48] (03PS3) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:36:26] (03CR) 10CI reject: [V: 04-1] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [10:36:39] (03PS4) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:37:52] (03PS5) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:39:48] (03CR) 10CI reject: [V: 04-1] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [10:40:54] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:13] (03PS2) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) [10:42:40] (03PS1) 10Btullis: Add postgresql replication password for new an-db servers [labs/private] - 10https://gerrit.wikimedia.org/r/844460 (https://phabricator.wikimedia.org/T319440) [10:42:47] (03CR) 10CI reject: [V: 04-1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:43:19] (03CR) 10Giuseppe Lavagetto: P:lvs::configuration: Store all site data in an accessible structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [10:43:32] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add postgresql replication password for new an-db servers [labs/private] - 10https://gerrit.wikimedia.org/r/844460 (https://phabricator.wikimedia.org/T319440) 
(owner: 10Btullis) [10:44:33] (03PS3) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) [10:44:39] (03PS6) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:45:11] (03CR) 10CI reject: [V: 04-1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:45:58] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37619/console" [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:47:14] (03PS4) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) [10:54:19] (03PS5) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) [10:54:54] (03CR) 10CI reject: [V: 04-1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:56:50] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37623/console" [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:57:20] (03PS7) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [10:58:25] (03PS6) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 
(https://phabricator.wikimedia.org/T321042) [11:01:00] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37624/console" [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [11:02:00] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:02:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [11:03:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [11:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35553 and previous config saved to /var/cache/conftool/dbconfig/20221019-110308-ladsgroup.json [11:03:13] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:05:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:05:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35554 and previous config saved to /var/cache/conftool/dbconfig/20221019-110552-ladsgroup.json [11:05:57] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:06:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:06:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:06:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T314041)', diff saved to https://phabricator.wikimedia.org/P35555 and previous config saved to /var/cache/conftool/dbconfig/20221019-110635-ladsgroup.json [11:06:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:07:19] (03PS6) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [11:07:46] !log upload wmf-beamer-style 0.2 to apt [11:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35556 and previous config saved to /var/cache/conftool/dbconfig/20221019-110902-ladsgroup.json [11:10:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:10:39] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:10:41] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:11:27] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:12:48] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:13:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 38): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37626/console" [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [11:13:38] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:14:17] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:16:01] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1138 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/844015 (https://phabricator.wikimedia.org/T321177) [11:16:05] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) [11:17:10] (03CR) 10Volans: wmnet: Update s4-master alias (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) (owner: 10Gerrit maintenance bot) [11:17:33] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/844017 (https://phabricator.wikimedia.org/T321178) [11:17:37] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/844018 (https://phabricator.wikimedia.org/T321178) [11:18:35] 10SRE: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10hnowlan) This _appears_ to have abated somewhat? https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13&from=1664190035000&to=now [11:21:56] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:22:56] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:23:20] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:24:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P35557 and previous config saved to /var/cache/conftool/dbconfig/20221019-112409-ladsgroup.json [11:25:10] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:25:12] (03CR) 10Giuseppe Lavagetto: P:lvs::configuration: Store all site data in an accessible structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [11:29:22] (03PS2) 10Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) [11:29:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35558 and previous config saved to /var/cache/conftool/dbconfig/20221019-112925-ladsgroup.json [11:29:30] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:29:38] (03PS3) 10Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) [11:30:03] jouncebot: nowandnext [11:30:03] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [11:30:03] In 1 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300) [11:30:37] !log jnuche@deploy1002 Installing scap version "4.27.1" for 553 hosts [11:39:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P35559 and previous config saved to /var/cache/conftool/dbconfig/20221019-113915-ladsgroup.json [11:41:03] (03PS8) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 [11:42:09] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. 
- https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) Since there seems to be consensus on everything but `mw-{app,main,web}`, I'll consider these other service names as valid going forward unless told othe... [11:42:48] (03CR) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [11:43:39] (03PS1) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [11:43:43] !log jnuche@deploy1002 Installing scap version "4.27.1" for 552 hosts [11:43:46] (03PS3) 10Matthias Mullie: [SearchVue] Enable extension on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) [11:43:59] !log jnuche@deploy1002 Installation of scap version "4.27.1" completed for 552 hosts [11:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P35560 and previous config saved to /var/cache/conftool/dbconfig/20221019-114431-ladsgroup.json [11:45:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37633/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [11:46:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Please test this first in toolsbeta with a livehack in the puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [11:47:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 1: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [11:47:26] (03PS2) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [11:48:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37634/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [11:49:58] (03CR) 10Volans: "my 2 cents inline" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [11:54:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35561 and previous config saved to /var/cache/conftool/dbconfig/20221019-115421-ladsgroup.json [11:54:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:54:27] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:54:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:54:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35562 and previous config saved to /var/cache/conftool/dbconfig/20221019-115443-ladsgroup.json [11:55:35] (03PS3) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [11:56:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37635/console" [puppet] - 
10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [11:59:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P35563 and previous config saved to /var/cache/conftool/dbconfig/20221019-115938-ladsgroup.json [12:00:27] (03PS4) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [12:01:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37636/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [12:04:35] (03Abandoned) 10Cparle: Alert for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [12:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35564 and previous config saved to /var/cache/conftool/dbconfig/20221019-121444-ladsgroup.json [12:14:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [12:14:50] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:15:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [12:15:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T318955)', diff saved to https://phabricator.wikimedia.org/P35565 and previous config saved to /var/cache/conftool/dbconfig/20221019-121506-ladsgroup.json [12:15:16] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37637/console" [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) 
(owner: 10Majavah) [12:19:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T314041)', diff saved to https://phabricator.wikimedia.org/P35566 and previous config saved to /var/cache/conftool/dbconfig/20221019-121939-ladsgroup.json [12:19:45] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:26:31] (03PS1) 10Jbond: wmflib::selector: dummy selector to query things in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/844464 [12:27:04] (03CR) 10CI reject: [V: 04-1] wmflib::selector: dummy selector to query things in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/844464 (owner: 10Jbond) [12:27:17] !log remove cr4-ulsfo SV8 RS sessions [12:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:43] (03CR) 10Klausman: [C: 03+1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [12:34:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P35567 and previous config saved to /var/cache/conftool/dbconfig/20221019-123446-ladsgroup.json [12:39:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318955)', diff saved to https://phabricator.wikimedia.org/P35568 and previous config saved to /var/cache/conftool/dbconfig/20221019-123946-ladsgroup.json [12:39:52] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:45:52] 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10Aklapper) [12:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P35569 
and previous config saved to /var/cache/conftool/dbconfig/20221019-124952-ladsgroup.json [12:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35570 and previous config saved to /var/cache/conftool/dbconfig/20221019-125101-ladsgroup.json [12:51:06] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:54:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P35571 and previous config saved to /var/cache/conftool/dbconfig/20221019-125452-ladsgroup.json [12:56:18] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10JMeybohm) p:05Medium→03Low [12:58:28] 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10LSobanski) @hnowlan Is this something that still needs to happen and if yes, who would own the next step? [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300). Please do the needful. [13:00:05] matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] o/ [13:00:21] I can’t deploy yet (maybe later if nobody else is available) [13:00:43] I have 3 patches. [13:00:48] The 1.40.0-wmf.6 backport doesn't need to be tested on mwdebug - it's specific to that branch & wikipedias, so there's no place to test yet. Happy to self-deploy this one. [13:01:01] The other 2 are about enabling a new extension on beta (+ extension-list inclusion) & don't directly impact prod, but IDK whether scap is needed in this case... :p Guidance on how to do these is much appreciated! [13:02:55] 10SRE, 10PyBal, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Cleanup pybal Prometheus metrics on monitor stop() - https://phabricator.wikimedia.org/T321191 (10fgiunchedi) [13:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T314041)', diff saved to https://phabricator.wikimedia.org/P35572 and previous config saved to /var/cache/conftool/dbconfig/20221019-130459-ladsgroup.json [13:05:04] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:05:06] 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10serviceops: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10LSobanski) [13:05:20] matthiasmullie: at least extension-list, CS.php and IS.php should still be synced I think [13:05:22] (03PS1) 10Filippo Giunchedi: Clean up monitor metrics on stop() [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/844469 (https://phabricator.wikimedia.org/T321191) [13:05:36] you can probably let `scap backport` do everything, and IIRC it’ll skip IS-labs.php on its own [13:05:46] feel free to self-service [13:05:51] (03PS17) 10Btullis: Add postgresql to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) [13:06:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35573 and previous config saved to /var/cache/conftool/dbconfig/20221019-130607-ladsgroup.json [13:06:37] Okay; I'll get started & scap them all until/unless someone stops me! [13:06:49] Thanks, Lucas_WMDE [13:07:17] (03CR) 10Btullis: "Adding Jaime to reviewers, particularly with reference to the new bacula client configuration." [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [13:07:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843913 (https://phabricator.wikimedia.org/T320337) (owner: 10Matthias Mullie) [13:08:50] 10SRE, 10serviceops, 10Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (10LSobanski) [13:09:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P35574 and previous config saved to /var/cache/conftool/dbconfig/20221019-130959-ladsgroup.json [13:11:42] 10SRE, 10API Platform, 10serviceops: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10LSobanski) [13:14:30] (03PS1) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) [13:14:40] 10SRE, 10Cloud-Services, 10Developer-Advocacy, 10Infrastructure-Foundations, 10LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10LSobanski) [13:15:05] (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [13:15:13] jouncebot: now [13:15:13] For the next 0 hour(s) and 44 minute(s): UTC afternoon 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300) [13:15:40] (03PS5) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [13:18:30] (03PS2) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) [13:19:05] (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [13:21:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35575 and previous config saved to /var/cache/conftool/dbconfig/20221019-132114-ladsgroup.json [13:25:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318955)', diff saved to https://phabricator.wikimedia.org/P35576 and previous config saved to /var/cache/conftool/dbconfig/20221019-132505-ladsgroup.json [13:25:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:25:11] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:25:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35577 and previous config saved to /var/cache/conftool/dbconfig/20221019-132527-ladsgroup.json [13:26:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:27:22] (03Merged) 10jenkins-bot: Add default value for search-thumbnail-extra-namespaces [core] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843913 (https://phabricator.wikimedia.org/T320337) (owner: 10Matthias Mullie) [13:27:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8220 [13:27:46] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:843913|Add default value for search-thumbnail-extra-namespaces (T320337)]] [13:27:51] T320337: [M] Special:Search results should have a way to suppress thumbnails - https://phabricator.wikimedia.org/T320337 [13:28:08] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:28:11] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:843913|Add default value for search-thumbnail-extra-namespaces (T320337)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:28:57] !log hashar@deploy1002 backport Cancelled [13:29:04] o/ [13:29:18] ok, I’m around now if needed :) [13:29:24] I did not even start that one! 
:D [13:30:10] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:30:17] (03PS1) 10Hashar: Downgrade lcobucci/jwt (4.2.1 => 4.1.5) [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) [13:30:42] (03CR) 10Jcrespo: "Please note there were some issues with postgres backups: https://phabricator.wikimedia.org/T316655 so we may need to redo them soon- but" [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [13:31:13] Lucas_WMDE: First patch syncing now; I can just go ahead and `scap backport` both of the other (beta, extension-list etc) patches myself (unless things fall apart :p), ok? [13:31:23] ok, sure! [13:31:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8220 [13:32:27] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:843913|Add default value for search-thumbnail-extra-namespaces (T320337)]] (duration: 04m 41s) [13:33:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [13:33:59] (03Merged) 10jenkins-bot: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [13:34:16] (03CR) 10FNegri: [C: 03+1] "I like this, and I think it can be merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [13:34:23] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:830874|Add SearchVue to extension-list and config var (T310367)]] [13:34:29] T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367 [13:36:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37642/console" [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:36:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35578 and previous config saved to /var/cache/conftool/dbconfig/20221019-133620-ladsgroup.json [13:36:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:36:26] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:36:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T318950)', diff saved to https://phabricator.wikimedia.org/P35579 and previous config saved to /var/cache/conftool/dbconfig/20221019-133642-ladsgroup.json [13:37:07] (03CR) 10Elukey: [C: 03+1] Provide the cluster_cidr to kube-proxy on masters as well [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:38:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318950)', diff saved to https://phabricator.wikimedia.org/P35580 and previous config saved to /var/cache/conftool/dbconfig/20221019-133852-ladsgroup.json [13:39:25] (03PS4) 10Lucas Werkmeister (WMDE): Add config for redirect badges on 
wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [13:40:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10herron) Yes the other kafka-logging hosts were switched to raid50 (hardware) to provide additional capacity vs raid10. It should appear to the OS... [13:40:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Rebased because Gerrit and local Git reported conflicts (even though the rebase itself worked without issue). Scheduled for next week; -2i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [13:40:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1001.eqiad.wmnet with OS bullseye [13:40:49] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37643/console" [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:41:40] (03CR) 10Elukey: [V: 03+1 C: 03+1] Provide the cluster_cidr to kube-proxy in wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:41:50] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:42:17] (03PS6) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [13:43:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37644/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [13:43:33] !log mlitn@deploy1002 
mlitn and mlitn: Backport for [[gerrit:830874|Add SearchVue to extension-list and config var (T310367)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:43:38] T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367 [13:44:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37645/console" [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:44:57] (03CR) 10Elukey: [V: 03+1 C: 03+1] Provide the cluster_cidr to kube-proxy in wikikube eqiad [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:46:04] (03PS7) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [13:46:21] (03CR) 10JMeybohm: [C: 03+2] Provide the cluster_cidr to kube-proxy on masters as well [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:46:38] (03CR) 10Filippo Giunchedi: "Generally LGTM! Nicely done." 
[alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [13:47:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37646/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [13:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35581 and previous config saved to /var/cache/conftool/dbconfig/20221019-134942-ladsgroup.json [13:49:46] (03CR) 10Elukey: Add known_uid_mapping support to the production-images for spark (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:49:48] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:50:55] (03CR) 10David Caro: [C: 03+2] "Yep, this was removed here I626d9c54c6abae9f20bae111c4eb7ac9194a223c and Ib9144f1871a099a019951efded2eec4879c0a3c3" [puppet] - 10https://gerrit.wikimedia.org/r/844455 (owner: 10Majavah) [13:52:26] (03PS8) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [13:52:54] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1001.eqiad.wmnet with reason: host reimage [13:53:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37647/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [13:53:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P35582 and previous config saved to /var/cache/conftool/dbconfig/20221019-135358-ladsgroup.json [13:54:28] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 
10Jbond) [13:55:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1001.eqiad.wmnet with reason: host reimage [13:56:13] (03PS1) 10Btullis: Add a default value of undefined for the docker uid hash [puppet] - 10https://gerrit.wikimedia.org/r/844484 (https://phabricator.wikimedia.org/T318730) [13:57:19] (03CR) 10Btullis: "This was changed as the result of a comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/844445/6/modules/profile/manifests/doc" [puppet] - 10https://gerrit.wikimedia.org/r/844484 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:57:37] 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add EChetty and to acl*phabricator group on phab - https://phabricator.wikimedia.org/T321197 (10EChetty) [13:58:02] 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add EChetty and to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321197 (10EChetty) [13:58:20] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add known_uid_mapping support to the production-images for spark (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:59:02] (03CR) 10Andrew Bogott: [C: 04-1] "Let's definitely skip Barbican for now (since it's only deployed in codfw1dev and has some security concerns in the current release.)" [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: 10Majavah) [13:59:08] 10SRE, 10SRE Observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (10lmata) [14:00:26] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:830874|Add SearchVue to extension-list and config var (T310367)]] (duration: 26m 03s) [14:00:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 
(https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [14:00:31] T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367 [14:00:40] 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add JArguello and to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321198 (10EChetty) [14:01:12] 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add EChetty to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321197 (10EChetty) [14:01:20] (03Merged) 10jenkins-bot: [SearchVue] Enable extension on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [14:01:40] 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add JArguello to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321198 (10EChetty) [14:02:01] !log UTC afternoon backports done [14:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:18] 10SRE, 10Infrastructure-Foundations, 10netops: Ramp up SV1 IXP - https://phabricator.wikimedia.org/T321193 (10ayounsi) [14:03:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson many thanks indeed, that's great. 
:+1: [14:03:29] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 90 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P35583 and previous config saved to /var/cache/conftool/dbconfig/20221019-140449-ladsgroup.json [14:05:07] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:43] (03CR) 10Hashar: [C: 03+2] Downgrade lcobucci/jwt (4.2.1 => 4.1.5) [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar) [14:06:50] I am going to deploy the vendor hotfix https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/844035 [14:06:56] well once CI has merged it :D [14:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P35584 and previous config saved to /var/cache/conftool/dbconfig/20221019-140905-ladsgroup.json [14:10:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-db1001.eqiad.wmnet with OS bullseye [14:10:17] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:26] (03PS1) 10Matthias Mullie: Fix value for wgQuickViewMediaRepositorySearchUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844485 [14:12:58] (03CR) 10Jbond: [C: 03+1] "mostly fine but a few nits" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [14:17:33] (03Abandoned) 10Jbond: 
wmflib::selector: dummy selector to query things in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/844464 (owner: 10Jbond) [14:18:25] (03PS3) 10Andrew Bogott: Allow cloud_provider_enabled [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) (owner: 10Vivian Rook) [14:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P35585 and previous config saved to /var/cache/conftool/dbconfig/20221019-141955-ladsgroup.json [14:20:05] (03CR) 10Andrew Bogott: [C: 03+2] Allow cloud_provider_enabled [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) (owner: 10Vivian Rook) [14:23:41] (03CR) 10Klausman: [C: 03+1] "The psql/service side LGTM. I'll let Jaime comment on the backup bits." [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:23:47] (03Merged) 10jenkins-bot: Downgrade lcobucci/jwt (4.2.1 => 4.1.5) [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar) [14:24:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318950)', diff saved to https://phabricator.wikimedia.org/P35586 and previous config saved to /var/cache/conftool/dbconfig/20221019-142411-ladsgroup.json [14:24:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [14:24:17] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:24:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [14:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T318950)', diff saved to https://phabricator.wikimedia.org/P35587 and previous 
config saved to /var/cache/conftool/dbconfig/20221019-142433-ladsgroup.json [14:25:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar) [14:25:32] !log hashar@deploy1002 Started scap: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] [14:25:36] T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160 [14:25:58] !log hashar@deploy1002 hashar and hashar: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318950)', diff saved to https://phabricator.wikimedia.org/P35588 and previous config saved to /var/cache/conftool/dbconfig/20221019-142643-ladsgroup.json [14:29:00] !log hashar@deploy1002 sync-world aborted: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] (duration: 03m 27s) [14:29:00] !log hashar@deploy1002 backport aborted: (duration: 05m 09s) [14:29:09] grbmbmbl [14:29:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar) [14:29:35] !log hashar@deploy1002 Started scap: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] [14:29:58] !log hashar@deploy1002 hashar and hashar: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:30:18] What's up hashar? 
[14:32:03] dancy: nothing, I pressed enter or Ctrl+C instead of y when validating to continue with sync [14:32:15] oops! [14:32:39] (03PS1) 10Clément Goubert: admin: Add mw-debug namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) [14:32:59] I have also learned a lot about install-world this morning thanks to jnuche ;-] [14:33:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 701 [14:33:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 701 [14:33:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1239 [14:34:00] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] (duration: 04m 25s) [14:34:06] T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160 [14:34:35] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 1239 [14:34:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2516 [14:34:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2516 [14:34:53] hashar :) [14:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35589 and previous config saved to /var/cache/conftool/dbconfig/20221019-143501-ladsgroup.json [14:35:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:35:07] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:35:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1130.eqiad.wmnet 
with reason: Maintenance [14:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T318955)', diff saved to https://phabricator.wikimedia.org/P35590 and previous config saved to /var/cache/conftool/dbconfig/20221019-143523-ladsgroup.json [14:35:53] (03CR) 10Volans: [C: 04-1] "LGTM as approach, two issues inline" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [14:36:56] (03CR) 10Hnowlan: helmfile.d: add thumbor configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:37:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T318955)', diff saved to https://phabricator.wikimedia.org/P35591 and previous config saved to /var/cache/conftool/dbconfig/20221019-143736-ladsgroup.json [14:39:15] (03PS9) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [14:39:29] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [14:40:29] there are a few more error logs, I guess I will file task for them later this evening [14:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P35592 and previous config saved to /var/cache/conftool/dbconfig/20221019-144150-ladsgroup.json [14:43:58] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [14:49:23] jouncebot: nowandnext [14:49:23] No deployments scheduled for the next 3 hour(s) and 10 minute(s) [14:49:23] In 3 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800) [14:49:23] In 3 hour(s) and 10 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800) 
[14:50:19] !log jnuche@deploy1002 Installing scap version "4.27.1" for 1 hosts [14:50:25] !log jnuche@deploy1002 Installation of scap version "4.27.1" completed for 1 hosts [14:50:39] (03PS1) 10Clément Goubert: hieradata: Add usernames for mw-debug k8s service [puppet] - 10https://gerrit.wikimedia.org/r/844491 (https://phabricator.wikimedia.org/T321201) [14:52:34] RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [14:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P35593 and previous config saved to /var/cache/conftool/dbconfig/20221019-145242-ladsgroup.json [14:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P35594 and previous config saved to /var/cache/conftool/dbconfig/20221019-145658-ladsgroup.json [14:57:40] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1001.eqiad.wmnet [14:57:41] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [14:59:10] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844020 [14:59:44] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [14:59:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [15:00:06] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [15:00:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [15:00:33] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [15:05:09] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-10-18-161910-production [puppet] - 10https://gerrit.wikimedia.org/r/844063 (https://phabricator.wikimedia.org/T316991) (owner: 10BryanDavis) 
[15:05:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:07:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P35595 and previous config saved to /var/cache/conftool/dbconfig/20221019-150749-ladsgroup.json [15:08:06] !log Forcing puppet runs on cloudweb100[34] to deploy a new version of Striker [15:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10elukey) Any news? :) [15:10:58] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318950)', diff saved to https://phabricator.wikimedia.org/P35596 and previous config saved to /var/cache/conftool/dbconfig/20221019-151204-ladsgroup.json [15:12:09] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:22:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T318955)', diff saved to https://phabricator.wikimedia.org/P35597 and previous config saved to /var/cache/conftool/dbconfig/20221019-152256-ladsgroup.json [15:22:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:23:03] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - 
https://phabricator.wikimedia.org/T318955 [15:23:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:23:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35598 and previous config saved to /var/cache/conftool/dbconfig/20221019-152318-ladsgroup.json [15:28:08] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:08] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1001.eqiad.wmnet on all recursors [15:28:11] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1001.eqiad.wmnet on all recursors [15:31:12] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) [15:41:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye [15:41:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye [15:45:10] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) p:05Triage→03Medium [15:46:18] 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10serviceops: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10hnowlan) [15:46:24] (03CR) 10Btullis: Add postgresql to an-db100[1-2] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [15:47:15] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - 
https://phabricator.wikimedia.org/T320384 (10herron) a:03Arian_Bozorg Hi @Arian_Bozorg, following the instructions in https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#wmde_access could you please coordinate obtaini... [15:48:00] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10WMDE-leszek) @herron I approve the request on WMDE's behalf. Thanks [15:49:05] (03PS3) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) [15:49:38] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) a:05Arian_Bozorg→03None >>! In T320384#8329638, @WMDE-leszek wrote: > @herron I approve the request on WMDE's behalf. Thanks That was quick! Thank you :) [15:49:42] (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [15:50:00] 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) I think it's still a relevant question, if we can get to this work before RESTbase deprecation takes hold. Rather than open the port I think it makes sense to test disabling the service.... 
[15:50:12] 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) a:03hnowlan [15:50:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35599 and previous config saved to /var/cache/conftool/dbconfig/20221019-155015-ladsgroup.json [15:50:20] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:51:31] (03PS1) 10Herron: admin: add arbo to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384) [15:51:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:45] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1001.eqiad.wmnet [15:51:56] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1004.eqiad.wmnet with OS bullseye [15:52:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye exe... 
[15:52:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye [15:52:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye [15:53:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:47] (03PS4) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) [15:53:56] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:54:48] (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [15:57:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:58:48] (03PS5) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) [15:58:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.783 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:59:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:01:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1002.eqiad.wmnet with OS bullseye [16:02:25] (03CR) 10Dzahn: [C: 03+1] "lgtm, you should add them to 2 groups, 'nda' and 'wmde'" [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384) (owner: 10Herron) [16:03:24] (03CR) 10Herron: [C: 03+2] admin: add arbo to ldap only users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384) (owner: 10Herron) [16:03:54] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:04:20] (03PS3) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [16:04:22] (03PS2) 10Herron: admin: add arbo to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384) [16:04:24] (03CR) 10Elukey: [C: 03+1] Add a default value of undefined for the docker uid hash [puppet] - 10https://gerrit.wikimedia.org/r/844484 
(https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P35600 and previous config saved to /var/cache/conftool/dbconfig/20221019-160521-ladsgroup.json [16:06:54] (03PS1) 10Elukey: WIP - coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) [16:07:35] (03CR) 10Dzahn: "Has it been cleared up if shell access is needed? A comment on the ticket says it's not if this is only for superset. I would suggest to d" [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron) [16:08:40] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1004.eqiad.wmnet with OS bullseye [16:08:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye exe... 
[16:08:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye [16:09:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye [16:09:36] (03CR) 10Elukey: "After running docker-pkg locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [16:11:14] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) [16:13:37] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) 05Open→03Resolved a:03herron Group membership has been granted. Transitioning this to resolved now, please reopen if any follow up is needed. Thanks! 
[16:14:22] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1002.eqiad.wmnet with reason: host reimage [16:15:10] (03PS2) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082) [16:15:34] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm) [16:17:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1002.eqiad.wmnet with reason: host reimage [16:19:14] (03PS4) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [16:19:46] !log wikitech - added herron to 'content administrators' [16:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:15] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm) [16:20:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P35601 and previous config saved to /var/cache/conftool/dbconfig/20221019-162028-ladsgroup.json [16:20:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1004.eqiad.wmnet with reason: host reimage [16:21:26] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [16:22:01] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [16:22:58] (03PS1) 10Urbanecm: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844038 (https://phabricator.wikimedia.org/T321082) [16:23:06] (03CR) 
10Urbanecm: [C: 03+2] Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844038 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm) [16:23:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1004.eqiad.wmnet with reason: host reimage [16:24:16] (03PS2) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) [16:25:50] (03PS1) 10Cwhite: opensearch: upgrade curator to 5.8.5-1~wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/844023 (https://phabricator.wikimedia.org/T304440) [16:26:35] (03Merged) 10jenkins-bot: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844038 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm) [16:27:15] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [16:27:24] (03CR) 10Herron: admin: add damilare to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron) [16:27:32] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [16:27:36] (03PS3) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) [16:30:35] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068) (owner: 10Herron) [16:31:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-db1002.eqiad.wmnet with OS bullseye [16:31:57] (03PS2) 10Herron: admin: add ssh key for hshaikh [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068) [16:32:04] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron) [16:32:06] (03PS1) 10JHathaway: aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 [16:32:53] (03CR) 10CI reject: [V: 04-1] aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 (owner: 10JHathaway) [16:34:55] (03CR) 10Herron: [C: 03+1] admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn) [16:34:57] (03CR) 10Cwhite: [C: 03+2] "PCC OK: https://puppet-compiler.wmflabs.org/pcc-worker1003/37651/" [puppet] - 10https://gerrit.wikimedia.org/r/844023 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [16:35:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35602 and previous config saved to /var/cache/conftool/dbconfig/20221019-163534-ladsgroup.json [16:35:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:35:39] (03CR) 10Herron: [C: 03+1] admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn) [16:35:40] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:35:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:36:45] (03CR) 10Herron: [C: 03+2] admin: add ssh key for hshaikh [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068) (owner: 10Herron) [16:37:21] (03PS4) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 
(https://phabricator.wikimedia.org/T319057) [16:37:30] (03PS2) 10JHathaway: aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 (https://phabricator.wikimedia.org/T321134) [16:38:32] (03CR) 10Herron: [C: 03+2] admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron) [16:39:25] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:09] PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:34] (03CR) 10Dzahn: [C: 03+2] admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn) [16:40:40] (03CR) 10Dzahn: [C: 03+2] admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn) [16:40:45] (03PS3) 10Dzahn: admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) [16:42:06] (03PS1) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) [16:42:11] (03PS2) 10Dzahn: admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937) [16:43:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1004.eqiad.wmnet with OS bullseye [16:43:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - 
https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye com... [16:43:29] (03CR) 10JHathaway: [C: 03+2] aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [16:43:49] (03PS2) 10SBassett: Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) [16:44:49] (03PS2) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) [16:45:14] (03PS6) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) [16:45:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10herron) 05Open→03Resolved a:03herron The requested group membership has been granted and will propagate fully within the next 30 minute... 
[16:45:49] (03CR) 10SBassett: Add registry.gitlab.com/security-products/**/* as allowed images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett) [16:45:55] (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [16:46:19] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37655/console" [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [16:46:31] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for GFontenelle - https://phabricator.wikimedia.org/T321218 (10GFontenelle_WMF) [16:47:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10herron) 05Open→03Resolved a:03herron The requested access has been granted and will propagate fully within the next 30 minutes. I'll transition thi... [16:48:28] (03PS3) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) [16:49:25] jhathaway: I typed "multiple", so your change is merged [16:49:44] oooh, thanks! [16:56:58] (03CR) 10Hashar: "I have to cherry-pick that on the devtools WMCS project to exercise it." 
[software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [16:59:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:59:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:59:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:59:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:00:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T318955)', diff saved to https://phabricator.wikimedia.org/P35603 and previous config saved to /var/cache/conftool/dbconfig/20221019-170002-ladsgroup.json [17:00:07] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:02:07] (03PS4) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) [17:03:48] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:03:55] (03CR) 10Andrew Bogott: "pcc: https://puppet-compiler.wmflabs.org/pcc-worker1003/37658/" [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [17:04:14] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844024 [17:07:22] (03CR) 10Dzahn: [C: 03+1] "lgtm, https://puppet-compiler.wmflabs.org/pcc-worker1002/37657/gitlab-runner2004.codfw.wmnet/index.html ( I don't 
think we can expect to s" [puppet] - 10https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: 10Jelto) [17:08:24] (03PS1) 10David Caro: dnsrecursor: fixes to the trigger script and some logs [puppet] - 10https://gerrit.wikimedia.org/r/844507 [17:13:01] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: fixes to the trigger script and some logs [puppet] - 10https://gerrit.wikimedia.org/r/844507 (owner: 10David Caro) [17:15:35] (03PS1) 10David Caro: dnsrecursor: ignore projects outside the default domain [puppet] - 10https://gerrit.wikimedia.org/r/844508 [17:15:46] ("wrong" channel, I know, but...) Hello! I'd love to get https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/844083 (some prep work for T310974) backported and live on en.wiki in the backport window later — would anyone mind giving it a look-over and perhaps +2ing? It is a small patch which adds a `StatsdDataFactory() ->increment` if a page is marked as `NOINDEX` by the PageTriage extension :) [17:15:46] T310974: Extend PageTriageMaxAge for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974 [17:17:06] (03CR) 10Andrew Bogott: [C: 03+1] "I think this is the right solution for now. We aren't immediately planning to have magnum VMs get public IPs so it shouldn't matter." 
[puppet] - 10https://gerrit.wikimedia.org/r/844508 (owner: 10David Caro) [17:18:00] (03CR) 10David Caro: [C: 03+2] dnsrecursor: ignore projects outside the default domain [puppet] - 10https://gerrit.wikimedia.org/r/844508 (owner: 10David Caro) [17:18:02] (03CR) 10David Caro: [C: 03+2] dnsrecursor: fixes to the trigger script and some logs [puppet] - 10https://gerrit.wikimedia.org/r/844507 (owner: 10David Caro) [17:21:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Cmjohnson) [17:24:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318955)', diff saved to https://phabricator.wikimedia.org/P35604 and previous config saved to /var/cache/conftool/dbconfig/20221019-172438-ladsgroup.json [17:24:44] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:33:34] (03CR) 10BryanDavis: [C: 03+1] "I have 3 blubber builds based off of this base which all install python3-pip with a comment that says "# FIXME: should be in the base imag" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844319 (owner: 10Giuseppe Lavagetto) [17:39:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P35605 and previous config saved to /var/cache/conftool/dbconfig/20221019-173945-ladsgroup.json [17:40:20] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:41:04] RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:51] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:49:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [17:54:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [17:54:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P35606 and previous config saved to /var/cache/conftool/dbconfig/20221019-175451-ladsgroup.json [17:54:53] (03PS1) 10Samtar: Hooks: Log to statsd when a page is noindex'd [extensions/PageTriage] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/844040 (https://phabricator.wikimedia.org/T310974) [17:57:19] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:00:04] hashar and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800). [18:00:04] hashar and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800). 
[18:01:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:03:44] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:07:33] (03PS1) 10BryanDavis: mono68: Remove expired DST Root CA X3 cert [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/844512 (https://phabricator.wikimedia.org/T311466) [18:07:57] !log aphlict1001 - manually gzip large logfile, logrotate did not run for a day - T321209 [18:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318955)', diff saved to https://phabricator.wikimedia.org/P35607 and previous config saved to /var/cache/conftool/dbconfig/20221019-180958-ladsgroup.json [18:10:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [18:10:03] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:10:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [18:10:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T318955)', diff saved to https://phabricator.wikimedia.org/P35608 and previous config saved to /var/cache/conftool/dbconfig/20221019-181019-ladsgroup.json [18:11:12] (03CR) 10BryanDavis: "Test out a local build of the container with the test program from https://phabricator.wikimedia.org/T292289#7468035. 
That looks something" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/844512 (https://phabricator.wikimedia.org/T311466) (owner: 10BryanDavis) [18:12:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318955)', diff saved to https://phabricator.wikimedia.org/P35609 and previous config saved to /var/cache/conftool/dbconfig/20221019-181232-ladsgroup.json [18:27:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P35610 and previous config saved to /var/cache/conftool/dbconfig/20221019-182739-ladsgroup.json [18:42:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P35611 and previous config saved to /var/cache/conftool/dbconfig/20221019-184245-ladsgroup.json [18:44:13] (03PS1) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - 10https://gerrit.wikimedia.org/r/844513 [18:50:33] (03PS1) 10Dzahn: Revert "Revert "conftool-data: remove phabricator / git-ssh"" [puppet] - 10https://gerrit.wikimedia.org/r/844041 [18:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:55:47] (03CR) 10Ahmon Dancy: docker_registry_ha: Require JWT to have ref_protected claim set to true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [18:57:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318955)', diff saved to https://phabricator.wikimedia.org/P35612 and previous config saved to 
/var/cache/conftool/dbconfig/20221019-185752-ladsgroup.json [18:57:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:57:58] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:58:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T318955)', diff saved to https://phabricator.wikimedia.org/P35613 and previous config saved to /var/cache/conftool/dbconfig/20221019-185813-ladsgroup.json [19:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318955)', diff saved to https://phabricator.wikimedia.org/P35614 and previous config saved to /var/cache/conftool/dbconfig/20221019-190026-ladsgroup.json [19:08:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:09:34] (03PS1) 10JHathaway: aux-k8s-etcd: partman cfg for aux-k8s-etcd [puppet] - 10https://gerrit.wikimedia.org/r/844514 (https://phabricator.wikimedia.org/T321134) [19:10:29] (03CR) 10JHathaway: [C: 03+2] aux-k8s-etcd: partman cfg for aux-k8s-etcd [puppet] - 10https://gerrit.wikimedia.org/r/844514 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [19:10:52] (03PS1) 10Dzahn: devtools: add profile::mediawiki::scap_client::is_master: true [puppet] - 10https://gerrit.wikimedia.org/r/844515 [19:11:31] (03CR) 10Dzahn: [C: 03+2] devtools: add profile::mediawiki::scap_client::is_master: true [puppet] - 10https://gerrit.wikimedia.org/r/844515 (owner: 10Dzahn) [19:13:45] (JobUnavailable) resolved: Reduced availability for job 
pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:15:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P35615 and previous config saved to /var/cache/conftool/dbconfig/20221019-191533-ladsgroup.json [19:16:22] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:16:52] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:17:37] (03CR) 10Dzahn: [C: 03+2] "did not work. either changes are not pulled to local puppetmaster or for another reason this causes a duplicate declaration. added back in" [puppet] - 10https://gerrit.wikimedia.org/r/844515 (owner: 10Dzahn) [19:17:44] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:22:25] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Ottomata) Haroon also needs kerberos access for this. I just created a principal for him. @HShaikh check your email and look for instructions. 
[19:23:19] (03PS1) 10Ottomata: hshaikh - set krb: present [puppet] - 10https://gerrit.wikimedia.org/r/844516 (https://phabricator.wikimedia.org/T321068) [19:26:20] (03CR) 10Ottomata: [C: 03+2] hshaikh - set krb: present [puppet] - 10https://gerrit.wikimedia.org/r/844516 (https://phabricator.wikimedia.org/T321068) (owner: 10Ottomata) [19:30:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P35616 and previous config saved to /var/cache/conftool/dbconfig/20221019-193039-ladsgroup.json [19:45:23] 10ops-codfw, 10DC-Ops: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T321237 (10RKemper) [19:45:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318955)', diff saved to https://phabricator.wikimedia.org/P35617 and previous config saved to /var/cache/conftool/dbconfig/20221019-194546-ladsgroup.json [19:45:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:45:51] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [19:46:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:46:44] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T321237 (10RKemper) [19:49:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:00:05] RoanKattouw, 
Urbanecm, cjming, and TheresNoTime: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T2000). [20:00:05] TheresNoTime: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] * TheresNoTime will self-deploy :D [20:00:22] TheresNoTime: i guess you'll self-service? [20:00:31] indeed [20:00:46] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T321237 (10RKemper) [20:01:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/844040 (https://phabricator.wikimedia.org/T310974) (owner: 10Samtar) [20:01:58] (03CR) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [20:03:35] (03PS2) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - 10https://gerrit.wikimedia.org/r/844513 [20:04:16] (03CR) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [20:07:23] (03Merged) 10jenkins-bot: Hooks: Log to statsd when a page is noindex'd [extensions/PageTriage] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/844040 (https://phabricator.wikimedia.org/T310974) (owner: 10Samtar) [20:07:54] !log samtar@deploy1002 Started scap: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]] [20:07:59] T310974: Extend PageTriageMaxAge for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974 [20:08:20] !log samtar@deploy1002 samtar and samtar: 
Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:08:50] (03PS5) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [20:08:54] * TheresNoTime is testing [20:09:21] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T320817 (10wiki_willy) a:03Papaul [20:10:55] woo, works [20:11:46] (03CR) 10BBlack: [C: 03+1] Revert "Revert "conftool-data: remove phabricator / git-ssh"" [puppet] - 10https://gerrit.wikimedia.org/r/844041 (owner: 10Dzahn) [20:12:14] (03CR) 10BBlack: [C: 03+1] remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:14:51] (03CR) 10Dzahn: [C: 03+2] remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:15:07] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]] (duration: 07m 12s) [20:15:17] !log git-ssh service is being decom'ed - expect some temp pybal alerts [20:17:41] * TheresNoTime has finished their patch, will be around for a while if there's anything needing deploying [20:18:24] (03PS1) 10DDesouza: Broaden audience of Research Incentive Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333) [20:18:38] hi [20:19:05] TheresNoTime: can you merge a beta change? 
will add it to deployment page [20:19:12] danisztls: sure :) [20:19:33] (03PS2) 10DDesouza: Broaden audience of Research Incentive Survey on enwiki [beta] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333) [20:21:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333) (owner: 10DDesouza) [20:21:12] TheresNoTime: done [20:21:36] it's 844522 [20:21:47] merging now :) [20:22:00] (03Merged) 10jenkins-bot: Broaden audience of Research Incentive Survey on enwiki [beta] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333) (owner: 10DDesouza) [20:22:06] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:22:06] TheresNoTime: thank you! 
[20:23:19] danisztls: done :) and a `beta-code-update-eqiad` has just started so it should be live on beta in a few minutes [20:24:14] (03PS1) 10Andrew Bogott: Keystone policy.yaml: allow anyone to get project info [puppet] - 10https://gerrit.wikimedia.org/r/844524 [20:26:32] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:26:54] (03PS6) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [20:27:06] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:27:34] !log lvs2010 - restarted pybal, removed git-ssh IP with ipvsadm [20:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:20] (03CR) 10Ahmon Dancy: [C: 03+1] docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [20:30:07] (03PS1) 10Andrew Bogott: mwopenstackclients: make somewhat domain-aware [puppet] - 10https://gerrit.wikimedia.org/r/844525 [20:30:22] RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:30:23] !lvs2008 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.153.250:22' ; ipvsadm -Dt '[2620:0:860:ed1a::3:fa]:22' - T296022 [20:30:24] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [20:32:06] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:32:55] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: make somewhat domain-aware [puppet] - 10https://gerrit.wikimedia.org/r/844525 (owner: 10Andrew Bogott) [20:33:07] (03CR) 10Andrew Bogott: [C: 03+2] Keystone policy.yaml: allow anyone to get project info [puppet] - 10https://gerrit.wikimedia.org/r/844524 (owner: 10Andrew Bogott) [20:33:11] !log closing UTC late backport window [20:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:18] !lvs1020, lvs1018 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.154.250:22' ; ipvsadm -Dt '[2620:0:861:ed1a::3:16]:22' - T296022 [20:36:40] (03PS1) 10Andrew Bogott: mwopenstackclient: remove stray print() [puppet] - 10https://gerrit.wikimedia.org/r/844526 [20:37:22] (03PS7) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [20:37:29] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclient: remove stray print() [puppet] - 10https://gerrit.wikimedia.org/r/844526 (owner: 10Andrew Bogott) [20:38:24] !log puppetmaster1001/puppetmaster2001 - delete .git-*.err files in /var/run/confd-template T296022 [20:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:29] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [20:42:06] (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:42:30] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:06] jinxer-wm: :) [20:43:48] 10SRE, 
10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Jclark-ctr) Drive will arrive tomorrow. Can it be swapped when it arrives or will it need to be scheduled? [20:44:55] !log lvs1020, lvs1018 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.154.250:22' ; ipvsadm -Dt '[2620:0:861:ed1a::3:16]:22' - T296022 [20:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:01] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [20:45:59] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10BCornwall) 05Open→03Stalled @Vgutierrez Now that HAProxy is used for TLS termination, can this safely be closed? [20:46:07] 10SRE, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10BCornwall) [20:49:35] !log lvs2010, lvs2008 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.153.250:22' ; ipvsadm -Dt '[2620:0:860:ed1a::3:fa]:22' - T296022 [20:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:03] 10SRE, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10BCornwall) [20:52:18] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10BCornwall) 05Stalled→03Invalid Closing as per bblack's recommendation. 
[20:54:51] (03PS1) 10JHathaway: add srv record for aux-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134) [20:55:43] (03CR) 10CI reject: [V: 04-1] add srv record for aux-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [21:03:34] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1002.eqiad.wmnet [21:03:43] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [21:04:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:04:54] (03PS8) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [21:06:45] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:06:45] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1002.eqiad.wmnet on all recursors [21:06:48] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1002.eqiad.wmnet on all recursors [21:11:40] 10SRE-swift-storage, 10Community-Tech, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10Eevans) To summarize what was discussed during a Data Persistence meeting earlier today (in no particular order): * The total (ex... 
[21:23:53] 10ops-codfw, 10decommission-hardware, 10Discovery-Search (Current work): decommission elastic20[25-36].codfw.wmnet - https://phabricator.wikimedia.org/T321243 (10bking) [21:24:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:30:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1002.eqiad.wmnet [21:30:38] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1003.eqiad.wmnet [21:30:39] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [21:36:56] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:36:56] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1003.eqiad.wmnet on all recursors [21:36:59] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1003.eqiad.wmnet on all recursors [22:00:36] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1003.eqiad.wmnet [22:06:50] (03PS1) 10JHathaway: aux-k8s-etcd: mac addresses for aux-k8s-etcd100{2,3} [puppet] - 10https://gerrit.wikimedia.org/r/844535 (https://phabricator.wikimedia.org/T321134) [22:08:58] (03CR) 10JHathaway: [C: 03+2] aux-k8s-etcd: mac addresses for aux-k8s-etcd100{2,3} [puppet] - 10https://gerrit.wikimedia.org/r/844535 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [22:18:19] (03PS2) 10JHathaway: add srv record for aux-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134) [22:21:13] (03CR) 10JHathaway: [C: 03+2] add srv record for aux-k8s-etcd cluster [dns] - 
10https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [22:26:09] (03CR) 10Dzahn: [C: 03+1] "meanwhile also phab1004 is configured to only have read-only DB access. now we must explicitly switch the active phab server in Hiera to m" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/832282 (https://phabricator.wikimedia.org/T313954) (owner: 10Dduvall) [22:44:26] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:49:45] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T320817 (10Papaul) 05Open→03Resolved There is nothing connected to port xe-7/0/9 on fpc7 row B [23:10:11] (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)