[00:11:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[00:14:28] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[00:17:56] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:42] <icinga-wm>	 RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:29:59] <wikibugs>	 (PS3) Stang: Fix broken wordmarks in Bengali projects [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124)
[00:30:19] <wikibugs>	 (CR) Stang: Fix broken wordmarks in Bengali projects (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: Stang)
[00:34:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1007.eqiad.wmnet with OS bullseye
[00:34:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye co...
[00:34:45] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1009.eqiad.wmnet with OS bullseye
[00:34:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye
[00:43:20] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage
[00:50:01] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage
[00:53:29] <wikibugs>	 (PS2) Jdlrobson: [WIP] Logos: yaml can be populated by buildLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/843995
[00:53:40] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Logos: yaml can be populated by buildLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/843995 (owner: Jdlrobson)
[00:54:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:54:35] <wikibugs>	 (CR) Jdlrobson: "This modifies logos/config.yaml with the new logos we've created. We either need to support local_wordmark / local_tagline directives in t" [mediawiki-config] - https://gerrit.wikimedia.org/r/843995 (owner: Jdlrobson)
[01:01:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:18] <wikibugs>	 (CR) Legoktm: [C: +1] mariadb: new image for mariadb/mysql backups (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: BryanDavis)
[01:04:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1009.eqiad.wmnet with OS bullseye
[01:04:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye co...
[01:05:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1011.eqiad.wmnet with OS bullseye
[01:05:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye
[01:05:38] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:09:06] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 52 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:09:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:09:16] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 157 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:11:00] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:34] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:14:40] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 14 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:14:52] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 54 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:16:59] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[01:18:50] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:20:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[01:35:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1011.eqiad.wmnet with OS bullseye
[01:35:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye co...
[01:36:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) Open→Resolved @btullis these have been fixed, I updated the nic firmware and re-ran the image script.
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:51] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T317804 (Cmjohnson) Open→Resolved these are updated with kafka-stretch1001 and 1002
[01:41:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Cmjohnson) @herron for raid setup, are all the disk raid 50?   I do not think that the OS will install with that setup?  There are 8 750GB SSDs   P...
[01:59:49] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:00:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:29] <icinga-wm>	 PROBLEM - MegaRAID on db2139 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:03:31] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db2139 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T321147 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:03:36] <wikibugs>	 SRE, ops-codfw: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (ops-monitoring-bot)
[02:06:17] <wikibugs>	 (CR) Aftab: [C: +1] Fix broken wordmarks in Bengali projects [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: Stang)
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:45] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Papaul) @Jclark-ctr @ayounsi is still looking into how he can  prioritize T304677: Possible DHCP improvments
[03:05:52] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "Thanks for splitting this out! Let's merge this part tomorrow when we're both working." [puppet] - https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[03:28:47] <icinga-wm>	 PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:33] <icinga-wm>	 PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:31] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:05] <icinga-wm>	 RECOVERY - MegaRAID on db2139 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:38:44] <wikibugs>	 (PS2) KartikMistry: Enable specialcontribute campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306)
[04:54:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:54:43] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:30:43] <icinga-wm>	 RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:33:21] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2022-10-18-161640-production [deployment-charts] - https://gerrit.wikimedia.org/r/842812 (https://phabricator.wikimedia.org/T317224)
[05:34:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:40:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "recheck" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/826859 (https://phabricator.wikimedia.org/T316347) (owner: JMeybohm)
[05:45:35] * kart_ updating cxserver..
[05:49:06] <kart_>	 ah. No. I'll wait for some time. Need to check something before that..
[05:58:31] <icinga-wm>	 PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:07:23] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add pip to python3-bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/844319
[06:40:36] <XioNoX>	 !log enabled graceful-shutdown on drmrs Arelion BGP
[06:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:07] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T0700).
[07:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] <urbanecm>	 'morning
[07:00:38] <kart_>	 Moin
[07:00:57] <urbanecm>	 kart_: do you want to self-deploy, or should i deploy for you?
[07:00:58] <kart_>	 urbanecm: I'll go ahead with my patch..
[07:01:07] <urbanecm>	 👍
[07:01:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306) (owner: KartikMistry)
[07:02:35] <wikibugs>	 (Merged) jenkins-bot: Enable specialcontribute campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306) (owner: KartikMistry)
[07:03:07] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:843003|Enable specialcontribute campaign (T319306)]]
[07:03:13] <stashbot>	 T319306: Adjust visibility for the option to translate in the Persistent Contribution entry point - https://phabricator.wikimedia.org/T319306
[07:03:35] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:843003|Enable specialcontribute campaign (T319306)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:06:16] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:09:19] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:843003|Enable specialcontribute campaign (T319306)]] (duration: 06m 11s)
[07:09:24] <stashbot>	 T319306: Adjust visibility for the option to translate in the Persistent Contribution entry point - https://phabricator.wikimedia.org/T319306
[07:10:11] <kart_>	 urbanecm: I'm done.
[07:10:55] <urbanecm>	 Ack!
[07:12:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:24:51] <wikibugs>	 (PS2) Urbanecm: Remove GEHomepageImpactModuleEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[07:27:56] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 101 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:33:12] <wikibugs>	 (PS1) Urbanecm: [growth] Turn mentorship off by default [mediawiki-config] - https://gerrit.wikimedia.org/r/844428 (https://phabricator.wikimedia.org/T321056)
[07:33:36] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[07:34:22] <wikibugs>	 (Merged) jenkins-bot: Remove GEHomepageImpactModuleEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[07:34:43] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:837114|Remove GEHomepageImpactModuleEnabled]]
[07:35:07] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and kharlan: Backport for [[gerrit:837114|Remove GEHomepageImpactModuleEnabled]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:39:11] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:837114|Remove GEHomepageImpactModuleEnabled]] (duration: 04m 27s)
[07:39:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/844428 (https://phabricator.wikimedia.org/T321056) (owner: Urbanecm)
[07:40:18] <wikibugs>	 (Merged) jenkins-bot: [growth] Turn mentorship off by default [mediawiki-config] - https://gerrit.wikimedia.org/r/844428 (https://phabricator.wikimedia.org/T321056) (owner: Urbanecm)
[07:40:40] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:844428|[growth] Turn mentorship off by default (T321056)]]
[07:40:45] <stashbot>	 T321056: Turn mentorship off on all wikis where the community has not yet setup a mentorship list - https://phabricator.wikimedia.org/T321056
[07:41:03] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:844428|[growth] Turn mentorship off by default (T321056)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:45:54] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:844428|[growth] Turn mentorship off by default (T321056)]] (duration: 05m 14s)
[07:46:04] <stashbot>	 T321056: Turn mentorship off on all wikis where the community has not yet setup a mentorship list - https://phabricator.wikimedia.org/T321056
[07:49:59] <wikibugs>	 (PS1) Filippo Giunchedi: sre: remove 'host' label from PybalBackendDown [alerts] - https://gerrit.wikimedia.org/r/844429 (https://phabricator.wikimedia.org/T320627)
[07:57:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: remove 'host' label from PybalBackendDown [alerts] - https://gerrit.wikimedia.org/r/844429 (https://phabricator.wikimedia.org/T320627) (owner: Filippo Giunchedi)
[07:57:52] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 89 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:00:04] <jouncebot>	 hashar and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T0800)
[08:04:48] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:04:48] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:04:48] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:04:48] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:04:48] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T321157 - The acknowledgement expires at: 2022-10-20 08:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:31] <hashar>	 I am running the train
[08:07:15] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/844433 (https://phabricator.wikimedia.org/T320511)
[08:07:17] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/844433 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[08:08:05] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/844433 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[08:10:52] <wikibugs>	 (PS1) Jelto: gitlab_runner: make allowed_images list configurable in hiera [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730)
[08:12:16] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.6  refs T320511
[08:12:21] <stashbot>	 T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[08:15:54] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.6  refs T320511 (duration: 03m 37s)
[08:16:26] <wikibugs>	 SRE, ops-codfw, DBA: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (Peachey88)
[08:20:15] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37599/console" [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[08:21:27] <hashar>	 looks like the train is okish
[08:22:32] <wikibugs>	 (PS2) Jelto: gitlab_runner: make allowed_images list configurable in hiera [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730)
[08:22:49] <wikibugs>	 (CR) JMeybohm: [C: -1] Remove references to deprecated kubeyaml (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: Clément Goubert)
[08:23:23] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37600/console" [puppet] - https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[08:27:59] <wikibugs>	 (PS1) Urbanecm: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082)
[08:30:14] <wikibugs>	 SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Joe) >>! In T291918#7387775, @Joe wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, in the future af...
[08:30:42] <wikibugs>	 (CR) Jelto: [C: -1] "With I112110d2553a41e839f9990c39ac2a872135c588 allowed_images for Trusted Runners and Shared Runners will be separated. I can rebase this " [puppet] - https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: SBassett)
[08:35:05] <wikibugs>	 (Restored) Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: Matthias Mullie)
[08:35:26] <wikibugs>	 (Restored) Matthias Mullie: [SearchVue] Enable extension on beta ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: Matthias Mullie)
[08:35:37] <wikibugs>	 (CR) Kosta Harlan: [C: +1] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082) (owner: Urbanecm)
[08:43:10] <wikibugs>	 SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Clement_Goubert) >>! In T291918#7387656, @jijiki wrote: > Naming things is hard though, I do not agree with the `kube` prefix, in the future after baremetal mediawiki se...
[08:43:22] <urbanecm>	 hi kostajh, should we include https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/844430 in the version bump as well? 
[08:44:21] <kostajh>	 urbanecm: I was going to suggest it, but it seems like either way is fine
[08:44:44] <kostajh>	 There’s a chance it wouldn’t work as I intend which would mean more time to create a revert, build a new image, etc
[08:44:45] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM! Nice to be able to remove stuff :)" [homer/public] - https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:45:12] <urbanecm>	 kostajh: i see. well, at worst i'll try out a revert as well :))
[08:45:30] <urbanecm>	 but if you say there's a chance it doesn't work, perhaps worth doing it in two deployments?
[08:49:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ayounsi) I don't fully understand your comment above.   I see that the alerts above are gone, but are now replaced with: ` test_enabled_not_connected xe-0/0/17  Interface enab...
[08:50:30] <wikibugs>	 (CR) Ayounsi: [C: +2] Management: remove access/wifi exceptions [homer/public] - https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:51:12] <wikibugs>	 (Merged) jenkins-bot: Management: remove access/wifi exceptions [homer/public] - https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:51:58] <icinga-wm>	 RECOVERY - Check systemd state on mw1439 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:35] <wikibugs>	 (PS2) Cparle: Alerts for image suggestions pipeline [alerts] - https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[08:54:18] <wikibugs>	 (Abandoned) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[08:54:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:55:08] <wikibugs>	 (CR) CI reject: [V: -1] Alerts for image suggestions pipeline [alerts] - https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: Cparle)
[08:55:34] <wikibugs>	 (PS1) Ayounsi: Management: remove NAT include [homer/public] - https://gerrit.wikimedia.org/r/844443 (https://phabricator.wikimedia.org/T320962)
[08:56:21] <wikibugs>	 (CR) Ayounsi: [C: +2] Management: remove NAT include [homer/public] - https://gerrit.wikimedia.org/r/844443 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:56:56] <wikibugs>	 (Merged) jenkins-bot: Management: remove NAT include [homer/public] - https://gerrit.wikimedia.org/r/844443 (https://phabricator.wikimedia.org/T320962) (owner: Ayounsi)
[08:59:00] <wikibugs>	 (PS1) Urbanecm: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082)
[08:59:26] <kostajh>	 urbanecm: yeah two separate deployments might make sense here
[08:59:57] <urbanecm>	 kostajh: done :). can i start now? or do you prefer waiting for later in the day?
[09:01:42] <XioNoX>	 !log remove DHCP server and access zone on mr1-eqiad - T320962
[09:01:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:47] <stashbot>	 T320962: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962
[09:01:55] <wikibugs>	 (03PS3) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[09:03:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle)
[09:06:58] <wikibugs>	 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (10Ladsgroup) It's a backup source.
[09:09:19] <kostajh>	 urbanecm: go for it!
[09:09:25] <urbanecm>	 okay!
[09:09:43] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm)
[09:11:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (10Aklapper)
[09:12:26] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[09:12:56] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844439 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm)
[09:13:22] <urbanecm>	 patch is at deploy1002 already, running `helmfile -e staging -i apply` in the service's dir
[09:13:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (10Aklapper) There's lots of (partially heated) past discussion. See stuff like `T289607` (no news for a year), `T250227`, `T6845`, etc etc. Phabricator itself has no captchas involved.
[09:13:32] <urbanecm>	 (following https://wikitech.wikimedia.org/wiki/Add_Link#Deployment_2)
[09:13:34] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[09:13:50] <wikibugs>	 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (10jcrespo) This is weird, //Rebuild// happens when a new disk is added. It is now in a good state: Raid status: OK: optimal, 1 logical, 10 physical, WriteBack policy
[09:13:53] <urbanecm>	 the only changed thing is the version, continuing
[09:14:12] <wikibugs>	 (03PS1) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730)
[09:14:22] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[09:14:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[09:14:52] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[09:15:16] <wikibugs>	 (03PS4) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[09:16:29] <urbanecm>	 `curl "https://staging.svc.eqiad.wmnet:4005/v1/linkrecommendations/wikipedia/be_x_old/Barack_Obama"` returns `{"message":"Page not found: Barack_Obama"}`, while production times out
[09:16:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[09:16:48] <wikibugs>	 (03PS2) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730)
[09:16:51] <urbanecm>	 cswiki's example from https://wikitech.wikimedia.org/wiki/Add_Link#Deployment_2 also works fine
[09:16:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle)
[09:17:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[09:17:33] <urbanecm>	 `service-checker-swagger staging.svc.eqiad.wmnet https://staging.svc.eqiad.wmnet:4005 -t 2 -s /apispec_1.json` also works fine
[09:17:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: Remove nutcracker config values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[09:18:11] <urbanecm>	 kostajh: going with eqiad, unless i need to check anything else on staging?
[09:18:35] <kostajh>	 urbanecm: sounds good
[09:18:38] <urbanecm>	 doing
[09:18:42] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[09:18:54] <urbanecm>	 diff seems fine, continuing
[09:19:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:48] <wikibugs>	 (03PS5) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[09:20:03] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10ayounsi) Access security zone, DHCP server, NAT config removed from the routers.  New DHCP relay feature enabled instead of the old bootp one.  Netbo...
[09:20:38] <wikibugs>	 (03PS3) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730)
[09:20:47] <kostajh>	 urbanecm: cool. sometimes the eqiad one takes a few minutes to roll out
[09:20:52] <urbanecm>	 ack
[09:21:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle)
[09:21:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[09:21:22] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[09:21:41] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: 10Slavina Stefanova)
[09:21:49] <urbanecm>	 works fine in production as well, continuing with codfw
[09:21:50] <wikibugs>	 (03CR) 10David Caro: [V: 03+2 C: 03+2] Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: 10Slavina Stefanova)
[09:21:54] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
[09:22:58] <wikibugs>	 (03PS6) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898
[09:23:30] <wikibugs>	 (03PS7) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 (https://phabricator.wikimedia.org/T313444)
[09:23:36] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
[09:23:52] <urbanecm>	 and seems T321082 is now resolved :)
[09:23:52] <stashbot>	 T321082: Requests to be-x-old.wikipedia.org result in HTTP 504 Gateway Timeout - https://phabricator.wikimedia.org/T321082
[09:24:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:24:32] <wikibugs>	 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) I'm trying to capture this project also in https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProp...
[09:24:40] <wikibugs>	 (03PS8) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (https://phabricator.wikimedia.org/T313444)
[09:26:08] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/832259 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:27:19] <wikibugs>	 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10taavi) >>! In T314847#8326550, @cmooney wrote: >>> /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the s...
[09:27:27] <wikibugs>	 (03PS6) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[09:28:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle)
[09:29:05] <wikibugs>	 (03PS1) 10JMeybohm: Provide the cluster_cidr to kube-proxy on masters as well [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500)
[09:32:22] <wikibugs>	 (03PS7) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235)
[09:34:08] <wikibugs>	 (03PS4) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730)
[09:34:36] <wikibugs>	 (03PS1) 10JMeybohm: Provide the cluster_cidr to kube-proxy in wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500)
[09:34:38] <wikibugs>	 (03PS1) 10JMeybohm: Provide the cluster_cidr to kube-proxy in wikikube eqiad [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500)
[09:40:59] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert)
[09:43:00] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:43:10] <wikibugs>	 (03PS5) 10Btullis: Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730)
[09:44:17] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37604/console" [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[09:45:11] <hoo>	 _joe_: Any news on the maintenance script patch?
[09:45:43] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) `mw-main` is probably the least misleading one, yes. I would like `mw-web` more, but it's going to mislead a lot of people into thinking it's just requests to wiki...
[09:46:09] <_joe_>	 hoo: sorry, no, I've been too busy the last couple days :/
[09:46:16] <wikibugs>	 (03CR) 10David Caro: P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah)
[09:46:19] <_joe_>	 my bad
[09:47:16] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to `mw-api-ext`?
[09:47:56] <hoo>	 No worries, just wanted to make sure it's not forgotten
[09:48:03] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#8328103, @Clement_Goubert wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, i...
[09:48:05] <wikibugs>	 (03PS2) 10David Caro: P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[09:48:34] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man)
[09:50:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[09:53:45] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) >>! In T291918#8328319, @Clement_Goubert wrote: > This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to...
[09:54:31] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[09:56:14] <wikibugs>	 (03CR) 10David Caro: P:toolforge: use puppetdb for grid hba data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[09:58:03] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:58:06] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[09:58:28] <wikibugs>	 (03PS2) 10Hnowlan: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196)
[10:00:31] <wikibugs>	 (03PS3) 10Majavah: P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163)
[10:00:49] <wikibugs>	 (03CR) 10Majavah: P:toolforge: use puppetdb for grid hba data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[10:01:31] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) Given both of your answers, I think `mw-web` is actually the better choice, barring calling it `mw-real-users` which is kind of weird. The API calls fro...
[10:01:56] <wikibugs>	 (03PS1) 10Volans: wmnet: remove subnet used for eqiad's wifi [dns] - 10https://gerrit.wikimedia.org/r/844451 (https://phabricator.wikimedia.org/T320962)
[10:02:12] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:02:23] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mwdebug: Remove nutcracker config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:04:09] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[10:04:11] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mwdebug: Remove nutcracker config values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:04:29] <wikibugs>	 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on db2139 - https://phabricator.wikimedia.org/T321147 (10jcrespo) 05Open→03Resolved a:03jcrespo Resolved- this is a backup host- it is ok to ignore it unless it reappears.
[10:04:59] <wikibugs>	 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) >>! In T314847#8328277, @taavi wrote: > Your comment was written in a way that made me understand that everything used in cod...
[10:05:18] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "The prefix has been removed from Netbox" [dns] - 10https://gerrit.wikimedia.org/r/844451 (https://phabricator.wikimedia.org/T320962) (owner: 10Volans)
[10:05:20] <wikibugs>	 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) >>! In T314847#8328277, @taavi wrote: >>>! In T314847#8328272, @aborrero wrote: >> HAproxy uses LVS/ipvsadm for them under t...
[10:06:34] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:06:36] <wikibugs>	 (03Merged) 10jenkins-bot: mwdebug: Remove nutcracker config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:06:46] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326)
[10:07:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:07:45] <wikibugs>	 (03Merged) 10jenkins-bot: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[10:08:40] <wikibugs>	 (03PS5) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196)
[10:08:50] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:09:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:10:38] <wikibugs>	 (03CR) 10Jelto: "With I112110d2553a41e839f9990c39ac2a872135c588 allowed_images for Trusted Runners and Shared Runners will be separated. I can rebase this " [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) (owner: 10Brennen Bearnes)
[10:13:06] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:17:33] <claime>	 !log Deploying mediawiki helm chart v0.2.4 on k8s-experimental mwdebug - T321042
[10:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:38] <stashbot>	 T321042: Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042
[10:17:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:18:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:18:42] <wikibugs>	 (03PS1) 10Majavah: openstack: remove unused nova::placement manifests [puppet] - 10https://gerrit.wikimedia.org/r/844455
[10:18:56] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:19:49] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37607/console" [puppet] - 10https://gerrit.wikimedia.org/r/844455 (owner: 10Majavah)
[10:20:14] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:20:33] <jnuche>	 jouncebot: nowandnext
[10:20:33] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 39 minute(s)
[10:20:33] <jouncebot>	 In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300)
[10:22:06] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:28:08] <wikibugs>	 (03PS1) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312)
[10:28:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 77 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:29:55] <wikibugs>	 (03PS2) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312)
[10:31:40] <wikibugs>	 (03PS1) 10Jbond: P:lvs::configueration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:31:44] <wikibugs>	 (03PS3) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312)
[10:32:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: 10Majavah)
[10:32:44] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37611/console" [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: 10Majavah)
[10:32:46] <wikibugs>	 (03PS2) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:33:47] <wikibugs>	 (03PS4) 10Majavah: P:openstack: expose remaining APIs to the internet [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312)
[10:33:49] <wikibugs>	 (03PS1) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042)
[10:34:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[10:34:48] <wikibugs>	 (03PS3) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:36:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[10:36:39] <wikibugs>	 (03PS4) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:37:52] <wikibugs>	 (03PS5) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:39:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[10:40:54] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:42:13] <wikibugs>	 (03PS2) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042)
[10:42:40] <wikibugs>	 (03PS1) 10Btullis: Add postgresql replication password for new an-db servers [labs/private] - 10https://gerrit.wikimedia.org/r/844460 (https://phabricator.wikimedia.org/T319440)
[10:42:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:43:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: P:lvs::configuration: Store all site data in an accessible structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[10:43:32] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add postgresql replication password for new an-db servers [labs/private] - 10https://gerrit.wikimedia.org/r/844460 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[10:44:33] <wikibugs>	 (03PS3) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042)
[10:44:39] <wikibugs>	 (03PS6) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:45:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:45:58] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37619/console" [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:47:14] <wikibugs>	 (03PS4) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042)
[10:54:19] <wikibugs>	 (03PS5) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042)
[10:54:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:56:50] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37623/console" [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[10:57:20] <wikibugs>	 (03PS7) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[10:58:25] <wikibugs>	 (03PS6) 10Clément Goubert: kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042)
[11:01:00] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37624/console" [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert)
[11:02:00] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:02:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[11:03:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[11:03:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35553 and previous config saved to /var/cache/conftool/dbconfig/20221019-110308-ladsgroup.json
[11:03:13] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:05:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:05:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:05:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35554 and previous config saved to /var/cache/conftool/dbconfig/20221019-110552-ladsgroup.json
[11:05:57] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:06:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[11:06:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[11:06:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T314041)', diff saved to https://phabricator.wikimedia.org/P35555 and previous config saved to /var/cache/conftool/dbconfig/20221019-110635-ladsgroup.json
[11:06:40] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:07:19] <wikibugs>	 (03PS6) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196)
[11:07:46] <Emperor>	 !log upload wmf-beamer-style 0.2 to apt
[11:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35556 and previous config saved to /var/cache/conftool/dbconfig/20221019-110902-ladsgroup.json
[11:10:18] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:10:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:10:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:11:27] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:12:48] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:13:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 38): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37626/console" [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[11:13:38] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:14:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:16:01] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1138 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/844015 (https://phabricator.wikimedia.org/T321177)
[11:16:05] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177)
[11:17:10] <wikibugs>	 (03CR) 10Volans: wmnet: Update s4-master alias (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) (owner: 10Gerrit maintenance bot)
[11:17:33] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/844017 (https://phabricator.wikimedia.org/T321178)
[11:17:37] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/844018 (https://phabricator.wikimedia.org/T321178)
[11:18:35] <wikibugs>	 10SRE: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10hnowlan) This _appears_ to have abated somewhat? https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13&from=1664190035000&to=now
[11:21:56] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:22:56] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:23:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:24:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P35557 and previous config saved to /var/cache/conftool/dbconfig/20221019-112409-ladsgroup.json
[11:25:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:25:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: P:lvs::configuration: Store all site data in an accessible structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[11:29:22] <wikibugs>	 (03PS2) 10Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367)
[11:29:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35558 and previous config saved to /var/cache/conftool/dbconfig/20221019-112925-ladsgroup.json
[11:29:30] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:29:38] <wikibugs>	 (03PS3) 10Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367)
[11:30:03] <jnuche>	 jouncebot: nowandnext
[11:30:03] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 29 minute(s)
[11:30:03] <jouncebot>	 In 1 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300)
[11:30:37] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.27.1" for 553 hosts
[11:39:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P35559 and previous config saved to /var/cache/conftool/dbconfig/20221019-113915-ladsgroup.json
[11:41:03] <wikibugs>	 (03PS8) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458
[11:42:09] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) Since there seems to be consensus on everything but `mw-{app,main,web}`, I'll consider these other service names as valid going forward unless told othe...
[11:42:48] <wikibugs>	 (03CR) 10Jbond: P:lvs::configuration: Store all site data in an accessible structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond)
[11:43:39] <wikibugs>	 (03PS1) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[11:43:43] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.27.1" for 552 hosts
[11:43:46] <wikibugs>	 (03PS3) 10Matthias Mullie: [SearchVue] Enable extension on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367)
[11:43:59] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.27.1" completed for 552 hosts
[11:44:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P35560 and previous config saved to /var/cache/conftool/dbconfig/20221019-114431-ladsgroup.json
[11:45:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37633/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[11:46:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Please test this first in toolsbeta with a livehack in the puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[11:47:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 1: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[11:47:26] <wikibugs>	 (03PS2) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[11:48:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37634/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[11:49:58] <wikibugs>	 (03CR) 10Volans: "my 2 cents inline" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[11:54:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35561 and previous config saved to /var/cache/conftool/dbconfig/20221019-115421-ladsgroup.json
[11:54:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[11:54:27] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:54:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[11:54:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35562 and previous config saved to /var/cache/conftool/dbconfig/20221019-115443-ladsgroup.json
[11:55:35] <wikibugs>	 (03PS3) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[11:56:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37635/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[11:59:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P35563 and previous config saved to /var/cache/conftool/dbconfig/20221019-115938-ladsgroup.json
[12:00:27] <wikibugs>	 (03PS4) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[12:01:23] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37636/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[12:04:35] <wikibugs>	 (03Abandoned) 10Cparle: Alert for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle)
[12:14:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35564 and previous config saved to /var/cache/conftool/dbconfig/20221019-121444-ladsgroup.json
[12:14:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[12:14:50] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:15:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[12:15:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T318955)', diff saved to https://phabricator.wikimedia.org/P35565 and previous config saved to /var/cache/conftool/dbconfig/20221019-121506-ladsgroup.json
[12:15:16] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37637/console" [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah)
[12:19:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T314041)', diff saved to https://phabricator.wikimedia.org/P35566 and previous config saved to /var/cache/conftool/dbconfig/20221019-121939-ladsgroup.json
[12:19:45] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:26:31] <wikibugs>	 (03PS1) 10Jbond: wmflib::selector: dummy selector to query things in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/844464
[12:27:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::selector: dummy selector to query things in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/844464 (owner: 10Jbond)
[12:27:17] <XioNoX>	 !log remove cr4-ulsfo SV8 RS sessions
[12:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:43] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[12:34:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P35567 and previous config saved to /var/cache/conftool/dbconfig/20221019-123446-ladsgroup.json
[12:39:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318955)', diff saved to https://phabricator.wikimedia.org/P35568 and previous config saved to /var/cache/conftool/dbconfig/20221019-123946-ladsgroup.json
[12:39:52] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:45:52] <wikibugs>	 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10Aklapper)
[12:49:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P35569 and previous config saved to /var/cache/conftool/dbconfig/20221019-124952-ladsgroup.json
[12:51:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35570 and previous config saved to /var/cache/conftool/dbconfig/20221019-125101-ladsgroup.json
[12:51:06] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:54:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[12:54:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P35571 and previous config saved to /var/cache/conftool/dbconfig/20221019-125452-ladsgroup.json
[12:56:18] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10JMeybohm) p:05Medium→03Low
[12:58:28] <wikibugs>	 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10LSobanski) @hnowlan Is this something that still needs to happen and if yes, who would own the next step?
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300). Please do the needful.
[13:00:05] <jouncebot>	 matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <matthiasmullie>	 o/
[13:00:21] <Lucas_WMDE>	 I can’t deploy yet (maybe later if nobody else is available)
[13:00:43] <matthiasmullie>	 I have 3 patches.
[13:00:48] <matthiasmullie>	 The 1.40.0-wmf.6 backport doesn't need to be tested on mwdebug - it's specific to that branch & wikipedias, so there's no place to test yet. Happy to self-deploy this one.
[13:01:01] <matthiasmullie>	 The other 2 are about enabling a new extension on beta (+ extension-list inclusion) & don't directly impact prod, but IDK whether scap is needed in this case... :p Guidance on how to do these is much appreciated!
[13:02:55] <wikibugs>	 10SRE, 10PyBal, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Cleanup pybal Prometheus metrics on monitor stop() - https://phabricator.wikimedia.org/T321191 (10fgiunchedi)
[13:04:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T314041)', diff saved to https://phabricator.wikimedia.org/P35572 and previous config saved to /var/cache/conftool/dbconfig/20221019-130459-ladsgroup.json
[13:05:04] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:05:06] <wikibugs>	 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10serviceops: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10LSobanski)
[13:05:20] <Lucas_WMDE>	 matthiasmullie: at least extension-list, CS.php and IS.php should still be synced I think
[13:05:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Clean up monitor metrics on stop() [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/844469 (https://phabricator.wikimedia.org/T321191)
[13:05:36] <Lucas_WMDE>	 you can probably let `scap backport` do everything, and IIRC it’ll skip IS-labs.php on its own
[13:05:46] <Lucas_WMDE>	 feel free to self-service
[13:05:51] <wikibugs>	 (03PS17) 10Btullis: Add postgresql to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440)
[13:06:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35573 and previous config saved to /var/cache/conftool/dbconfig/20221019-130607-ladsgroup.json
[13:06:37] <matthiasmullie>	 Okay; I'll get started & scap them all until/unless someone stops me!
[13:06:49] <matthiasmullie>	 Thanks, Lucas_WMDE
[13:07:17] <wikibugs>	 (03CR) 10Btullis: "Adding Jaime to reviewers, particularly with reference to the new bacula client configuration." [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[13:07:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843913 (https://phabricator.wikimedia.org/T320337) (owner: 10Matthias Mullie)
[13:08:50] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (10LSobanski)
[13:09:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P35574 and previous config saved to /var/cache/conftool/dbconfig/20221019-130959-ladsgroup.json
[13:11:42] <wikibugs>	 10SRE, 10API Platform, 10serviceops: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10LSobanski)
[13:14:30] <wikibugs>	 (03PS1) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167)
[13:14:40] <wikibugs>	 10SRE, 10Cloud-Services, 10Developer-Advocacy, 10Infrastructure-Foundations, 10LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10LSobanski)
[13:15:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu)
[13:15:13] <hashar>	 jouncebot: now
[13:15:13] <jouncebot>	 For the next 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1300)
[13:15:40] <wikibugs>	 (03PS5) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[13:18:30] <wikibugs>	 (03PS2) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167)
[13:19:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu)
[13:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35575 and previous config saved to /var/cache/conftool/dbconfig/20221019-132114-ladsgroup.json
[13:25:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318955)', diff saved to https://phabricator.wikimedia.org/P35576 and previous config saved to /var/cache/conftool/dbconfig/20221019-132505-ladsgroup.json
[13:25:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[13:25:11] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:25:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[13:25:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35577 and previous config saved to /var/cache/conftool/dbconfig/20221019-132527-ladsgroup.json
[13:26:46] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:27:22] <wikibugs>	 (03Merged) 10jenkins-bot: Add default value for search-thumbnail-extra-namespaces [core] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843913 (https://phabricator.wikimedia.org/T320337) (owner: 10Matthias Mullie)
[13:27:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8220
[13:27:46] <logmsgbot>	 !log mlitn@deploy1002 Started scap: Backport for [[gerrit:843913|Add default value for search-thumbnail-extra-namespaces (T320337)]]
[13:27:51] <stashbot>	 T320337: [M] Special:Search results should have a way to suppress thumbnails - https://phabricator.wikimedia.org/T320337
[13:28:08] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:28:11] <logmsgbot>	 !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:843913|Add default value for search-thumbnail-extra-namespaces (T320337)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:28:57] <logmsgbot>	 !log hashar@deploy1002 backport Cancelled
[13:29:04] <Lucas_WMDE>	 o/
[13:29:18] <Lucas_WMDE>	 ok, I’m around now if needed :)
[13:29:24] <hashar>	 I did not even start that one! :D
[13:30:10] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:30:17] <wikibugs>	 (03PS1) 10Hashar: Downgrade lcobucci/jwt (4.2.1 => 4.1.5) [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160)
[13:30:42] <wikibugs>	 (03CR) 10Jcrespo: "Please note there were some issues with postgres backups: https://phabricator.wikimedia.org/T316655 so we may need to redo them soon- but" [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[13:31:13] <matthiasmullie>	 Lucas_WMDE: First patch syncing now; I can just go ahead and `scap backport` both of the other (beta, extension-list etc) patches myself (unless things fall apart :p), ok?
[13:31:23] <Lucas_WMDE>	 ok, sure!
[13:31:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8220
[13:32:27] <logmsgbot>	 !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:843913|Add default value for search-thumbnail-extra-namespaces (T320337)]] (duration: 04m 41s)
[13:33:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie)
[13:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie)
[13:34:16] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "I like this, and I think it can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah)
[13:34:23] <logmsgbot>	 !log mlitn@deploy1002 Started scap: Backport for [[gerrit:830874|Add SearchVue to extension-list and config var (T310367)]]
[13:34:29] <stashbot>	 T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367
[13:36:05] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37642/console" [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:36:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35578 and previous config saved to /var/cache/conftool/dbconfig/20221019-133620-ladsgroup.json
[13:36:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[13:36:26] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:36:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[13:36:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T318950)', diff saved to https://phabricator.wikimedia.org/P35579 and previous config saved to /var/cache/conftool/dbconfig/20221019-133642-ladsgroup.json
[13:37:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Provide the cluster_cidr to kube-proxy on masters as well [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:38:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318950)', diff saved to https://phabricator.wikimedia.org/P35580 and previous config saved to /var/cache/conftool/dbconfig/20221019-133852-ladsgroup.json
[13:39:25] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Add config for redirect badges on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große)
[13:40:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10herron) Yes the other kafka-logging hosts were switched to raid50 (hardware) to provide additional capacity vs raid10.  It should appear to the OS...
[13:40:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Rebased because Gerrit and local Git reported conflicts (even though the rebase itself worked without issue). Scheduled for next week; -2i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große)
[13:40:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1001.eqiad.wmnet with OS bullseye
[13:40:49] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37643/console" [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:41:40] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+1] Provide the cluster_cidr to kube-proxy in wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:41:50] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add known_uid_mapping support to the production-images for spark [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[13:42:17] <wikibugs>	 (03PS6) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[13:43:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37644/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[13:43:33] <logmsgbot>	 !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:830874|Add SearchVue to extension-list and config var (T310367)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:43:38] <stashbot>	 T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367
[13:44:06] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37645/console" [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:44:57] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+1] Provide the cluster_cidr to kube-proxy in wikikube eqiad [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:46:04] <wikibugs>	 (03PS7) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[13:46:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Provide the cluster_cidr to kube-proxy on masters as well [puppet] - 10https://gerrit.wikimedia.org/r/844446 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm)
[13:46:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Generally LGTM! Nicely done." [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle)
[13:47:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37646/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[13:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35581 and previous config saved to /var/cache/conftool/dbconfig/20221019-134942-ladsgroup.json
[13:49:46] <wikibugs>	 (03CR) 10Elukey: Add known_uid_mapping support to the production-images for spark (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[13:49:48] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:50:55] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "Yep, this was removed here I626d9c54c6abae9f20bae111c4eb7ac9194a223c and Ib9144f1871a099a019951efded2eec4879c0a3c3" [puppet] - 10https://gerrit.wikimedia.org/r/844455 (owner: 10Majavah)
[13:52:26] <wikibugs>	 (03PS8) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[13:52:54] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1001.eqiad.wmnet with reason: host reimage
[13:53:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37647/console" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[13:53:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P35582 and previous config saved to /var/cache/conftool/dbconfig/20221019-135358-ladsgroup.json
[13:54:28] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[13:55:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1001.eqiad.wmnet with reason: host reimage
[13:56:13] <wikibugs>	 (03PS1) 10Btullis: Add a default value of undefined for the docker uid hash [puppet] - 10https://gerrit.wikimedia.org/r/844484 (https://phabricator.wikimedia.org/T318730)
[13:57:19] <wikibugs>	 (03CR) 10Btullis: "This was changed as the result of a comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/844445/6/modules/profile/manifests/doc" [puppet] - 10https://gerrit.wikimedia.org/r/844484 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[13:57:37] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add EChetty and  to acl*phabricator group on phab - https://phabricator.wikimedia.org/T321197 (10EChetty)
[13:58:02] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add EChetty and  to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321197 (10EChetty)
[13:58:20] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add known_uid_mapping support to the production-images for spark (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[13:59:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "Let's definitely skip Barbican for now (since it's only deployed in codfw1dev and has some security concerns in the current release.)" [puppet] - 10https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: 10Majavah)
[13:59:08] <wikibugs>	 10SRE, 10SRE Observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (10lmata)
[14:00:26] <logmsgbot>	 !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:830874|Add SearchVue to extension-list and config var (T310367)]] (duration: 26m 03s)
[14:00:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie)
[14:00:31] <stashbot>	 T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367
[14:00:40] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add JArguello and to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321198 (10EChetty)
[14:01:12] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add EChetty to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321197 (10EChetty)
[14:01:20] <wikibugs>	 (03Merged) 10jenkins-bot: [SearchVue] Enable extension on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie)
[14:01:40] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10acl*access-policy-approvers: Add JArguello to acl_phabricator group on phab - https://phabricator.wikimedia.org/T321198 (10EChetty)
[14:02:01] <matthiasmullie>	 !log UTC afternoon backports done
[14:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Ramp up SV1 IXP - https://phabricator.wikimedia.org/T321193 (10ayounsi)
[14:03:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson many thanks indeed, that's great. :+1:
[14:03:29] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 90 probes of 696 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:04:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P35583 and previous config saved to /var/cache/conftool/dbconfig/20221019-140449-ladsgroup.json
[14:05:07] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:43] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Downgrade lcobucci/jwt (4.2.1 => 4.1.5) [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar)
[14:06:50] <hashar>	 I am going to deploy the vendor hotfix https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/844035
[14:06:56] <hashar>	 well once CI has merged it :D
[14:09:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P35584 and previous config saved to /var/cache/conftool/dbconfig/20221019-140905-ladsgroup.json
[14:10:06] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-db1001.eqiad.wmnet with OS bullseye
[14:10:17] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:12:26] <wikibugs>	 (03PS1) 10Matthias Mullie: Fix value for wgQuickViewMediaRepositorySearchUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844485
[14:12:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "mostly fine but a few nits" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[14:17:33] <wikibugs>	 (03Abandoned) 10Jbond: wmflib::selector: dummy selector to query things in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/844464 (owner: 10Jbond)
[14:18:25] <wikibugs>	 (03PS3) 10Andrew Bogott: Allow cloud_provider_enabled [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) (owner: 10Vivian Rook)
[14:19:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P35585 and previous config saved to /var/cache/conftool/dbconfig/20221019-141955-ladsgroup.json
[14:20:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Allow cloud_provider_enabled [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) (owner: 10Vivian Rook)
[14:23:41] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] "The psql/service side LGTM. I'll let Jaime comment on the backup bits." [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[14:23:47] <wikibugs>	 (03Merged) 10jenkins-bot: Downgrade lcobucci/jwt (4.2.1 => 4.1.5) [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar)
[14:24:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318950)', diff saved to https://phabricator.wikimedia.org/P35586 and previous config saved to /var/cache/conftool/dbconfig/20221019-142411-ladsgroup.json
[14:24:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[14:24:17] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[14:24:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[14:24:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T318950)', diff saved to https://phabricator.wikimedia.org/P35587 and previous config saved to /var/cache/conftool/dbconfig/20221019-142433-ladsgroup.json
[14:25:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar)
[14:25:32] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]]
[14:25:36] <stashbot>	 T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160
[14:25:58] <logmsgbot>	 !log hashar@deploy1002 hashar and hashar: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:26:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318950)', diff saved to https://phabricator.wikimedia.org/P35588 and previous config saved to /var/cache/conftool/dbconfig/20221019-142643-ladsgroup.json
[14:29:00] <logmsgbot>	 !log hashar@deploy1002 sync-world aborted: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] (duration: 03m 27s)
[14:29:00] <logmsgbot>	 !log hashar@deploy1002 backport aborted:  (duration: 05m 09s)
[14:29:09] <hashar>	 grbmbmbl
[14:29:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [vendor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844035 (https://phabricator.wikimedia.org/T321160) (owner: 10Hashar)
[14:29:35] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]]
[14:29:58] <logmsgbot>	 !log hashar@deploy1002 hashar and hashar: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[14:30:18] <dancy>	 What's up hashar?
[14:32:03] <hashar>	 dancy: nothing, I pressed Enter or Ctrl+C instead of y when validating to continue with sync
[14:32:15] <dancy>	 oops!
[14:32:39] <wikibugs>	 (03PS1) 10Clément Goubert: admin: Add mw-debug namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201)
[14:32:59] <hashar>	 I have also learned a lot about install-world this morning thanks to jnuche  ;-]
[14:33:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 701
[14:33:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 701
[14:33:50] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1239
[14:34:00] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:844035|Downgrade lcobucci/jwt (4.2.1 => 4.1.5) (T321160)]] (duration: 04m 25s)
[14:34:06] <stashbot>	 T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160
[14:34:35] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 1239
[14:34:36] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2516
[14:34:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2516
[14:34:53] <jnuche>	 hashar :)
[14:35:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35589 and previous config saved to /var/cache/conftool/dbconfig/20221019-143501-ladsgroup.json
[14:35:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[14:35:07] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:35:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[14:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T318955)', diff saved to https://phabricator.wikimedia.org/P35590 and previous config saved to /var/cache/conftool/dbconfig/20221019-143523-ladsgroup.json
[14:35:53] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "LGTM as approach, two issues inline" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[14:36:56] <wikibugs>	 (03CR) 10Hnowlan: helmfile.d: add thumbor configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:37:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T318955)', diff saved to https://phabricator.wikimedia.org/P35591 and previous config saved to /var/cache/conftool/dbconfig/20221019-143736-ladsgroup.json
[14:39:15] <wikibugs>	 (03PS9) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461
[14:39:29] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[14:40:29] <hashar>	 there are a few more error logs, I guess I will file tasks for them later this evening
[14:41:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P35592 and previous config saved to /var/cache/conftool/dbconfig/20221019-144150-ladsgroup.json
[14:43:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond)
[14:49:23] <jnuche>	 jouncebot: nowandnext
[14:49:23] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 10 minute(s)
[14:49:23] <jouncebot>	 In 3 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800)
[14:49:23] <jouncebot>	 In 3 hour(s) and 10 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800)
[14:50:19] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.27.1" for 1 hosts
[14:50:25] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.27.1" completed for 1 hosts
[14:50:39] <wikibugs>	 (03PS1) 10Clément Goubert: hieradata: Add usernames for mw-debug k8s service [puppet] - 10https://gerrit.wikimedia.org/r/844491 (https://phabricator.wikimedia.org/T321201)
[14:52:34] <icinga-wm>	 RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops
[14:52:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P35593 and previous config saved to /var/cache/conftool/dbconfig/20221019-145242-ladsgroup.json
[14:56:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P35594 and previous config saved to /var/cache/conftool/dbconfig/20221019-145658-ladsgroup.json
[14:57:40] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1001.eqiad.wmnet
[14:57:41] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[14:59:10] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844020
[14:59:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[14:59:44] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[15:00:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[15:00:06] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[15:00:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[15:05:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-10-18-161910-production [puppet] - 10https://gerrit.wikimedia.org/r/844063 (https://phabricator.wikimedia.org/T316991) (owner: 10BryanDavis)
[15:05:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:07:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P35595 and previous config saved to /var/cache/conftool/dbconfig/20221019-150749-ladsgroup.json
[15:08:06] <bd808>	 !log Forcing puppet runs on cloudweb100[34] to deploy a new version of Striker
[15:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (10elukey) Any news? :)
[15:10:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318950)', diff saved to https://phabricator.wikimedia.org/P35596 and previous config saved to /var/cache/conftool/dbconfig/20221019-151204-ladsgroup.json
[15:12:09] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:22:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T318955)', diff saved to https://phabricator.wikimedia.org/P35597 and previous config saved to /var/cache/conftool/dbconfig/20221019-152256-ladsgroup.json
[15:22:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[15:23:03] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:23:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[15:23:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35598 and previous config saved to /var/cache/conftool/dbconfig/20221019-152318-ladsgroup.json
[15:28:08] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:28:08] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1001.eqiad.wmnet on all recursors
[15:28:11] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1001.eqiad.wmnet on all recursors
[15:31:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron)
[15:41:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye
[15:41:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye
[15:45:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) p:05Triage→03Medium
[15:46:18] <wikibugs>	 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10serviceops: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10hnowlan)
[15:46:24] <wikibugs>	 (03CR) 10Btullis: Add postgresql to an-db100[1-2] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[15:47:15] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) a:03Arian_Bozorg Hi @Arian_Bozorg, following the instructions in https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#wmde_access could you please coordinate obtaini...
[15:48:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10WMDE-leszek) @herron I approve the request on WMDE's behalf. Thanks
[15:49:05] <wikibugs>	 (03PS3) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167)
[15:49:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) a:05Arian_Bozorg→03None >>! In T320384#8329638, @WMDE-leszek wrote: > @herron I approve the request on WMDE's behalf. Thanks  That was quick!  Thank you :)
[15:49:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu)
[15:50:00] <wikibugs>	 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) I think it's still a relevant question, if we can get to this work before RESTbase deprecation takes hold. Rather than open the port I think it makes sense to test disabling the service....
[15:50:12] <wikibugs>	 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) a:03hnowlan
[15:50:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35599 and previous config saved to /var/cache/conftool/dbconfig/20221019-155015-ladsgroup.json
[15:50:20] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:51:31] <wikibugs>	 (03PS1) 10Herron: admin: add arbo to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384)
[15:51:38] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:51:45] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1001.eqiad.wmnet
[15:51:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1004.eqiad.wmnet with OS bullseye
[15:52:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye exe...
[15:52:10] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye
[15:52:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye
[15:53:00] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:53:47] <wikibugs>	 (03PS4) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167)
[15:53:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:54:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:54:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu)
[15:57:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:57:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:58:48] <wikibugs>	 (03PS5) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167)
[15:58:58] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.783 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:59:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:01:54] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1002.eqiad.wmnet with OS bullseye
[16:02:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, you should add them to 2 groups, 'nda' and 'wmde'" [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384) (owner: 10Herron)
[16:03:24] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: add arbo to ldap only users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384) (owner: 10Herron)
[16:03:54] <icinga-wm>	 PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:04:20] <wikibugs>	 (03PS3) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[16:04:22] <wikibugs>	 (03PS2) 10Herron: admin: add arbo to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/844497 (https://phabricator.wikimedia.org/T320384)
[16:04:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add a default value of undefined for the docker uid hash [puppet] - 10https://gerrit.wikimedia.org/r/844484 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[16:05:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P35600 and previous config saved to /var/cache/conftool/dbconfig/20221019-160521-ladsgroup.json
[16:06:54] <wikibugs>	 (03PS1) 10Elukey: WIP - coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159)
[16:07:35] <wikibugs>	 (03CR) 10Dzahn: "Has it been cleared up if shell access is needed? A comment on the ticket says it's not if this is only for superset. I would suggest to d" [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron)
[16:08:40] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1004.eqiad.wmnet with OS bullseye
[16:08:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye exe...
[16:08:55] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye
[16:09:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye
[16:09:36] <wikibugs>	 (03CR) 10Elukey: "After running docker-pkg locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey)
[16:11:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron)
[16:13:37] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10herron) 05Open→03Resolved a:03herron Group membership has been granted.  Transitioning this to resolved now, please reopen if any follow up is needed.  Thanks!
[16:14:22] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1002.eqiad.wmnet with reason: host reimage
[16:15:10] <wikibugs>	 (03PS2) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082)
[16:15:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm)
[16:17:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1002.eqiad.wmnet with reason: host reimage
[16:19:14] <wikibugs>	 (03PS4) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[16:19:46] <mutante>	 !log wikitech - added herron to 'content administrators'
[16:19:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:15] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/844444 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm)
[16:20:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P35601 and previous config saved to /var/cache/conftool/dbconfig/20221019-162028-ladsgroup.json
[16:20:33] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1004.eqiad.wmnet with reason: host reimage
[16:21:26] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[16:22:01] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[16:22:58] <wikibugs>	 (03PS1) 10Urbanecm: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844038 (https://phabricator.wikimedia.org/T321082)
[16:23:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844038 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm)
[16:23:58] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1004.eqiad.wmnet with reason: host reimage
[16:24:16] <wikibugs>	 (03PS2) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057)
[16:25:50] <wikibugs>	 (03PS1) 10Cwhite: opensearch: upgrade curator to 5.8.5-1~wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/844023 (https://phabricator.wikimedia.org/T304440)
[16:26:35] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844038 (https://phabricator.wikimedia.org/T321082) (owner: 10Urbanecm)
[16:27:15] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[16:27:24] <wikibugs>	 (03CR) 10Herron: admin: add damilare to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron)
[16:27:32] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[16:27:36] <wikibugs>	 (03PS3) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057)
[16:30:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068) (owner: 10Herron)
[16:31:25] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-db1002.eqiad.wmnet with OS bullseye
[16:31:57] <wikibugs>	 (03PS2) 10Herron: admin: add ssh key for hshaikh [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068)
[16:32:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron)
[16:32:06] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504
[16:32:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 (owner: 10JHathaway)
[16:34:55] <wikibugs>	 (03CR) 10Herron: [C: 03+1] admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn)
[16:34:57] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] "PCC OK: https://puppet-compiler.wmflabs.org/pcc-worker1003/37651/" [puppet] - 10https://gerrit.wikimedia.org/r/844023 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite)
[16:35:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35602 and previous config saved to /var/cache/conftool/dbconfig/20221019-163534-ladsgroup.json
[16:35:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[16:35:39] <wikibugs>	 (03CR) 10Herron: [C: 03+1] admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn)
[16:35:40] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[16:35:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[16:36:45] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: add ssh key for hshaikh [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068) (owner: 10Herron)
[16:37:21] <wikibugs>	 (03PS4) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057)
[16:37:30] <wikibugs>	 (03PS2) 10JHathaway: aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 (https://phabricator.wikimedia.org/T321134)
[16:38:32] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) (owner: 10Herron)
[16:39:25] <icinga-wm>	 PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:40:09] <icinga-wm>	 PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:40:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn)
[16:40:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) (owner: 10Dzahn)
[16:40:45] <wikibugs>	 (03PS3) 10Dzahn: admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937)
[16:42:06] <wikibugs>	 (03PS1) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312)
[16:42:11] <wikibugs>	 (03PS2) 10Dzahn: admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937)
[16:43:24] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1004.eqiad.wmnet with OS bullseye
[16:43:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye com...
[16:43:29] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux-k8s-etcd: mac address for aux-k8s-etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/844504 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway)
[16:43:49] <wikibugs>	 (03PS2) 10SBassett: Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961)
[16:44:49] <wikibugs>	 (03PS2) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312)
[16:45:14] <wikibugs>	 (03PS6) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167)
[16:45:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10herron) 05Open→03Resolved a:03herron The requested group membership has been granted and will propagate fully within the next 30 minute...
[16:45:49] <wikibugs>	 (03CR) 10SBassett: Add registry.gitlab.com/security-products/**/* as allowed images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett)
[16:45:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu)
[16:46:19] <wikibugs>	 (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37655/console" [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu)
[16:46:31] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for GFontenelle - https://phabricator.wikimedia.org/T321218 (10GFontenelle_WMF)
[16:47:03] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10herron) 05Open→03Resolved a:03herron The requested access has been granted and will propagate fully within the next 30 minutes. I'll transition thi...
[16:48:28] <wikibugs>	 (03PS3) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312)
[16:49:25] <mutante>	 jhathaway: I typed "multiple", so your change is merged
[16:49:44] <jhathaway>	 oooh, thanks!
[16:56:58] <wikibugs>	 (03CR) 10Hashar: "I have to cherry-pick that on the devtools WMCS project to exercise it." [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar)
[16:59:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[16:59:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[16:59:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:59:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:00:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T318955)', diff saved to https://phabricator.wikimedia.org/P35603 and previous config saved to /var/cache/conftool/dbconfig/20221019-170002-ladsgroup.json
[17:00:07] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[17:02:07] <wikibugs>	 (03PS4) 10Andrew Bogott: Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312)
[17:03:48] <icinga-wm>	 RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:03:55] <wikibugs>	 (03CR) 10Andrew Bogott: "pcc: https://puppet-compiler.wmflabs.org/pcc-worker1003/37658/" [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott)
[17:04:14] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844024
[17:07:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, https://puppet-compiler.wmflabs.org/pcc-worker1002/37657/gitlab-runner2004.codfw.wmnet/index.html ( I don't think we can expect to s" [puppet] - 10https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: 10Jelto)
[17:08:24] <wikibugs>	 (03PS1) 10David Caro: dnsrecursor: fixes to the trigger script and some logs [puppet] - 10https://gerrit.wikimedia.org/r/844507
[17:13:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: fixes to the trigger script and some logs [puppet] - 10https://gerrit.wikimedia.org/r/844507 (owner: 10David Caro)
[17:15:35] <wikibugs>	 (03PS1) 10David Caro: dnsrecursor: ignore projects outside the default domain [puppet] - 10https://gerrit.wikimedia.org/r/844508
[17:15:46] <TheresNoTime>	 ("wrong" channel, I know, but...) Hello! I'd love to get https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/844083 (some prep work for T310974) backported and live on en.wiki in the backport window later — would anyone mind giving it a look-over and perhaps +2ing? It is a small patch which adds a `StatsdDataFactory() ->increment` if a page is marked as `NOINDEX` by the PageTriage extension :)
[17:15:46] <stashbot>	 T310974: Extend PageTriageMaxAge for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974
[17:17:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "I think this is the right solution for now. We aren't immediately planning to have magnum VMs get public IPs so it shouldn't matter." [puppet] - 10https://gerrit.wikimedia.org/r/844508 (owner: 10David Caro)
[17:18:00] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] dnsrecursor: ignore projects outside the default domain [puppet] - 10https://gerrit.wikimedia.org/r/844508 (owner: 10David Caro)
[17:18:02] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] dnsrecursor: fixes to the trigger script and some logs [puppet] - 10https://gerrit.wikimedia.org/r/844507 (owner: 10David Caro)
[17:21:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Cmjohnson)
[17:24:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318955)', diff saved to https://phabricator.wikimedia.org/P35604 and previous config saved to /var/cache/conftool/dbconfig/20221019-172438-ladsgroup.json
[17:24:44] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[17:33:34] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "I have 3 blubber builds based off of this base which all install python3-pip with a comment that says "# FIXME: should be in the base imag" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844319 (owner: 10Giuseppe Lavagetto)
[17:39:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P35605 and previous config saved to /var/cache/conftool/dbconfig/20221019-173945-ladsgroup.json
[17:40:20] <icinga-wm>	 RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:41:04] <icinga-wm>	 RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:46:51] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:49:48] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:54:25] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[17:54:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P35606 and previous config saved to /var/cache/conftool/dbconfig/20221019-175451-ladsgroup.json
[17:54:53] <wikibugs>	 (03PS1) 10Samtar: Hooks: Log to statsd when a page is noindex'd [extensions/PageTriage] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/844040 (https://phabricator.wikimedia.org/T310974)
[17:57:19] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[18:00:04] <jouncebot>	 hashar and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800).
[18:00:04] <jouncebot>	 hashar and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T1800).
[18:01:55] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[18:03:44] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[18:07:33] <wikibugs>	 (03PS1) 10BryanDavis: mono68: Remove expired DST Root CA X3 cert [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/844512 (https://phabricator.wikimedia.org/T311466)
[18:07:57] <mutante>	 !log aphlict1001 - manually gzip large logfile, logrotate did not run for a day - T321209
[18:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318955)', diff saved to https://phabricator.wikimedia.org/P35607 and previous config saved to /var/cache/conftool/dbconfig/20221019-180958-ladsgroup.json
[18:10:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[18:10:03] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[18:10:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[18:10:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T318955)', diff saved to https://phabricator.wikimedia.org/P35608 and previous config saved to /var/cache/conftool/dbconfig/20221019-181019-ladsgroup.json
[18:11:12] <wikibugs>	 (03CR) 10BryanDavis: "Test out a local build of the container with the test program from https://phabricator.wikimedia.org/T292289#7468035. That looks something" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/844512 (https://phabricator.wikimedia.org/T311466) (owner: 10BryanDavis)
[18:12:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318955)', diff saved to https://phabricator.wikimedia.org/P35609 and previous config saved to /var/cache/conftool/dbconfig/20221019-181232-ladsgroup.json
[18:27:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P35610 and previous config saved to /var/cache/conftool/dbconfig/20221019-182739-ladsgroup.json
[18:42:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P35611 and previous config saved to /var/cache/conftool/dbconfig/20221019-184245-ladsgroup.json
[18:44:13] <wikibugs>	 (03PS1) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - 10https://gerrit.wikimedia.org/r/844513
[18:50:33] <wikibugs>	 (03PS1) 10Dzahn: Revert "Revert "conftool-data: remove phabricator / git-ssh"" [puppet] - 10https://gerrit.wikimedia.org/r/844041
[18:54:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:55:47] <wikibugs>	 (03CR) 10Ahmon Dancy: docker_registry_ha: Require JWT to have ref_protected claim set to true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall)
[18:57:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318955)', diff saved to https://phabricator.wikimedia.org/P35612 and previous config saved to /var/cache/conftool/dbconfig/20221019-185752-ladsgroup.json
[18:57:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:57:58] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[18:58:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T318955)', diff saved to https://phabricator.wikimedia.org/P35613 and previous config saved to /var/cache/conftool/dbconfig/20221019-185813-ladsgroup.json
[19:00:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318955)', diff saved to https://phabricator.wikimedia.org/P35614 and previous config saved to /var/cache/conftool/dbconfig/20221019-190026-ladsgroup.json
[19:08:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:09:34] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s-etcd: partman cfg for aux-k8s-etcd [puppet] - 10https://gerrit.wikimedia.org/r/844514 (https://phabricator.wikimedia.org/T321134)
[19:10:29] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux-k8s-etcd: partman cfg for aux-k8s-etcd [puppet] - 10https://gerrit.wikimedia.org/r/844514 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway)
[19:10:52] <wikibugs>	 (03PS1) 10Dzahn: devtools: add profile::mediawiki::scap_client::is_master: true [puppet] - 10https://gerrit.wikimedia.org/r/844515
[19:11:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] devtools: add profile::mediawiki::scap_client::is_master: true [puppet] - 10https://gerrit.wikimedia.org/r/844515 (owner: 10Dzahn)
[19:13:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:15:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P35615 and previous config saved to /var/cache/conftool/dbconfig/20221019-191533-ladsgroup.json
[19:16:22] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:16:52] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:17:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "did not work. either changes are not pulled to local puppetmaster or for another reason this causes a duplicate declaration. added back in" [puppet] - 10https://gerrit.wikimedia.org/r/844515 (owner: 10Dzahn)
[19:17:44] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:22:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Ottomata) Haroon also needs kerberos access for this.  I just created a principal for him.  @HShaikh check your email and look for instructions.
[19:23:19] <wikibugs>	 (03PS1) 10Ottomata: hshaikh - set krb: present [puppet] - 10https://gerrit.wikimedia.org/r/844516 (https://phabricator.wikimedia.org/T321068)
[19:26:20] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] hshaikh - set krb: present [puppet] - 10https://gerrit.wikimedia.org/r/844516 (https://phabricator.wikimedia.org/T321068) (owner: 10Ottomata)
[19:30:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P35616 and previous config saved to /var/cache/conftool/dbconfig/20221019-193039-ladsgroup.json
[19:45:23] <wikibugs>	 10ops-codfw, 10DC-Ops: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T321237 (10RKemper)
[19:45:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318955)', diff saved to https://phabricator.wikimedia.org/P35617 and previous config saved to /var/cache/conftool/dbconfig/20221019-194546-ladsgroup.json
[19:45:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:45:51] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[19:46:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:46:44] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T321237 (10RKemper)
[19:49:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221019T2000).
[20:00:05] <jouncebot>	 TheresNoTime: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:16] * TheresNoTime will self-deploy :D
[20:00:22] <urbanecm>	 TheresNoTime: i guess you'll self-service?
[20:00:31] <TheresNoTime>	 indeed
[20:00:46] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T321237 (10RKemper)
[20:01:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/844040 (https://phabricator.wikimedia.org/T310974) (owner: 10Samtar)
[20:01:58] <wikibugs>	 (03CR) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall)
[20:03:35] <wikibugs>	 (03PS2) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - 10https://gerrit.wikimedia.org/r/844513
[20:04:16] <wikibugs>	 (03CR) 10Dduvall: docker_registry_ha: Require JWT to have ref_protected claim set to true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall)
[20:07:23] <wikibugs>	 (03Merged) 10jenkins-bot: Hooks: Log to statsd when a page is noindex'd [extensions/PageTriage] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/844040 (https://phabricator.wikimedia.org/T310974) (owner: 10Samtar)
[20:07:54] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]]
[20:07:59] <stashbot>	 T310974: Extend PageTriageMaxAge for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974
[20:08:20] <logmsgbot>	 !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:08:50] <wikibugs>	 (03PS5) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[20:08:54] * TheresNoTime is testing
[20:09:21] <wikibugs>	 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T320817 (10wiki_willy) a:03Papaul
[20:10:55] <TheresNoTime>	 woo, works
[20:11:46] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Revert "Revert "conftool-data: remove phabricator / git-ssh"" [puppet] - 10https://gerrit.wikimedia.org/r/844041 (owner: 10Dzahn)
[20:12:14] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn)
[20:14:51] <wikibugs_>	 (03CR) 10Dzahn: [C: 03+2] remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn)
[20:15:07] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]] (duration: 07m 12s)
[20:15:17] <mutante>	 !log git-ssh service is being decom'ed - expect some temp pybal alerts 
[20:17:41] * TheresNoTime has finished their patch, will be around for a while if there's anything needing deploying
[20:18:24] <wikibugs>	 (03PS1) 10DDesouza: Broaden audience of Research Incentive Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333)
[20:18:38] <danisztls>	 hi
[20:19:05] <danisztls>	 TheresNoTime: can you merge a beta change? will add it to deployment page
[20:19:12] <TheresNoTime>	 danisztls: sure :)
[20:19:33] <wikibugs>	 (PS2) DDesouza: Broaden audience of Research Incentive Survey on enwiki [beta] [mediawiki-config] - https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333)
[20:21:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza)
[20:21:12] <danisztls>	 TheresNoTime: done
[20:21:36] <danisztls>	 it's 844522
[20:21:47] <TheresNoTime>	 merging now :)
[20:22:00] <wikibugs>	 (Merged) jenkins-bot: Broaden audience of Research Incentive Survey on enwiki [beta] [mediawiki-config] - https://gerrit.wikimedia.org/r/844522 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza)
[20:22:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:22:06] <danisztls>	 TheresNoTime: thank you!
[20:23:19] <TheresNoTime>	 danisztls: done :) and a `beta-code-update-eqiad` has just started so it should be live on beta in a few minutes
[20:24:14] <wikibugs>	 (PS1) Andrew Bogott: Keystone policy.yaml: allow anyone to get project info [puppet] - https://gerrit.wikimedia.org/r/844524
[20:26:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:26:54] <wikibugs>	 (PS6) Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[20:27:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:27:34] <mutante>	 !log lvs2010 - restarted pybal, removed git-ssh IP with ipvsadm
[20:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:20] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - https://gerrit.wikimedia.org/r/844513 (owner: Dduvall)
[20:30:07] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclients: make somewhat domain-aware [puppet] - https://gerrit.wikimedia.org/r/844525
[20:30:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:30:23] <mutante>	 !lvs2008 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.153.250:22' ; ipvsadm -Dt '[2620:0:860:ed1a::3:fa]:22' - T296022
[20:30:24] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[20:32:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:32:55] <wikibugs>	 (CR) Andrew Bogott: [C: +2] mwopenstackclients: make somewhat domain-aware [puppet] - https://gerrit.wikimedia.org/r/844525 (owner: Andrew Bogott)
[20:33:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Keystone policy.yaml: allow anyone to get project info [puppet] - https://gerrit.wikimedia.org/r/844524 (owner: Andrew Bogott)
[20:33:11] <TheresNoTime>	 !log closing UTC late backport window
[20:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:18] <mutante>	 !lvs1020, lvs1018 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.154.250:22' ; ipvsadm -Dt '[2620:0:861:ed1a::3:16]:22' - T296022
[20:36:40] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclient: remove stray print() [puppet] - https://gerrit.wikimedia.org/r/844526
[20:37:22] <wikibugs>	 (PS7) Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[20:37:29] <wikibugs>	 (CR) Andrew Bogott: [C: +2] mwopenstackclient: remove stray print() [puppet] - https://gerrit.wikimedia.org/r/844526 (owner: Andrew Bogott)
[20:38:24] <mutante>	 !log puppetmaster1001/puppetmaster2001 - delete .git-*.err files in /var/run/confd-template T296022
[20:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:29] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[20:42:06] <jinxer-wm>	 (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:42:30] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:43:06] <mutante>	 jinxer-wm: :)
[20:43:48] <wikibugs>	 SRE, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (Jclark-ctr) Drive will arrive tomorrow. Can it be swapped when it arrives or will it need to be scheduled?
[20:44:55] <mutante>	 !log lvs1020, lvs1018 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.154.250:22' ; ipvsadm -Dt '[2620:0:861:ed1a::3:16]:22' - T296022
[20:45:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:01] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[20:45:59] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (BCornwall) Open→Stalled @Vgutierrez Now that HAProxy is used for TLS termination, can this safely be closed?
[20:46:07] <wikibugs>	 SRE, Performance-Team, Traffic, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (BCornwall)
[20:49:35] <mutante>	 !log lvs2010, lvs2008 - systemctl restart pybal.service ; ipvsadm -Dt '208.80.153.250:22' ; ipvsadm -Dt '[2620:0:860:ed1a::3:fa]:22' - T296022
[20:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:03] <wikibugs>	 SRE, Performance-Team, Traffic, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (BCornwall)
[20:52:18] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (BCornwall) Stalled→Invalid Closing as per bblack's recommendation.
[20:54:51] <wikibugs>	 (PS1) JHathaway: add srv record for aux-k8s-etcd cluster [dns] - https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134)
[20:55:43] <wikibugs>	 (CR) CI reject: [V: -1] add srv record for aux-k8s-etcd cluster [dns] - https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134) (owner: JHathaway)
[21:03:34] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1002.eqiad.wmnet
[21:03:43] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[21:04:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[21:04:54] <wikibugs>	 (PS8) Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[21:06:45] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:06:45] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1002.eqiad.wmnet on all recursors
[21:06:48] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1002.eqiad.wmnet on all recursors
[21:11:40] <wikibugs>	 SRE-swift-storage, Community-Tech, MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (Eevans) To summarize what was discussed during a Data Persistence meeting earlier today (in no particular order):  * The total (ex...
[21:23:53] <wikibugs>	 ops-codfw, decommission-hardware, Discovery-Search (Current work): decommission elastic20[25-36].codfw.wmnet - https://phabricator.wikimedia.org/T321243 (bking)
[21:24:27] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[21:30:22] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1002.eqiad.wmnet
[21:30:38] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1003.eqiad.wmnet
[21:30:39] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[21:36:56] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:36:56] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1003.eqiad.wmnet on all recursors
[21:36:59] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1003.eqiad.wmnet on all recursors
[22:00:36] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1003.eqiad.wmnet
[22:06:50] <wikibugs>	 (PS1) JHathaway: aux-k8s-etcd: mac addresses for aux-k8s-etcd100{2,3} [puppet] - https://gerrit.wikimedia.org/r/844535 (https://phabricator.wikimedia.org/T321134)
[22:08:58] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s-etcd: mac addresses for aux-k8s-etcd100{2,3} [puppet] - https://gerrit.wikimedia.org/r/844535 (https://phabricator.wikimedia.org/T321134) (owner: JHathaway)
[22:18:19] <wikibugs>	 (PS2) JHathaway: add srv record for aux-k8s-etcd cluster [dns] - https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134)
[22:21:13] <wikibugs>	 (CR) JHathaway: [C: +2] add srv record for aux-k8s-etcd cluster [dns] - https://gerrit.wikimedia.org/r/844530 (https://phabricator.wikimedia.org/T321134) (owner: JHathaway)
[22:26:09] <wikibugs>	 (CR) Dzahn: [C: +1] "meanwhile also phab1004 is configured to only have read-only DB access. now we must explicitly switch the active phab server in Hiera to m" [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/832282 (https://phabricator.wikimedia.org/T313954) (owner: Dduvall)
[22:44:26] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:49:45] <wikibugs>	 ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T320817 (Papaul) Open→Resolved There is nothing connected to port xe-7/0/9 on fpc7 row B
[23:10:11] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] Disable wgParserEnableLegacyMediaDOM on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)