[00:00:25] <wikibugs>	 (03Merged) 10jenkins-bot: php74: add many TTF fonts [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435) (owner: 10BryanDavis)
[00:13:44] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] lower TTL for phabricator from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn)
[00:14:09] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn)
[00:23:00] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:35:05] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@13687ed]: More minor updates
[00:35:35] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@13687ed]: More minor updates (duration: 00m 30s)
[00:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:50:57] <wikibugs>	 (03PS1) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[00:51:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson)
[00:55:13] <wikibugs>	 (03CR) 10Jdlrobson: "Hey James, Reedy and Tyler" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson)
[00:58:08] <wikibugs>	 (03CR) 10Jdlrobson: Automate icon generation (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson)
[01:03:06] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@5cd2243]: Minor fixes
[01:03:18] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@5cd2243]: Minor fixes (duration: 00m 12s)
[01:12:28] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@dc380cb]: Update jQuery
[01:12:39] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@dc380cb]: Update jQuery (duration: 00m 11s)
[01:24:12] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:52:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:54:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:56:48] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, phab1004, releases1002, releases2002, stat1004, stat1005, stat1007, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:35:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:40:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:46:19] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) a:03Joe
[05:55:40] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Performance Issue: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10Joe) 05Open→03Resolved a:03Joe Tentatively resolving because we've moved past php 7.2 and we seem to have reverted the php 7.2-only st...
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T0600).
[06:04:33] <wikibugs>	 10SRE, 10MediaWiki-Parser, 10serviceops-radar, 10Performance-Team (Radar): purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Joe)
[06:11:42] <wikibugs>	 (03PS3) 10Elukey: role::kafka::logging: final clean up after migrating to PKI [puppet] - 10https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130)
[06:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[06:24:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 22616
[06:24:07] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37465/console" [puppet] - 10https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey)
[06:24:39] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 22616
[06:25:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 6079
[06:26:59] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 6079
[06:27:26] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove php 7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/839324
[06:30:44] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:46:27] <wikibugs>	 (03PS2) 10Muehlenhoff: wmcs::kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838831 (https://phabricator.wikimedia.org/T308013)
[06:54:16] <wikibugs>	 (03CR) 10Muehlenhoff: "One comment inline, rest looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan)
[06:54:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] wmcs::kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838831 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[06:55:00] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] wmcs::metricsinfra: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838832 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[06:57:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1012.eqiad.wmnet with OS bullseye
[06:57:56] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1012.eqiad.wmnet with OS bullseye
[06:58:40] <wikibugs>	 (03PS3) 10Muehlenhoff: swift: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838830 (https://phabricator.wikimedia.org/T308013)
[06:59:50] <wikibugs>	 (03PS2) 10Muehlenhoff: bgpalerter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837070
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T0700).
[07:00:12] <apergos>	 morning! there are no trainees signed up for the window and no deployments on the calendar for the window either. 
[07:05:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] swift: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838830 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[07:06:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] bgpalerter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837070 (owner: 10Muehlenhoff)
[07:11:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1012.eqiad.wmnet with reason: host reimage
[07:14:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1012.eqiad.wmnet with reason: host reimage
[07:15:31] <moritzm>	 !log draining ganeti1005 T311687
[07:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:35] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[07:30:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1012.eqiad.wmnet with OS bullseye
[07:30:37] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1012.eqiad.wmnet with OS bullseye completed: - ganeti1012 (**PASS**)   - Downtimed on...
[07:36:40] <moritzm>	 !log draining ganeti1026 T311687
[07:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:45] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[07:42:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet
[07:47:12] <wikibugs>	 (03PS1) 10Cathal Mooney: Depool esams in gdns prior to reboot of line card [dns] - 10https://gerrit.wikimedia.org/r/839396 (https://phabricator.wikimedia.org/T318783)
[07:48:59] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Depool esams in gdns prior to reboot of line card [dns] - 10https://gerrit.wikimedia.org/r/839396 (https://phabricator.wikimedia.org/T318783) (owner: 10Cathal Mooney)
[07:49:22] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Depool esams in gdns prior to reboot of line card [dns] - 10https://gerrit.wikimedia.org/r/839396 (https://phabricator.wikimedia.org/T318783) (owner: 10Cathal Mooney)
[07:50:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet
[07:50:11] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::logging: final clean up after migrating to PKI [puppet] - 10https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey)
[07:50:17] <topranks>	 !log De-pooling esams in advance of cr2-esams line card reboot
[07:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Aklapper) @Arnoldokoth: This isn't resolved yet, see https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group
[07:56:07] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Aklapper) @Arnoldokoth: This isn't resolved yet, see https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group
[07:57:44] <wikibugs>	 (03PS1) 10KartikMistry: ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156)
[07:59:28] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10SLyngshede-WMF) a:03SLyngshede-WMF
[08:00:08] <elukey>	 !log delete /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.jks on kafka-logging1001 and restart (old puppet cert + settings deleted)
[08:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:36] <wikibugs>	 (03CR) 10KartikMistry: "If I've understood correctly, 89% is OK when task says by 10% stricter (default is 99%)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: 10KartikMistry)
[08:01:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1012.eqiad.wmnet to cluster eqiad and group C
[08:05:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[08:06:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the review! This will self-deploy on puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[08:06:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Remove ATS 8-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/838886 (owner: 10BCornwall)
[08:07:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline, LGTM overall!" [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[08:09:03] <elukey>	 !log kafka logging old cert cleanup - `cumin 'A:kafka-logging' 'rm -f /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.jks'`
[08:09:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for gdal [puppet] - 10https://gerrit.wikimedia.org/r/838842 (owner: 10Muehlenhoff)
[08:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I think we should be fine even if some exporter restarts" [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:10:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad
[08:10:29] <elukey>	 !log restart kafka on kafka-logging1002 to reload the config (cleanup old super.users related to past keystore)
[08:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:11] <wikibugs>	 10SRE, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (10MatthewVernon)
[08:12:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: confd: export template status as Prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi)
[08:12:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[08:12:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:01] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 244, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:41] <topranks>	 ^^ Arelion circuit IC-314533, carrier maintenance it seems, ref PWIC223124.
[08:20:51] <topranks>	 I'm proceeding with reboot of cr2-esams line card
[08:21:28] <wikibugs>	 (03CR) 10Elukey: "Hi Ben! One thing that may be good to do is to split the Docker file into multiple ones, see what it has been done for istio for example. " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[08:21:34] <topranks>	 !log disabling external BGP sessions on cr2-esams prior to line card reboot
[08:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:29] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:24:22] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: line card reboot
[08:24:29] <wikibugs>	 (03CR) 10Elukey: "This operator image may go under the same spark namespace, if we had it, to find all docker images in one place (see comments on the spark" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[08:24:36] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: line card reboot
[08:25:04] <topranks>	 !log disabling OSPF on cr2-esams
[08:25:17] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838829 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:25:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1029 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/838840 (https://phabricator.wikimedia.org/T299459) (owner: 10Muehlenhoff)
[08:26:49] <icinga-wm>	 PROBLEM - Host lvs3006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:26:49] <icinga-wm>	 PROBLEM - Host lvs3005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:05] <icinga-wm>	 PROBLEM - Host lvs3007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:09] <icinga-wm>	 PROBLEM - Host prometheus3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:37] <icinga-wm>	 PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:26] <volans>	 topranks: expected!?!?!?
[08:28:37] <icinga-wm>	 PROBLEM - Host dns3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:41] <icinga-wm>	 PROBLEM - Host bast3005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:41] <icinga-wm>	 PROBLEM - Host durum3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] <icinga-wm>	 PROBLEM - Host cp3050 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] <icinga-wm>	 PROBLEM - Host cp3051 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] <icinga-wm>	 PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] <icinga-wm>	 PROBLEM - Host cp3062 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] <icinga-wm>	 PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:44] <icinga-wm>	 PROBLEM - Host cp3052 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:44] <icinga-wm>	 PROBLEM - Host cp3056 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:45] <icinga-wm>	 PROBLEM - Host cp3058 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:45] <icinga-wm>	 PROBLEM - Host cp3059 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:46] <icinga-wm>	 PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:46] <icinga-wm>	 PROBLEM - Host cp3063 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:47] <icinga-wm>	 PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:47] <icinga-wm>	 PROBLEM - Host cp3064 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:48] <icinga-wm>	 PROBLEM - Host ganeti3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:48] <icinga-wm>	 PROBLEM - Host cp3060 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:49] <icinga-wm>	 PROBLEM - Host ganeti3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:49] <icinga-wm>	 PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:50] <icinga-wm>	 PROBLEM - Host dns3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:50] <icinga-wm>	 PROBLEM - Host doh3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:51] <icinga-wm>	 PROBLEM - Host durum3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:51] <icinga-wm>	 PROBLEM - Host ncredir3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:51] <icinga-wm>	 PROBLEM - Host doh3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:53] <icinga-wm>	 PROBLEM - Host ganeti3003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:57] <icinga-wm>	 PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:57] * volans preparing depool patch
[08:28:57] <icinga-wm>	 PROBLEM - Host install3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:59] <topranks>	 No... some disruption is not unexpected but OSPF should converge quickly
[08:29:13] <topranks>	 Traffic from US getting to cr3-esams as expected but not getting next-hop / reply
[08:29:37] <icinga-wm>	 PROBLEM - Host ncredir3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:37] <icinga-wm>	 PROBLEM - Host netflow3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:46] <elukey>	 topranks: o/ I was about to ask, I have trouble ssh-ing to a bastion via init7 -> cr3-esams
[08:30:02] <volans>	 ah it's already depooled, you got me for a sec
[08:30:05] <jouncebot>	 hoo: gettimeofday() says it's time for Wikibase client unexpectedUnconnectedPage page prop format conversion. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T0830)
[08:30:05] <elukey>	 (bast3005 then I tried bast1003)
[08:30:05] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect -
[08:30:05] <icinga-wm>	 , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:30:23] <icinga-wm>	 PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:31] <wikibugs>	 (03PS1) 10Muehlenhoff: eventlogging: Update includes to current styleguide [puppet] - 10https://gerrit.wikimedia.org/r/839428
[08:30:37] <icinga-wm>	 PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:37] <icinga-wm>	 PROBLEM - Host ripe-atlas-esams is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:38] <icinga-wm>	 PROBLEM - Host ripe-atlas-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:38] <icinga-wm>	 PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:45] <topranks>	 Something odd happening, connection to the OOB has dropped on me.
[08:30:50] * topranks investigating
[08:30:59] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:31:31] <volans>	 topranks: can I help? do you need an incident doc? do we need to do something about ns2.w.o?
[08:32:45] <icinga-wm>	 RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 107.04 ms
[08:32:45] <icinga-wm>	 RECOVERY - Host cp3055 is UP: PING OK - Packet loss = 0%, RTA = 107.13 ms
[08:32:45] <icinga-wm>	 RECOVERY - Host cp3050 is UP: PING OK - Packet loss = 0%, RTA = 107.16 ms
[08:32:45] <icinga-wm>	 RECOVERY - Host cp3059 is UP: PING OK - Packet loss = 0%, RTA = 107.03 ms
[08:32:45] <icinga-wm>	 RECOVERY - Host cp3051 is UP: PING OK - Packet loss = 0%, RTA = 107.40 ms
[08:32:46] <icinga-wm>	 RECOVERY - Host cp3064 is UP: PING OK - Packet loss = 0%, RTA = 107.04 ms
[08:32:46] <icinga-wm>	 RECOVERY - Host cp3058 is UP: PING OK - Packet loss = 0%, RTA = 107.47 ms
[08:32:47] <icinga-wm>	 RECOVERY - Host cp3056 is UP: PING OK - Packet loss = 0%, RTA = 106.98 ms
[08:32:47] <icinga-wm>	 RECOVERY - Host cp3052 is UP: PING OK - Packet loss = 0%, RTA = 107.05 ms
[08:32:48] <icinga-wm>	 RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 107.42 ms
[08:32:48] <icinga-wm>	 RECOVERY - Host cp3060 is UP: PING OK - Packet loss = 0%, RTA = 107.07 ms
[08:32:49] <icinga-wm>	 RECOVERY - Host cp3057 is UP: PING OK - Packet loss = 0%, RTA = 107.14 ms
[08:32:49] <icinga-wm>	 RECOVERY - Host cp3063 is UP: PING OK - Packet loss = 0%, RTA = 107.07 ms
[08:32:49] <jinxer-wm>	 (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:32:49] <icinga-wm>	 RECOVERY - Host lvs3007 is UP: PING OK - Packet loss = 0%, RTA = 107.07 ms
[08:32:50] <icinga-wm>	 RECOVERY - Host cp3061 is UP: PING OK - Packet loss = 0%, RTA = 107.21 ms
[08:32:51] <icinga-wm>	 RECOVERY - Host cp3065 is UP: PING OK - Packet loss = 0%, RTA = 107.06 ms
[08:32:51] <icinga-wm>	 RECOVERY - Host cp3062 is UP: PING OK - Packet loss = 0%, RTA = 107.06 ms
[08:32:52] <icinga-wm>	 RECOVERY - Host durum3002 is UP: PING OK - Packet loss = 0%, RTA = 107.44 ms
[08:32:52] <icinga-wm>	 RECOVERY - Host ncredir3001 is UP: PING OK - Packet loss = 0%, RTA = 107.43 ms
[08:32:53] <icinga-wm>	 RECOVERY - Host lvs3005 is UP: PING OK - Packet loss = 0%, RTA = 107.12 ms
[08:32:53] <icinga-wm>	 RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 107.50 ms
[08:32:54] <icinga-wm>	 RECOVERY - Host ping3002 is UP: PING WARNING - Packet loss = 80%, RTA = 862.40 ms
[08:32:54] <icinga-wm>	 RECOVERY - Host dns3002 is UP: PING OK - Packet loss = 0%, RTA = 107.26 ms
[08:32:55] <icinga-wm>	 RECOVERY - Host doh3001 is UP: PING OK - Packet loss = 0%, RTA = 107.41 ms
[08:32:55] <icinga-wm>	 RECOVERY - Host doh3002 is UP: PING OK - Packet loss = 0%, RTA = 107.43 ms
[08:32:55] <icinga-wm>	 RECOVERY - Host ganeti3003 is UP: PING OK - Packet loss = 0%, RTA = 107.35 ms
[08:32:56] <icinga-wm>	 RECOVERY - Host lvs3006 is UP: PING OK - Packet loss = 0%, RTA = 107.64 ms
[08:32:59] <icinga-wm>	 RECOVERY - Host ncredir3002 is UP: PING OK - Packet loss = 0%, RTA = 107.48 ms
[08:33:01] <icinga-wm>	 RECOVERY - Host bast3005 is UP: PING OK - Packet loss = 0%, RTA = 107.48 ms
[08:33:05] <icinga-wm>	 RECOVERY - Host ganeti3002 is UP: PING OK - Packet loss = 0%, RTA = 107.15 ms
[08:33:05] <icinga-wm>	 RECOVERY - Host install3001 is UP: PING OK - Packet loss = 0%, RTA = 107.65 ms
[08:33:09] <icinga-wm>	 RECOVERY - Host netflow3002 is UP: PING OK - Packet loss = 0%, RTA = 107.29 ms
[08:33:13] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 19, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:33:19] <icinga-wm>	 RECOVERY - Host prometheus3001 is UP: PING OK - Packet loss = 0%, RTA = 107.35 ms
[08:33:21] <icinga-wm>	 RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 107.15 ms
[08:33:29] <icinga-wm>	 RECOVERY - Host durum3001 is UP: PING OK - Packet loss = 0%, RTA = 107.74 ms
[08:33:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:33] <icinga-wm>	 RECOVERY - Host ganeti3001 is UP: PING OK - Packet loss = 0%, RTA = 107.08 ms
[08:34:02] <jynus>	 there was a small increase in 5XX and NELs
[08:34:16] <jynus>	 https://grafana.wikimedia.org/goto/EfwM6U44z?orgId=1
[08:35:07] <volans>	 yep
[08:35:11] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:35:29] <icinga-wm>	 RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 107.15 ms
[08:35:47] <icinga-wm>	 RECOVERY - Host ripe-atlas-esams is UP: PING OK - Packet loss = 0%, RTA = 107.27 ms
[08:35:47] <icinga-wm>	 RECOVERY - Host ripe-atlas-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 107.44 ms
[08:35:47] <icinga-wm>	 RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 107.03 ms
[08:41:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 245, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:41:03] <topranks>	 volans: I should have done VRRP first; skipping that meant it took longer to flip over than it ought to.
[08:41:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Slst2020) 05Open→03Resolved a:03Slst2020 Thank you, closing now!
[08:41:03] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:41:03] <moritzm>	 !log installing puma security updates
[08:41:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:03] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 216 probes of 634 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:41:03] <volans>	 jynus: the interesting part is that I can't see that increase in logstash
[08:41:03] <volans>	 https://logstash.wikimedia.org/app/dashboards#/view/ee6432c0-82a9-11eb-9d45-739221ba7fb6?_g=h@42b0d52&_a=h@c3f9414
[08:41:03] <volans>	 for NEL
[08:48:30] <volans>	 as for 5xx... still looking
[08:48:50] <volans>	 the one in the grafana home uses varnish_requests
[08:48:57] <volans>	 with method!="PURGE", status=~"5.."
[08:50:11] <wikibugs>	 (03CR) 10Btullis: Add a spark-operator production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[08:50:15] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:51:28] <XioNoX>	 most of the NEL errors are from one Russian ISP, my guess is that they don't respect the DNS TTL, and were still going to esams
[08:51:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:51:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:51:51] <XioNoX>	 maybe a DNS MITM or some such
[08:52:24] <volans>	 5xx seems that we just got some 502s on esams itself
[08:52:25] <volans>	 https://grafana-rw.wikimedia.org/d/000000464/varnish-aggregate-client-status-code?orgId=1&from=now-1h&to=now&var-site=codfw&var-site=drmrs&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST&viewPanel=2
[08:52:35] <topranks>	 XioNoX: yes I'm inclined to agree, they are for the IP address of upload-lb.esams.wikimedia.org., CNAME should have moved them to drmrs
[08:52:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:52:42] <volans>	 (compare with just selecting esams, the others don't affect the 502 spike) but it is small
[08:52:43] <wikibugs>	 (03PS2) 10JMeybohm: k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943)
[08:52:49] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:53:15] <topranks>	 volans: yes, makes sense in my head. I think I will proceed with the reboot and then set things back to normal.
[08:53:44] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for three wikis (duration: 04m 03s)
[08:53:48] <volans>	 topranks: ack
[08:54:45] <topranks>	 !log rebooting line card fpc 0 on cr2-esams (T318783)
[08:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:49] <stashbot>	 T318783: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783
[08:55:13] <hoo>	 Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for ruwiktionary
[08:55:26] <wikibugs>	 (03PS1) 10Muehlenhoff: docker: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839447
[08:56:49] <hoo>	 Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for specieswiki
[08:56:51] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:58:37] <wikibugs>	 (03PS1) 10Muehlenhoff: labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449
[08:58:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:58:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:58:55] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:59:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449 (owner: 10Muehlenhoff)
[08:59:23] <_joe_>	 !log uploaded new php 7.4 packages T318918
[08:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:00:19] <hoo>	 !log Ran extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for ruwiktionary
[09:00:28] <hoo>	 !log Ran extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for specieswiki
[09:01:06] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for cebwiki
[09:01:27] <wikibugs>	 (03PS2) 10Muehlenhoff: labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449
[09:03:40] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 84 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37467/console" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[09:04:21] <stashbot>	 T318918: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918
[09:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:55] <topranks>	 !log re-pooling esams after cr2-esams line card reboot
[09:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:15] <wikibugs>	 (03PS1) 10Cathal Mooney: Revert "Depool esams in gdns prior to reboot of line card" [dns] - 10https://gerrit.wikimedia.org/r/839022
[09:06:28] <wikibugs>	 (03CR) 10Btullis: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[09:06:57] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:37] <wikibugs>	 (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839454
[09:08:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) Reboot completed sucessfully, currently router not showing any alarms: ` root@re0.cr2-esams> show system alarms                                       No alarms currently active `...
[09:11:34] <wikibugs>	 (03PS3) 10Muehlenhoff: prometheus: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013)
[09:12:13] <wikibugs>	 (03CR) 10Hashar: [C: 04-2] "Gerrit 3.4.6 has been released and includes my patch to add a public getter \o/  I will get our instance upgraded via T319513." [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[09:13:09] <wikibugs>	 (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839454 (owner: 10Hoo man)
[09:14:00] <wikibugs>	 (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839454 (owner: 10Hoo man)
[09:15:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool esams in gdns prior to reboot of line card" [dns] - 10https://gerrit.wikimedia.org/r/839022 (owner: 10Cathal Mooney)
[09:16:24] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data Pipelines, 10Foundational Technology Requests, 10User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10fgiunchedi) I have resumed work on this a little bit and produced a worked example using...
[09:17:03] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:25] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis (duration: 03m 41s)
[09:18:44] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 04-1] "We still have buster hosts with kubernetes-client installed. I'll create the versioned components there as well and copy the packages so w" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[09:19:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:20:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:20:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:20:54] <wikibugs>	 (03CR) 10Muehlenhoff: prometheus: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:20:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] prometheus: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:21:34] <_joe_>	 !log installed the upgraded php package to mw1414, T318918
[09:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:38] <stashbot>	 T318918: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918
[09:21:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:21:50] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943)
[09:21:52] <wikibugs>	 (03PS1) 10JMeybohm: aptrepo: Create versioned kubernets components for buster [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943)
[09:22:03] <wikibugs>	 (03PS1) 10Jelto: buildkit: add no_proxy for wmf domains [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271)
[09:22:05] <icinga-wm>	 PROBLEM - Check systemd state on mw2368 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:17] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for nlwiktionary, ruwiki, jawiki
[09:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:28] <wikibugs>	 (03PS2) 10JMeybohm: aptrepo: Create versioned kubernetes components for buster [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943)
[09:22:30] <wikibugs>	 (03PS4) 10JMeybohm: k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943)
[09:23:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] wmcs::metricsinfra: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838832 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:23:57] <icinga-wm>	 RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:24:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2368 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:17] <_joe_>	 not sure what happened there tbh
[09:27:17] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37468/console" [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto)
[09:28:10] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for viwiki, metawiki, frwiktionary
[09:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:47] <icinga-wm>	 RECOVERY - Check systemd state on mw2290 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[09:31:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) >>! In T319184#8288137, @cmooney wrote: > [..] > Anyway thought I'd mention just in case you weren't aware.    Thanks, double checking this now....
[09:32:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] aptrepo: Create versioned kubernetes components for buster [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[09:32:21] <moritzm>	 !log installing python-oslo.utils security updates
[09:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:40] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "Like discussed in yesterdays troubleshooting session." [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto)
[09:34:07] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet
[09:35:37] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/839447 (owner: 10Muehlenhoff)
[09:39:35] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:40:49] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 7 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37470/console" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[09:41:30] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1005.eqiad.wmnet
[09:44:50] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "Looks like releases and contint don't use packages_from_future. Not ideal but fine for now. We will have to refactor the version selection" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[09:48:03] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] buildkit: add no_proxy for wmf domains [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto)
[09:48:29] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10SLyngshede-WMF) In spicerack we'll add a "skip_acked=False" to the wait_for_optimal and "acked" properties to HostStatus and HostsStatus datatypes.  When skip_a...
[09:51:25] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:52:35] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for itwiki, arzwiki, ptwiki
[09:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:09] <icinga-wm>	 RECOVERY - Check systemd state on mw1426 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:45] <wikibugs>	 (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839468
[09:56:11] <icinga-wm>	 RECOVERY - Check systemd state on mw1434 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:26] <wikibugs>	 (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839468 (owner: 10Hoo man)
[09:56:45] <icinga-wm>	 RECOVERY - Check systemd state on mw1446 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2319 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:11] <wikibugs>	 (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839468 (owner: 10Hoo man)
[09:57:19] <moritzm>	 !log installing glib2.0 security updates on buster
[09:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1000).
[10:00:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jmads out of all services on: 1213 hosts
[10:01:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jmads out of all services on: 1213 hosts
[10:02:00] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis (duration: 03m 39s)
[10:02:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:03:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:03:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:03:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:05:11] <icinga-wm>	 RECOVERY - Disk space on moscovium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=moscovium&var-datasource=eqiad+prometheus/ops
[10:05:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jmads out of all services on: 799 hosts
[10:06:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jmads out of all services on: 799 hosts
[10:06:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging NOkafor out of all services on: 799 hosts
[10:07:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging NOkafor out of all services on: 799 hosts
[10:07:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging NOkafor out of all services on: 1213 hosts
[10:07:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging NOkafor out of all services on: 1213 hosts
[10:11:52] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all remaining wikis
[10:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:07] <moritzm>	 !log installing ruby-rack security updates
[10:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) Plan of action: General overview before/after. Red: deactivated/removed. Green: activated/added. {F35550079}  We're...
[10:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[10:23:38] <wikibugs>	 (03PS1) 10Volans: sre.dns.wipe-cache: add sudo to the command [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840)
[10:26:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans)
[10:30:43] <elukey>	 !log restart kafka on kafka-logging1003 to reload the config (cleanup old super.users related to past keystore)
[10:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[10:37:42] <wikibugs>	 10SRE, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (10jbond) While implementing the the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/768723/31/modules/varnish/templates/upload-fr...
[10:40:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans)
[10:40:45] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:41:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[10:45:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/839428 (owner: 10Muehlenhoff)
[10:47:01] <icinga-wm>	 PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1005 is CRITICAL: CRITICAL: no netns defined? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:49:23] <wikibugs>	 (03PS5) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217)
[10:49:45] <wikibugs>	 (03CR) 10Jbond: C:postgress::server: add replication slot support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[10:50:06] <wikibugs>	 (03PS6) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217)
[10:51:39] <_joe_>	 !log installing the upgraded php package everywhere, T318918
[10:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:44] <stashbot>	 T318918: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918
[10:52:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 8 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37471/console" [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[10:53:34] <wikibugs>	 (03PS4) 10Jbond: O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217)
[10:53:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[10:57:56] <wikibugs>	 (03PS1) 10Urbanecm: eswiki: Enable Growth mentorship for 25% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839485 (https://phabricator.wikimedia.org/T285235)
[10:58:13] <jbond>	 !log disable puppet temporarily to deploy a puppetdb change 814824
[10:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:21] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37472/console" [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[11:00:50] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Allow partioning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748)
[11:01:08] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[11:02:51] <wikibugs>	 (03PS2) 10Vgutierrez: trafficserver: Allow partitiooning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748)
[11:03:43] <wikibugs>	 (03PS3) 10Vgutierrez: trafficserver: Allow partitiooning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748)
[11:04:43] <wikibugs>	 (03PS1) 10Jbond: P:postgresql:: master: correct sql statment [puppet] - 10https://gerrit.wikimedia.org/r/839488
[11:04:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] P:postgresql:: master: correct sql statment [puppet] - 10https://gerrit.wikimedia.org/r/839488 (owner: 10Jbond)
[11:05:04] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37473/console" [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[11:07:23] <wikibugs>	 (03PS1) 10Jbond: P:postgresql::master: use correct unless statment [puppet] - 10https://gerrit.wikimedia.org/r/839489
[11:07:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:postgresql::master: use correct unless statment [puppet] - 10https://gerrit.wikimedia.org/r/839489 (owner: 10Jbond)
[11:07:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] P:postgresql::master: use correct unless statment [puppet] - 10https://gerrit.wikimedia.org/r/839489 (owner: 10Jbond)
[11:08:13] <wikibugs>	 (03PS4) 10Vgutierrez: trafficserver: Allow partitioning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748)
[11:08:15] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Enable cache partitioning in cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/839490 (https://phabricator.wikimedia.org/T317748)
[11:12:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524)
[11:13:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez)
[11:13:25] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524)
[11:14:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez)
[11:14:53] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524)
[11:15:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez)
[11:15:53] <wikibugs>	 (03PS1) 10Jbond: P:puppetdb: correct slot name on master [puppet] - 10https://gerrit.wikimedia.org/r/839494
[11:16:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetdb: correct slot name on master [puppet] - 10https://gerrit.wikimedia.org/r/839494 (owner: 10Jbond)
[11:16:48] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524)
[11:18:51] <icinga-wm>	 PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:21:19] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) 05Open→03Resolved
[11:22:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1004.eqiad.wmnet
[11:27:32] <btullis>	 !log cold-reset the BMC on analytics1076
[11:27:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:40] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[11:27:45] <jbond>	 !log switch puppetdb replication to use replication slots
[11:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:08] <jbond>	 !log enable puppet post deploy puppetdb change 814824
[11:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:57] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:29:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Stop assigning the PHP_ENGINE cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839499 (https://phabricator.wikimedia.org/T271736)
[11:30:11] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe)
[11:30:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:postgress::server: add replication slot support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[11:31:30] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond)
[11:31:34] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb postgress: Improve postgress standby server - https://phabricator.wikimedia.org/T313217 (10jbond) 05Open→03Resolved a:03jbond puppetdb has now been migrated to use replication slots
[11:31:39] <wikibugs>	 (03PS1) 10Matthias Mullie: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883)
[11:32:25] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:32:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1004.eqiad.wmnet
[11:33:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez)
[11:34:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/37474/" [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez)
[11:35:46] <wikibugs>	 (03PS8) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116)
[11:44:56] <wikibugs>	 10SRE, 10conftool: Add support to use different vsthrottle keys - https://phabricator.wikimedia.org/T319533 (10jbond)
[11:45:03] <wikibugs>	 10SRE, 10conftool: Add support to use different vsthrottle keys - https://phabricator.wikimedia.org/T319533 (10jbond) p:05Triage→03Medium
[11:55:15] <icinga-wm>	 PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1005 is CRITICAL: CRITICAL: no netns defined? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[12:28:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[12:32:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[12:32:04] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL:
[12:32:04] <icinga-wm>	 Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:32:53] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Add sessioncookie bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324)
[12:32:54] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff)
[12:34:53] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:36:50] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:39:00] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:40:38] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons.
[12:42:59] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:43:27] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:45:22] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1029.eqiad.wmnet
[12:47:01] <wikibugs>	 (03PS3) 10Hashar: Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513)
[12:47:37] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "lookup for http_proxy fields returns empty string. Added some comments in-line." [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall)
[12:52:27] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL:
[12:52:27] <icinga-wm>	 Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:52:27] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL:
[12:52:28] <icinga-wm>	 Test Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:52:29] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL:
[12:52:30] <icinga-wm>	 Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:52:31] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL:
[12:52:32] <icinga-wm>	 Test Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:52:55] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Lucas_Werkmeister_WMDE) 05Resolved→03Open As far as I can tell, this is done in production (thanks Joe!), but not yet in CI – a change I just...
[12:53:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero)
[12:54:00] <wikibugs>	 (03PS1) 10Hashar: Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513)
[12:54:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[12:54:40] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1006.eqiad.wmnet
[12:56:22] <wikibugs>	 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) 05Open→03Resolved a:03elukey The kafka logging clusters have the new PKI configurati...
[12:56:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1026.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage
[12:56:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1026.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage
[12:58:14] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:59:01] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[12:59:29] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:59:41] <wikibugs>	 10SRE, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Consider adding X-Analytics subfield for 'has a session cookie' - https://phabricator.wikimedia.org/T319324 (10Vgutierrez)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1300). Please do the needful.
[13:00:05] <jouncebot>	 stephanebisson and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <Lucas_WMDE>	 o/
[13:00:25] <urbanecm>	 o/
[13:00:26] <stephanebisson>	 Hello
[13:00:27] <urbanecm>	 I can deploy today
[13:00:29] <hueitan>	 hello
[13:01:07] <wikibugs>	 (03PS5) 10Urbanecm: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson)
[13:01:16] <stephanebisson>	 if matthiasmullie is around, we could start with his patch. I need some time to get ready
[13:01:21] <urbanecm>	 okay
[13:01:31] <urbanecm>	 matthiasmullie: hi, are you around?
[13:02:02] <icinga-wm>	 PROBLEM - Host ganeti1029 is DOWN: PING CRITICAL - Packet loss = 100%
[13:03:52] <vgutierrez>	 ^^ expected?
[13:04:07] <urbanecm>	 matthiasmullie: ping #2, are you around for your deployment?
[13:04:10] <vgutierrez>	 hmm right, that's moritzm 
[13:04:11] <stephanebisson>	 urbanecm, ok we can do mine
[13:04:15] <urbanecm>	 okay
[13:04:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson)
[13:04:33] <moritzm>	 ganeti1029's downtime expired, all is well
[13:04:38] <icinga-wm>	 RECOVERY - Host ganeti1029 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:04:55] <moritzm>	 I'm glad Icinga concurs :-)
[13:05:13] <wikibugs>	 (03Merged) 10jenkins-bot: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson)
[13:05:45] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:826882|Explicit config for Wikistories discovery module (T314582)]]
[13:05:49] <stashbot>	 T314582: Make Wikistories configurable for public release - https://phabricator.wikimedia.org/T314582
[13:06:08] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudnet1006.eqiad.wmnet with OS bullseye
[13:06:10] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and sbisson: Backport for [[gerrit:826882|Explicit config for Wikistories discovery module (T314582)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:06:20] <wikibugs>	 (03PS7) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
[13:06:21] <urbanecm>	 stephanebisson: can you check it at a debug server?
[13:06:24] <matthiasmullie>	 o/
[13:06:27] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[13:06:37] <matthiasmullie>	 @urbanecm sorry for showing up late, missed notification :p
[13:06:40] <urbanecm>	 no worries!
[13:06:41] <stephanebisson>	 urbanecm mwdebug1002?
[13:06:51] <urbanecm>	 stephanebisson: yup!
[13:06:52] <wikibugs>	 (03CR) 10BCornwall: ats: Alert on high connection/request count (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[13:07:14] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) just putting a noted here.  after looking at the [[ https://galaxy.ansible.com/dellemc/openm...
[13:07:53] <wikibugs>	 (03Merged) 10jenkins-bot: Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[13:08:10] <stephanebisson>	 urbanecm looks good, you can sync
[13:08:12] <icinga-wm>	 PROBLEM - configured eth on ganeti1029 is CRITICAL: public reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[13:08:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[13:08:17] <urbanecm>	 stephanebisson: great, syncing
[13:08:33] <wikibugs>	 (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[13:09:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie)
[13:09:05] <wikibugs>	 (03PS2) 10Urbanecm: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie)
[13:09:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[13:09:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie)
[13:09:54] <wikibugs>	 (03Merged) 10jenkins-bot: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie)
[13:11:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:12:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:12:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:12:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[13:12:22] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:826882|Explicit config for Wikistories discovery module (T314582)]] (duration: 06m 37s)
[13:12:26] <stashbot>	 T314582: Make Wikistories configurable for public release - https://phabricator.wikimedia.org/T314582
[13:12:37] <urbanecm>	 stephanebisson: your patch's live!
[13:12:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie)
[13:12:46] <stephanebisson>	 urbanecm thank you!
[13:12:50] <urbanecm>	 no problem!
[13:12:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:13:07] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:839500|Show thumbnails on Special:Search for NS_FILE + PageImages (T306883)]]
[13:13:10] <stashbot>	 T306883: [L] Searchers see thumbnails next to search results on the special:search page - https://phabricator.wikimedia.org/T306883
[13:13:30] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and mlitn: Backport for [[gerrit:839500|Show thumbnails on Special:Search for NS_FILE + PageImages (T306883)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:13:30] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Verified the config and the volume.config output file." [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[13:13:54] <urbanecm>	 matthiasmullie: can you check at mwdebug1002 please?
[13:13:58] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: Enable cache partitioning in cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/839490 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[13:14:12] <matthiasmullie>	 urbanecm: LGTM!
[13:14:20] <urbanecm>	 that was quick, syncing!
[13:14:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:14:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:14:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) I don't think it's true to say the VRRP is over VXLAN here, the VRRP...
[13:15:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Allow partitioning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[13:15:58] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye
[13:16:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:16:32] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1006.eqiad.wmnet
[13:16:41] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[13:17:25] <matthiasmullie>	 @urbanecm thanks!
[13:17:56] <vgutierrez>	 !log partition ats-be cache in cp6008 - T317748
[13:17:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:00] <stashbot>	 T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[13:18:14] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[13:18:19] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:839500|Show thumbnails on Special:Search for NS_FILE + PageImages (T306883)]] (duration: 05m 12s)
[13:18:23] <stashbot>	 T306883: [L] Searchers see thumbnails next to search results on the special:search page - https://phabricator.wikimedia.org/T306883
[13:18:36] <wikibugs>	 (03Merged) 10jenkins-bot: Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar)
[13:18:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Enable cache partitioning in cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/839490 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[13:18:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:18:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:19:11] <urbanecm>	 matthiasmullie: should be live!
[13:19:16] <icinga-wm>	 RECOVERY - configured eth on ganeti1029 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[13:19:27] <urbanecm>	 anything else?
[13:19:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[13:19:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:19:59] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] Update the logic to run test coverage (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik)
[13:20:16] <moritzm>	 !log draining ganeti1014 T311687
[13:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:20] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[13:20:53] <urbanecm>	 !log UTC afternoon backport window done
[13:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:58] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:25:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti1030 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/839521 (https://phabricator.wikimedia.org/T299459)
[13:36:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) cableid c220756659 fpc2 - fpc8.
[13:41:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons.
[13:42:07] <elukey>	 \o/
[13:46:08] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) Both clusters are running PKI and today I have also ran the following clean up steps:  1) removed the old puppet ce...
[13:47:01] <wikibugs>	 (03PS1) 10Hnowlan: Add missing prod dependencies [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/839548 (https://phabricator.wikimedia.org/T233196)
[13:48:19] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1007.eqiad.wmnet
[13:56:23] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[14:00:19] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 3 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:00:26] <wikibugs>	 (03PS1) 10Clément Goubert: Add build instructions in debian/README [debs/helm3] - 10https://gerrit.wikimedia.org/r/839550
[14:00:32] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10hashar) James has made the necessary CI updates and I have deployed them.
[14:01:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add build instructions in debian/README [debs/helm3] - 10https://gerrit.wikimedia.org/r/839550 (owner: 10Clément Goubert)
[14:01:15] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[14:03:51] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:03:52] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts aqs1007.eqiad.wmnet
[14:04:21] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) 05Open→03Resolved Sorry @LucasWerkmeister I assumed this task was about updating production. Re-resolving then :)
[14:04:41] <wikibugs>	 (03PS1) 10Hashar: Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027)
[14:04:57] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) (owner: 10Hashar)
[14:05:21] <wikibugs>	 (03Merged) 10jenkins-bot: Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) (owner: 10Hashar)
[14:06:22] <hashar>	 I am going to upgrade Gerrit from 3.4.5 to 3.4.6
[14:07:46] <vgutierrez>	 !log updating HAProxy to version 2.4.19 in ulsfo
[14:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:09] <wikibugs>	 (03PS1) 10Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/839554
[14:08:32] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit2002
[14:08:33] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Lucas_Werkmeister_WMDE) The change for T316923 is passing in CI now (currently going through test-and-submit), so I think this is indeed done. Th...
[14:08:43] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit2002 (duration: 00m 10s)
[14:08:44] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[14:10:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/839554 (owner: 10Clément Goubert)
[14:11:31] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10aborrero)
[14:12:24] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, see one in-line comment" [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn)
[14:12:34] <hashar>	 !log Upgrading primary Gerrit # T319513
[14:12:38] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[14:12:38] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit1001
[14:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:39] <stashbot>	 T319513: Upgrade Gerrit to 3.4.6 - https://phabricator.wikimedia.org/T319513
[14:12:46] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit1001 (duration: 00m 08s)
[14:13:55] <XioNoX>	 !log move asw2-c-eqiad<->cr1 link to new 40G link - T313385
[14:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:27] <_joe_>	 hashar: gerrit is still down FWIW
[14:15:24] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:15:41] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn)
[14:15:50] <icinga-wm>	 PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:36] <hashar>	 !log Gerrit upgraded from 3.4.5 to 3.4.6 # T319513
[14:16:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:52] <icinga-wm>	 RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:18:02] <wikibugs>	 (03PS1) 10Jcrespo: Revert "dbbackups: Test mariadb 10.6 on a (currently passive) backup source" [puppet] - 10https://gerrit.wikimedia.org/r/839566 (https://phabricator.wikimedia.org/T318062)
[14:18:26] <hashar>	  Invalid cookie header: "set-cookie: WMF-Last-Access=06-Oct-2022;Path=/;HttpOnly;secure;Expires=Mon, 07 Nov 2022 12:00:00 GMT". Invalid 'expires' attribute: Mon, 07 Nov 2022 12:00:00 GMT
[14:18:29] <hashar>	 fun :)
[14:18:44] <hashar>	 looks like that cookie is set for all of wikimedia.org and ends up hitting Gerrit as well
[14:19:13] <vgutierrez>	 hashar: what's issuing the error?
[14:19:25] <vgutierrez>	 that cookie is set by varnish
[14:19:31] <hashar>	 the Jetty  server in Gerrit
[14:19:38] <wikibugs>	 (03PS2) 10Jcrespo: Revert "dbbackups: Test mariadb 10.6 on a (currently passive) backup source" [puppet] - 10https://gerrit.wikimedia.org/r/839566 (https://phabricator.wikimedia.org/T318062)
[14:19:42] <vgutierrez>	 that's a bogus client messing with you
[14:19:42] <wikibugs>	 (03PS1) 10Elukey: admin_ng: set proper TLS egress origination settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839587
[14:19:44] <wikibugs>	 (03PS1) 10Elukey: ml-services: move eventgate config to TLS egress origination [deployment-charts] - 10https://gerrit.wikimedia.org/r/839588
[14:19:56] <vgutierrez>	 a client should send "Cookie" rather than set-cookie
[14:19:58] <hashar>	 yeah it looks harmless, we had it before the Gerrit upgrade
[14:20:18] <vgutierrez>	 Set-Cookie is meant to be used by a server, not a UA
[14:20:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Test mariadb 10.6 on a (currently passive) backup source" [puppet] - 10https://gerrit.wikimedia.org/r/839566 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo)
[14:20:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:22:38] <hashar>	 vgutierrez: oh nice.  I have no idea where it comes from though, maybe I will dig into it later :)  It is a single user so far so probably not a concern in any way
[14:22:41] <hashar>	 thanks!
[14:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[14:25:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:25:21] <vgutierrez>	 hashar: the CDN sets WMF-Last-Access here, https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/analytics.inc.vcl.erb#L55-L62
[14:25:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:26:04] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) So, we have a need to move on this pretty quickly, as we have 16 new cache hosts in ulsfo pending installs on this, and then 16 more in eqsin righ...
[14:26:29] <wikibugs>	 (03CR) 10Elukey: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[14:27:03] <vgutierrez>	 hashar: and per RFC 6265 https://httpwg.org/specs/rfc6265.html#sane-set-cookie Set-Cookie is intended as a server -> client header
[14:28:02] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudnet1006.eqiad.wmnet with OS bullseye
[14:29:20] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) I'll  take care of "Create a buster-based 4.19+5.10 boot image " tomorrow.
[14:29:47] <hashar>	 vgutierrez: I think Gerrit is internally confused somehow cause I see that message for ssh commands or clients doing a `git push` over ssh
[14:29:56] <hashar>	 vgutierrez: thanks for the refs :]
[14:30:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:12] <vgutierrez>	 hashar: that's weird
[14:30:21] <XioNoX>	 !log moving eqiad row C vrrp mastership to cr1-eqiad
[14:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) Ok yeah I see what is going on.  Cloudnet1005 is running VXLAN over U...
[14:31:10] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) >>! In T319067#8290850, @MoritzMuehlenhoff wrote: > I'll  take care of "Create a buster-based 4.19+5.10 boot image " tomorrow.  Thank you!
[14:31:30] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:31:56] <wikibugs>	 (03PS1) 10MVernon: swift: restore ms-be1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/839591 (https://phabricator.wikimedia.org/T307667)
[14:32:56] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.dns.wipe-cache: add sudo to the command [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans)
[14:34:49] <hashar>	 vgutierrez: yeah turns out I already filed a task for that https://phabricator.wikimedia.org/T273605
[14:56:13] <bblack>	 !log eqiad front edge depooled in DNS
[14:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:24] <bblack>	 takes 10 minutes or so to take full effect anyways
[14:56:30] <volans>	 as usual
[14:56:32] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:42] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:56:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:56:50] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:56:50] <jhathaway>	 no recovery yet on the status page graphs, https://grafana.wikimedia.org/d/3u6RLsL7k/status-page?orgId=1&from=now-1h&to=now
[14:56:57] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:57:04] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:57:22] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[14:57:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:58:00] <bblack>	 we could also depool that side of A/A at the mediawiki level
[14:58:03] <jinxer-wm>	 (ProbeDown) resolved: (6) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:58:05] <vgutierrez>	 ats-be 5xx are going back to normal
[14:58:14] <vgutierrez>	 https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?orgId=1&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&viewPanel=14
[14:58:17] <bblack>	 otherwise e.g. drmrs+esams traffic are still hitting mw in eqiad too
[14:58:19] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:58:26] <bblack>	 but if it's resolving, no point
[14:58:31] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] prometheus: Remove ATS 8-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/838886 (owner: 10BCornwall)
[14:58:31] <volans>	 bblack: yes, but unless we fail over, writes will still all go to eqiad
[14:58:33] <jhathaway>	 looks to be resolving
[14:58:34] <XioNoX>	 I'm wondering why it took so long to recover
[14:58:44] <bblack>	 volans: yeah but we could save the reads! :)
[14:59:01] <XioNoX>	 but looks like row D lost connectivity during the interface move, while it shouldn't have
[14:59:08] <volans>	 not nice
[14:59:12] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[14:59:23] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:00:52] <volans>	 graphs look mostly at the recovered values
[15:01:19] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[15:01:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10Papaul) 05Open→03Resolved disk replaced
[15:01:32] <jhathaway>	 yeah, looks like full recovery
[15:01:38] <wikibugs>	 (03PS2) 10BCornwall: prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815)
[15:01:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage
[15:01:54] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1008.eqiad.wmnet
[15:01:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:02:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] swift: restore ms-be1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/839591 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon)
[15:02:25] <XioNoX>	 we will continue in a future window
[15:02:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:03:30] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[15:03:35] <bblack>	 XioNoX: so we're stable for now?
[15:03:58] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[15:04:30] <XioNoX>	 bblack: yeah everything is 100% back to normal on the network side
[15:04:42] <bblack>	 ok, any objection to reverting the dns depool?
[15:04:48] <XioNoX>	 no objection
[15:04:54] <volans>	 +1
[15:05:16] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage
[15:05:19] <wikibugs>	 (03PS1) 10BBlack: Revert "depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/839567
[15:05:28] <wikibugs>	 (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/839567 (owner: 10BBlack)
[15:05:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: set proper TLS egress origination settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839587 (owner: 10Elukey)
[15:06:18] <XioNoX>	 the tl;dr is that disabling an interface caused traffic to be blackholed instead of failing over to the other interface; I believe things would have converged eventually, or even did converge before the rollback
[15:06:34] <XioNoX>	 I'll write an incident report
[15:07:01] <jhathaway>	 blackholes are the worst for fast failover :(
[15:07:17] <jynus>	 there was a dbproxy failover
[15:07:32] <jynus>	 2 actually
[15:07:32] <wikibugs>	 (03CR) 10BCornwall: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[15:07:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:07:56] <XioNoX>	 jynus: ah? which ones?
[15:07:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[15:08:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:08:20] <jynus>	 dbproxy1016 and dbproxy1017, not sure what service they are and what they point to
[15:08:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:08:31] <XioNoX>	 the funny thing is that this change is much less risky than the router upgrades we did the previous weeks :)
[15:08:39] <wikibugs>	 (03CR) 10Jforrester: scap/dsh: remove parsoid service, replaced by parsoid-php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn)
[15:09:00] <XioNoX>	 jinxer-wm: looks like they're both in row D so that makes sense
[15:09:05] <XioNoX>	 er jynus ^
[15:09:16] <volans>	 it's all jinxer-wm's fault
[15:09:27] <jynus>	 db1159 is considered down
[15:09:43] <XioNoX>	 jinxer-wm: still?
[15:09:45] <XioNoX>	 er!
[15:09:46] <jynus>	 that is m3
[15:09:49] <XioNoX>	 jynus: 
[15:09:51] <jynus>	 (phabricator)
[15:10:01] <jynus>	 not sure if the active one
[15:10:12] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:10:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1008.eqiad.wmnet
[15:10:46] <jynus>	 I think not, dbproxy1020 was active, dbproxy1016 was passive
[15:11:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1009.eqiad.wmnet
[15:11:39] <jynus>	 checking now dbproxy1017
[15:12:35] <volans>	 anything we can do to help?
[15:12:51] <jynus>	 by chance also passive! :-D
[15:12:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:12:57] <volans>	 yay
[15:13:11] <jynus>	 if it had been dbproxy1021, the m5 db would have been down/read-only
[15:13:55] <jynus>	 XioNoX: if confirmed no more changes affecting that, I will reload the proxy config (it doesn't reconnect to the original dbs to avoid flapping)
[15:14:07] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[15:14:13] <wikibugs>	 (03PS1) 10Volans: mypy: remove upper limit [software/spicerack] - 10https://gerrit.wikimedia.org/r/839602
[15:15:08] <XioNoX>	 jynus: yeah, everything is back to normal
[15:15:23] <jynus>	 ok, will log and reload config on those proxies
[15:16:58] <jynus>	 !log reload haproxy config on dbproxy1016, dbproxy1017
[15:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:18] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) Row C got moved to the new linecards with no issues, but moving cr1<->row D caused an outage.  As row C cleanup, @Jclark-ctr can you rem...
[15:17:25] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[15:18:03] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[15:18:03] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[15:19:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:19:42] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1009.eqiad.wmnet
[15:21:09] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: restore ms-be1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/839591 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon)
[15:21:18] <jynus>	 https://grafana.wikimedia.org/goto/nnetPwV4k?orgId=1
[15:21:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:22:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:23:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow SRE to send annotated and signed tags [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar)
[15:24:58] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (10dancy)
[15:26:54] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) > With this method d-i will only see the two SSD disks and as such will have no way...
[15:27:53] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder)
[15:28:03] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (10dancy)
[15:28:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bullseye
[15:28:41] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[15:29:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) @aborrero thanks.  Reading briefly through the docs I have a better u...
[15:31:43] <wikibugs>	 (03PS1) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277)
[15:32:20] <wikibugs>	 (03PS2) 10Elukey: ml-services: move eventgate config to TLS egress origination [deployment-charts] - 10https://gerrit.wikimedia.org/r/839588
[15:32:22] <wikibugs>	 (03PS1) 10Elukey: admin_ng: fix eventgate's egress TLS origin config on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839626
[15:32:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[15:35:47] <wikibugs>	 (03PS2) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277)
[15:35:56] <wikibugs>	 (03PS1) 10Volans: sre.hosts.reimage: support different installers [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067)
[15:36:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero) >>! In T319539#8291916, @cmooney wrote: > I gather the hypervisor ho...
[15:38:41] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10Volans) > * Add support for it (it being whatever it takes to switch to 5.10) to the reimage cookbook stuff @BBlack  the above patch should have all that...
[15:38:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: fix eventgate's egress TLS origin config on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839626 (owner: 10Elukey)
[15:39:32] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37477/console" [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[15:41:47] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[15:42:00] <vgutierrez>	 :_)
[15:42:09] * volans still here although not paged for 12 minutes
[15:42:22] <XioNoX>	 vgutierrez: happy oncall
[15:42:26] * jhathaway here as well
[15:43:22] <volans>	 quite some spikes https://librenms.wikimedia.org/graphs/device=140/type=device_bits/from=1664984587/legend=yes/popup_title=Device+Traffic/
[15:43:29] <volans>	 reaching the 10G
[15:43:31] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudnet1003/1004: make them spare servers [puppet] - 10https://gerrit.wikimedia.org/r/839628 (https://phabricator.wikimedia.org/T319682)
[15:44:26] <volans>	 that's Arelion
[15:44:37] <volans>	 https://librenms.wikimedia.org/device/device=140/tab=port/port=16840/
[15:44:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] cloudnet1003/1004: make them spare servers [puppet] - 10https://gerrit.wikimedia.org/r/839628 (https://phabricator.wikimedia.org/T319682) (owner: 10Arturo Borrero Gonzalez)
[15:44:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:45:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:45:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) > But we do have keepalived running on cloudgw servers. So we may wan...
[15:45:46] <jhathaway>	 volans: also being discussed in #wikimedia-sre
[15:45:55] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: move eventgate config to TLS egress origination [deployment-charts] - 10https://gerrit.wikimedia.org/r/839588 (owner: 10Elukey)
[15:46:47] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[15:47:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[15:47:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1003/1004: make them spare servers [puppet] - 10https://gerrit.wikimedia.org/r/839628 (https://phabricator.wikimedia.org/T319682) (owner: 10Arturo Borrero Gonzalez)
[15:47:50] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye
[15:48:01] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1004.eqiad.wmnet with OS bullseye
[15:49:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:51:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:51:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[15:52:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[15:52:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[15:53:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[15:54:55] <wikibugs>	 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10akosiaris) Hi! >>! In T301505#8240830, @Novem_Linguae wrote: > In general, shouldn't phabricator tickets be one ticket = one cause? This one seems like i...
[15:56:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:57:53] <topranks>	 !log Applying explicit BFD mode configuration to cr4-ulsfo for Anycast BGP groups.
[15:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:57] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:05] <jouncebot>	 jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1600). Please do the needful.
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:21] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:01:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:07] <wikibugs>	 (CR) AOkoth: [C: +1] lower TTL for phabricator from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[16:05:19] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) This has been completed smoothly!  I deleted the following VC cables from Netbox: 0315 0316 0317 0318 0320  Please...
[16:06:56] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (ayounsi) Open→Resolved a:ayounsi Sub-task completed successfully nothing more to do here.
[16:09:15] <wikibugs>	 (PS1) Cathal Mooney: Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501)
[16:09:16] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (Arnoldokoth) Ooh, sorry I missed that step. I have added you to the wmf-nda group as well. Thanks @Aklapper
[16:10:10] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (Arnoldokoth) I have added to the wmf-nda group as well. Thanks @Aklapper
[16:18:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[16:21:00] <wikibugs>	 (PS2) JMeybohm: Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943)
[16:21:33] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:26:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[16:27:42] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[16:34:02] <wikibugs>	 (CR) JMeybohm: Add a spark-operator production image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[16:36:33] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:36:55] <wikibugs>	 (CR) Btullis: Add a new production image for spark version 3.3.0 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[16:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:43:38] <wikibugs>	 (CR) Vlad.shapik: Update the logic to run test coverage (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: Vlad.shapik)
[16:45:55] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Also looks like the optic or fiber needs to be replaced, error rate is high: https://librenms.wikimedia.org/device/device=162/tab=port/p...
[16:50:11] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (cmooney) Diff if the above patch is merged (running from my laptop with updated template): ` Changes for 8 devices: ['c...
[16:53:16] <wikibugs>	 (CR) Vlad.shapik: [C: +1] Add missing prod dependencies [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/839548 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[16:57:08] <wikibugs>	 (PS3) Btullis: Remove legacy AQS host configuration [puppet] - https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277)
[16:57:50] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (phaultfinder)
[17:00:03] <wikibugs>	 (CR) Btullis: "Note that the Cassandra 3 cluster is still using a role called aqs_next - which is why it's safe to delete the aqs role. I will rename the" [puppet] - https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: Btullis)
[17:00:05] <jouncebot>	 bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1700)
[17:08:27] <icinga-wm>	 PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:12:03] <wikibugs>	 (PS1) Ssingh: Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - https://gerrit.wikimedia.org/r/839571
[17:12:43] <wikibugs>	 (CR) CI reject: [V: -1] Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - https://gerrit.wikimedia.org/r/839571 (owner: Ssingh)
[17:14:08] <wikibugs>	 (PS2) Ssingh: Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - https://gerrit.wikimedia.org/r/839571
[17:15:12] <wikibugs>	 (CR) Btullis: Add a spark-operator production image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[17:16:17] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - https://gerrit.wikimedia.org/r/839571 (owner: Ssingh)
[17:16:44] <wikibugs>	 (CR) Btullis: Add a new production image for spark version 3.3.0 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[17:22:49] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:27:20] <wikibugs>	 (CR) Dzahn: [C: +1] Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) (owner: Hashar)
[17:28:58] <wikibugs>	 (CR) Dzahn: [C: +2] lower TTL for phabricator from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:29:44] <wikibugs>	 (CR) Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 (1 comment) [dns] - https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:29:48] <wikibugs>	 (CR) Slyngshede: [C: +1] "Tested on an M1 with Python 3.10 and looks good:" [software/spicerack] - https://gerrit.wikimedia.org/r/839602 (owner: Volans)
[17:30:16] <wikibugs>	 (CR) Dzahn: [C: +2] lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:30:19] <wikibugs>	 (PS2) Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319)
[17:31:23] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM!" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/838835 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[17:31:52] <wikibugs>	 (CR) Ssingh: [C: +1] "Thanks for working on this and fixing it in the configuration!" [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney)
[17:32:08] <wikibugs>	 (CR) Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 (1 comment) [dns] - https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:32:39] <wikibugs>	 (PS2) Dzahn: lower TTL for phabricator from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319)
[17:36:22] <wikibugs>	 (PS1) Dzahn: lower TTL for gitlab-replicas from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/839665 (https://phabricator.wikimedia.org/T315319)
[17:36:41] <wikibugs>	 (CR) Dzahn: [C: +2] lower TTL for gerrit,gerrit-replica from 600 to 300 (1 comment) [dns] - https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:37:43] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:38:47] <wikibugs>	 (CR) Ssingh: [C: +1] lower TTL for gitlab-replicas from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/839665 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:42:04] <wikibugs>	 (CR) Dzahn: [C: +2] lower TTL for gitlab-replicas from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/839665 (https://phabricator.wikimedia.org/T315319) (owner: Dzahn)
[17:45:29] <wikibugs>	 (CR) Volans: [C: +2] mypy: remove upper limit [software/spicerack] - https://gerrit.wikimedia.org/r/839602 (owner: Volans)
[17:45:51] <wikibugs>	 SRE, ops-codfw, DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (Dzahn) a:Jgreen
[17:46:09] <wikibugs>	 SRE, ops-codfw, DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (Dzahn) Open→In progress
[17:46:40] <wikibugs>	 SRE, ops-codfw, DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (Dzahn) p:Unbreak!→High
[17:47:53] <wikibugs>	 SRE, SRE-Access-Requests: Please add eigyan to Restricted Group - https://phabricator.wikimedia.org/T318983 (Dzahn)
[17:49:43] <wikibugs>	 SRE, SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (Dzahn)
[17:49:57] <wikibugs>	 SRE, SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (Dzahn) p:Triage→Medium
[17:50:53] <wikibugs>	 SRE, SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (Dzahn) @Arnoldokoth This is existing shell user `essexigyan` but an additional group.
[17:52:21] <wikibugs>	 (Merged) jenkins-bot: mypy: remove upper limit [software/spicerack] - https://gerrit.wikimedia.org/r/839602 (owner: Volans)
[17:54:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (Arnoldokoth) Hey @AnnWF Kindly sign this https://phabricator.wikimedia.org/L3   Will also need approval from @Ottomata / @odimitrijevic and Dylan Kozlowski (I can't seem...
[17:55:35] <wikibugs>	 (PS8) BCornwall: ats: Alert on high connection/request count [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
[17:58:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (Dzahn) > Originally had I asked Simulo to file a new NDA after their transition to a volunteer role, unfortunately this volunteer onboarding isn't as simple as I had hoped.   @...
[18:00:05] <jouncebot>	 ^demon and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1800). nyaa~
[18:00:31] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (Dzahn) a:awight Could you get the approval from a manager of some type?  Meanwhile Katie can reach out directly to @Simulo (@Simulo, she will need your email address, you c...
[18:04:25] <wikibugs>	 (PS9) BCornwall: ats: Alert on high connection/request count [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
[18:05:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (Dzahn) p:Triage→Medium a:Devnull
[18:07:16] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (Dzahn) p:Triage→Medium
[18:08:42] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Dzahn) Open→In progress
[18:09:00] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (Dzahn) Open→In progress
[18:09:39] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (Arnoldokoth)
[18:09:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (Dzahn) a:AnnWF
[18:10:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (Arnoldokoth) Open→In progress
[18:12:19] <wikibugs>	 SRE, ops-codfw, DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (Papaul) In progress→Resolved This was fixed
[18:13:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1061-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:14:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Arnoldokoth) a:karapayneWMDE
[18:22:06] <wikibugs>	 (CR) Eevans: [C: +1] Remove legacy AQS host configuration [puppet] - https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: Btullis)
[18:24:12] <wikibugs>	 (PS1) Jdlrobson: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396)
[18:29:02] <wikibugs>	 (Abandoned) Ryan Kemper: [wip] logstash: remove old files [puppet] - https://gerrit.wikimedia.org/r/838255 (owner: Ryan Kemper)
[18:29:06] <wikibugs>	 SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (Dzahn) [[ https://en.wikipedia.org/wiki/Janus | Janus  ]] because of the 2 faces and it's what you get when you search for "Greek god of identity" and this is managing identities.
[18:29:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet1003.eqiad.wmnet
[18:30:44] <wikibugs>	 (PS1) AOkoth: admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326)
[18:31:33] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (Arnoldokoth) p:Triage→Medium
[18:35:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[18:39:45] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:39:45] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1003.eqiad.wmnet
[18:42:01] <wikibugs>	 (CR) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[18:42:48] <wikibugs>	 (PS1) Ryan Kemper: elastic: replace 2 codfw masters to be decom'd [puppet] - https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431)
[18:43:07] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:44:03] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: Ryan Kemper)
[18:44:10] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37478/console" [puppet] - https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: Ryan Kemper)
[18:44:42] <wikibugs>	 (CR) Bking: [C: +1] elastic: replace 2 codfw masters to be decom'd [puppet] - https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: Ryan Kemper)
[18:44:48] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:45:01] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01926 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:45:39] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:46:57] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] elastic: replace 2 codfw masters to be decom'd [puppet] - https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: Ryan Kemper)
[18:47:01] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:47:07] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:47:07] <wikibugs>	 (CR) Dzahn: "if they really need shell access then this patch looks good to me. but the ticket said "might need" and that seemed a little weak. maybe a" [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[18:47:51] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) dns4003 appears to be pushed fully into service (thanks @ssingh!)  With that now seeming all green in icinga & confirmed with @BBlack , I'll move ahead and take down/decom dns4002 next tim...
[18:47:57] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:49:21] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:49:35] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Jclark-ctr) Can this be changed at any time?   I will work on netbox updates  when not in data center
[18:50:14] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on elastic2061.codfw.wmnet with reason: restarting for config reload - T313431
[18:50:18] <stashbot>	 T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431
[18:50:29] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on elastic2061.codfw.wmnet with reason: restarting for config reload - T313431
[18:50:37] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on elastic2084.codfw.wmnet with reason: restarting for config reload - T313431
[18:50:45] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:48] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:48] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:50:52] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:04] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on elastic2084.codfw.wmnet with reason: restarting for config reload - T313431
[18:51:20] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:41] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:42] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:55] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:57] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:58] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:58] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:51:59] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:52:22] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on elastic[2025,2031].codfw.wmnet with reason: restarting for config reload - T313431
[18:52:27] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:52:39] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on elastic[2025,2031].codfw.wmnet with reason: restarting for config reload - T313431
[18:52:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:11] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:13] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:14] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:14] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:15] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:17] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:17] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:18] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:53:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:05] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:06] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:11] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:19] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:21] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:22] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:23] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:54:24] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:55:14] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:57:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[19:00:25] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002963 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[19:00:50] <andrewbogott>	 sorry about that noise!  I think things are all recovered/recovering now
[19:01:40] <gehel>	 andrewbogott: thanks 
[19:03:20] <inflatador>	 !log bking@elastic restarted elastic2025, 2031, 2061, 2084 T313431
[19:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:24] <stashbot>	 T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431
[19:07:21] <wikibugs>	 (CR) Jdlrobson: Automate icon generation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[19:15:04] <brennen>	 !log train 1.40.0-wmf.4 (T314193) no current blockers, rolling train to all wikis
[19:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:08] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[19:15:48] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/839672 (https://phabricator.wikimedia.org/T314193)
[19:15:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/839672 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot)
[19:16:16] <wikibugs>	 (PS1) Bking: elastic: raise master-eligibles from 3 to 5 [puppet] - https://gerrit.wikimedia.org/r/839673 (https://phabricator.wikimedia.org/T313431)
[19:16:41] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/839672 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot)
[19:18:33] <wikibugs>	 (PS2) Bking: elastic: raise master-eligibles from 3 to 5 [puppet] - https://gerrit.wikimedia.org/r/839673 (https://phabricator.wikimedia.org/T313431)
[19:20:25] <wikibugs>	 (Abandoned) Bking: elastic: raise master-eligibles from 3 to 5 [puppet] - https://gerrit.wikimedia.org/r/839673 (https://phabricator.wikimedia.org/T313431) (owner: Bking)
[19:20:59] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.4  refs T314193
[19:21:03] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[19:21:24] <wikibugs>	 (CR) Bking: [C: +1] elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) (owner: Gehel)
[19:21:51] <wikibugs>	 (PS4) Bking: elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) (owner: Gehel)
[19:22:06] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/839675 (https://phabricator.wikimedia.org/T314193)
[19:22:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/839675 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot)
[19:22:16] <brennen>	 ...eh, rolling this back to group1 and filing some tickets.
[19:23:07] <wikibugs>	 (CR) Bking: [C: +2] elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) (owner: Gehel)
[19:23:51] <James_F>	 The spike in "This Title instance does not represent a proper page, but merely a link target."?
[19:24:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:24:17] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/839675 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot)
[19:24:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 154 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:25:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:25:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:25:25] <brennen>	 James_F: yeah, also just noticed a bunch of `Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array`
[19:25:41] <James_F>	 But not new with the train?
[19:26:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:26:19] <James_F>	 Or maybe new on .4 but found on group1.
[19:26:31] <James_F>	 Fun times.
[19:27:09] <James_F>	 The invalid titles might be T292552 ?
[19:27:09] <stashbot>	 T292552: Rename articles and users to prepare for PHP 7.3 unicode changes - https://phabricator.wikimedia.org/T292552
[19:27:14] <James_F>	 (That's not been run yet.)
[19:27:32] <James_F>	 But I don't know of anything being intentionally merged that expected that to have been done.
[19:28:15] <zabe>	 I would have guessed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/828553 for those proper page errors
[19:28:32] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.3  refs T314193
[19:28:34] <James_F>	 Oh, hmm, could well be.
[19:28:36] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[19:28:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: T313431
[19:28:57] <stashbot>	 T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431
[19:29:00] <James_F>	 zabe: Good find.
[19:29:05] <brennen>	 i filed T319798; input welcome there
[19:29:05] <stashbot>	 T319798: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target. - https://phabricator.wikimedia.org/T319798
[19:29:10] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: T313431
[19:29:56] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:30:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:31:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:32:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:32:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:33:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:34:11] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@cbdc509]: (no justification provided)
[19:34:25] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@cbdc509]: (no justification provided) (duration: 00m 14s)
[19:36:59] <brennen>	 blocking on T314193 as well.
[19:37:00] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[19:37:31] <wikibugs>	 SRE, Data Engineering Planning, Data Pipelines, Foundational Technology Requests, User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (Ottomata) Cool!  Is `this.ip.geoip_asn` built into benthos or did you provide it somehow?
[19:38:31] <brennen>	 (er, T319799)
[19:38:31] <stashbot>	 T319799: TypeError: Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array, object given - https://phabricator.wikimedia.org/T319799
[19:39:51] <wikibugs>	 (PS2) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[19:39:53] <wikibugs>	 (PS1) Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223)
[19:40:37] <wikibugs>	 (CR) CI reject: [V: -1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[19:40:49] <wikibugs>	 (CR) CI reject: [V: -1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[19:41:12] <SandraEbele>	 !log deployed airflow to fix projectview_hourly_dag
[19:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:25] <wikibugs>	 (PS1) Samtar: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025)
[19:46:04] <wikibugs>	 (PS1) Jdlrobson: Flag when projects are missing wordmarks or icons [mediawiki-config] - https://gerrit.wikimedia.org/r/839680
[19:47:10] <wikibugs>	 (CR) CI reject: [V: -1] Flag when projects are missing wordmarks or icons [mediawiki-config] - https://gerrit.wikimedia.org/r/839680 (owner: Jdlrobson)
[19:50:44] <SandraEbele>	 !log killed Oozie projectview-hourly job
[19:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:07] <SandraEbele>	 !log Started airflow projectview_hourly_dag
[19:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:29] <wikibugs>	 (PS1) Jdlrobson: ReadingLists on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935)
[19:51:45] <wikibugs>	 (CR) CI reject: [V: -1] ReadingLists on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) (owner: Jdlrobson)
[19:53:28] <wikibugs>	 (PS1) Samtar: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025)
[19:57:15] <wikibugs>	 (PS2) Jdlrobson: ReadingLists on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935)
[20:00:04] <jouncebot>	 brennen and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T2000).
[20:00:04] <jouncebot>	 Jdlrobson, TheresNoTime, chlod, and NovemLinguae: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] * TheresNoTime can deploy ^^
[20:00:35] * urbanecm waves too
[20:01:22] <TheresNoTime>	 oh hey urbanecm, can you quickly double-check that my idea to backport https://gerrit.wikimedia.org/r/c/839575/ (for .3 and .4) is okay?
[20:02:06] <TheresNoTime>	 Jdlrobson: you around? Going to start with https://gerrit.wikimedia.org/r/c/839572/ :)
[20:02:12] <Jdlrobson>	 present
[20:02:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:02:22] <urbanecm>	 TheresNoTime: at first sight, sgtm!
[20:02:23] <Jdlrobson>	 TheresNoTime: sounds good! thanks
[20:02:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) (owner: Jdlrobson)
[20:02:54] <TheresNoTime>	 (ty urbanecm)
[20:03:02] <urbanecm>	 no problem
[20:03:25] <thcipriani>	 TheresNoTime: heyo! Can we steal deployment of a patch for training purposes :)
[20:03:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:03:47] <TheresNoTime>	 thcipriani: sure! https://gerrit.wikimedia.org/r/c/839684/ is up next if you want that one?
[20:04:02] <thcipriani>	 sure, thank you <3
[20:04:23] <Jdlrobson>	 my second change is beta cluster only thcipriani if you wanted to try your tool again
[20:04:48] <urbanecm>	 Jdlrobson: just double checking, https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/838899 doesn't need to be backported to fix T319396 in production?
[20:04:49] <stashbot>	 T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396
[20:05:00] <TheresNoTime>	 thcipriani: the current one merging is a mw core, so you've got ~15 minutes if you want me to cancel?
[20:05:03] <Jdlrobson>	 urbanecm: correct
[20:05:10] <urbanecm>	 okay, great!
[20:05:18] <TheresNoTime>	 s/cancel/let that merge
[20:05:38] <thcipriani>	 cool, thank you!
[20:05:43] <logmsgbot>	 !log samtar@deploy1002 backport aborted:  (duration: 03m 13s)
[20:05:47] <thcipriani>	 okie doke, merging the beta one
[20:06:01] <TheresNoTime>	 ack, all yours
[20:06:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) (owner: Jdlrobson)
[20:07:44] <wikibugs>	 (Merged) jenkins-bot: ReadingLists on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) (owner: Jdlrobson)
[20:10:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:10:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:13:56] <thcipriani>	 TheresNoTime: Jdlrobson all done! Made a good demo. Want me to do any more?
[20:14:23] <TheresNoTime>	 thcipriani: you can pick up https://gerrit.wikimedia.org/r/c/mediawiki/core/+/839572 if you want?
[20:14:26] <TheresNoTime>	 almost merged
[20:14:33] <TheresNoTime>	 (though you'll miss the +2ing part)
[20:14:53] <Jdlrobson>	 yeh just waiting on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/839572 and that's me done thanks
[20:15:03] <thcipriani>	 TheresNoTime: sure
[20:16:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:19:46] <thcipriani>	 TheresNoTime: feel free to +2 your own changes now, so you're not waiting forever
[20:20:42] <wikibugs>	 (CR) Samtar: [C: +2] "deploy" [extensions/PageTriage] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:20:49] <wikibugs>	 (CR) Samtar: [C: +2] "deploy" [extensions/PageTriage] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:22:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:22:04] <wikibugs>	 (Merged) jenkins-bot: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) (owner: Jdlrobson)
[20:22:49] <TheresNoTime>	 thcipriani: good call, thanks - ^ has merged now :)
[20:24:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) (owner: Jdlrobson)
[20:24:44] <wikibugs>	 (PS1) Hashar: gerrit: use 2 threads to replicate to GitHub [puppet] - https://gerrit.wikimedia.org/r/839694
[20:24:54] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:839572|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]]
[20:24:55] <thcipriani>	 oh, it +2s again -- TIL
[20:24:59] <stashbot>	 T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396
[20:25:09] <Jdlrobson>	 thcipriani: is the train blocked? I just noticed eswiki where I need to test is on wm4
[20:25:16] <Jdlrobson>	 wmf3 rather
[20:25:18] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:839572|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:25:18] <Jdlrobson>	 than wmf4
[20:25:33] <Jdlrobson>	 (Just wondering if I need to backport this to wmf3 as well)
[20:25:43] <thcipriani>	 yeah wikipedias are still on wmf.3: https://versions.toolforge.org/
[20:26:05] <Jdlrobson>	 Is it likely to stay that way until Monday?
[20:26:21] <Jdlrobson>	 If so I guess I need to backport this to wmf3 as well (sorry)
[20:26:43] <wikibugs>	 (CR) Hashar: "I have made the replica to use 4 threads in May with I172557bfbca4cf5bb8321cecafc7bc84f60abc5d / T307137." [puppet] - https://gerrit.wikimedia.org/r/839694 (owner: Hashar)
[20:26:47] <thcipriani>	 I believe that's being worked on, but it is getting late in the day. It'll probably be fixed by Monday, but I'm never 100%
[20:26:53] <wikibugs>	 (Merged) jenkins-bot: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:26:57] <wikibugs>	 (PS1) Jdlrobson: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396)
[20:27:00] <Jdlrobson>	 okay ill add this to the deployment calendar ^
[20:27:12] <Jdlrobson>	 TheresNoTime: feel free to do yours first
[20:27:16] <thcipriani>	 Jdlrobson: any way to check this on non wikipedia wikis?
[20:27:23] <Jdlrobson>	 probably...
[20:27:27] <Jdlrobson>	 im looking at group 1 wikis now
[20:27:33] <thcipriani>	 it's live on mwdebug on group0/1 now :)
[20:27:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet1003.eqiad.wmnet
[20:27:50] <TheresNoTime>	 (I'll wait to hear)
[20:27:52] <Jdlrobson>	 yep i can test on euwiki
[20:28:01] <thcipriani>	 cool :)
[20:28:35] <Jdlrobson>	 any of the debug servers thcipriani ?
[20:28:51] <wikibugs>	 (Merged) jenkins-bot: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:28:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:29:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:29:06] <thcipriani>	 Jdlrobson: yep, all of them have it
[20:29:22] <Jdlrobson>	 Fix confirmed on itwiki
[20:29:26] <Jdlrobson>	 feel free to sync!
[20:30:19] <Jdlrobson>	 okay wmf3 change is on the calendar now. Let me know when it's a good time
[20:30:22] <thcipriani>	 thanks Jdlrobson 
[20:30:29] <thcipriani>	 going live
[20:31:03] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:32:23] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[20:33:50] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:33:51] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1003.eqiad.wmnet
[20:34:45] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:839572|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] (duration: 09m 51s)
[20:34:48] <wikibugs>	 SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (MoritzMuehlenhoff) >>! In T319409#8292457, @Dzahn wrote: > [[ https://en.wikipedia.org/wiki/Janus | Janus  ]] because of the 2 faces and it's what you get when you search for "Greek god of identity" and...
[20:34:49] <stashbot>	 T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396
[20:34:58] <thcipriani>	 Jdlrobson: ^ should be live
[20:35:21] <Jdlrobson>	 yep!
[20:35:21] <thcipriani>	 TheresNoTime: please feel free to sync your changes
[20:35:28] <TheresNoTime>	 thcipriani: thanks :)
[20:35:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:35:33] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_check_icinga_contacts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:45] <thcipriani>	 I'll finish Jdlrobson 's patch after that
[20:35:51] <Jdlrobson>	 sounds good!
[20:35:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:36:36] <logmsgbot>	 !log samtar@deploy1002 Backport cancelled.
[20:36:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:37:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) (owner: Samtar)
[20:37:14] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]]
[20:37:18] <stashbot>	 T238025: Page Curation fails to create AFD page - https://phabricator.wikimedia.org/T238025
[20:37:37] <logmsgbot>	 !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:37:51] <TheresNoTime>	 (testing)
[20:37:52] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (phaultfinder)
[20:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:39:14] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet1004.eqiad.wmnet
[20:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:41:15] <TheresNoTime>	 (syncing)
[20:41:24] <wikibugs>	 (PS3) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[20:41:35] <wikibugs>	 (PS4) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[20:42:33] <wikibugs>	 (CR) CI reject: [V: -1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:44:28] <wikibugs>	 (CR) Thcipriani: [C: +2] Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) (owner: Jdlrobson)
[20:44:36] * thcipriani gets that cooking
[20:45:11] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] (duration: 07m 56s)
[20:45:13] <wikibugs>	 (CR) Legoktm: "Conceptually +1 to this, though I think we should be consistent across PHP versions if we're adding packages, so I would've -1'd this as w" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435) (owner: BryanDavis)
[20:45:15] <stashbot>	 T238025: Page Curation fails to create AFD page - https://phabricator.wikimedia.org/T238025
[20:45:48] <TheresNoTime>	 thcipriani: all yours :)
[20:46:20] <thcipriani>	 thanks TheresNoTime 
[20:47:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:47:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:48:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:55:46] <wikibugs>	 (CR) Xcollazo: [C: +1] "Change looks ok to me (cursory check though as I'm unfamiliar with codebase)." [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) (owner: Ottomata)
[20:57:19] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (KFrancis) >>! In T308013#8282942, @jbond wrote: > @QChris thanks for the contribution and reaching out. >  >>>! In T308013#8282636, @QChris wrote: >> While I fully support...
[20:57:55] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:58:21] <wikibugs>	 (PS5) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[20:58:23] <wikibugs>	 (PS2) Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223)
[20:58:25] <wikibugs>	 (PS2) Jdlrobson: Flag when projects are missing wordmarks or icons [mediawiki-config] - https://gerrit.wikimedia.org/r/839680
[20:58:27] <wikibugs>	 (PS1) Jdlrobson: DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223)
[20:58:49] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[20:59:26] <wikibugs>	 (Merged) jenkins-bot: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) (owner: Jdlrobson)
[20:59:28] <wikibugs>	 (CR) CI reject: [V: -1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:59:41] <wikibugs>	 (CR) CI reject: [V: -1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:59:52] <wikibugs>	 (CR) CI reject: [V: -1] DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:59:59] <wikibugs>	 (CR) CI reject: [V: -1] Flag when projects are missing wordmarks or icons [mediawiki-config] - https://gerrit.wikimedia.org/r/839680 (owner: Jdlrobson)
[21:00:53] <thcipriani>	 \o/ merged
[21:01:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) (owner: Jdlrobson)
[21:01:40] <Jdlrobson>	 yeyyya
[21:01:51] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:839577|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]]
[21:01:56] <stashbot>	 T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396
[21:02:14] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:839577|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:02:43] <thcipriani>	 Jdlrobson: alright, ^ should be on any of the mwdebug servers, check please
[21:02:56] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:02:57] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1004.eqiad.wmnet
[21:03:24] <Jdlrobson>	 thcipriani: looking
[21:03:51] <Jdlrobson>	 yep that did it! let's sync 
[21:03:57] <thcipriani>	 great! going
[21:08:00] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:839577|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] (duration: 06m 08s)
[21:08:04] <stashbot>	 T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396
[21:08:20] <thcipriani>	 ^ Jdlrobson all sync'd
[21:09:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:09:30] <Jdlrobson>	 thcipriani: thanks!
[21:09:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:09:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:11:21] <wikibugs>	 (PS6) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[21:11:23] <wikibugs>	 (PS3) Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223)
[21:11:25] <wikibugs>	 (PS3) Jdlrobson: Flag when projects are missing wordmarks or icons [mediawiki-config] - https://gerrit.wikimedia.org/r/839680
[21:11:27] <wikibugs>	 (PS2) Jdlrobson: DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223)
[21:11:33] <wikibugs>	 (CR) Jdlrobson: [C: -1] "I still need to handle redirects in this one (using symlinks ln -s)" [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[21:12:07] <wikibugs>	 (CR) CI reject: [V: -1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[21:12:20] <wikibugs>	 (CR) CI reject: [V: -1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[21:12:50] <wikibugs>	 (CR) CI reject: [V: -1] Flag when projects are missing wordmarks or icons [mediawiki-config] - https://gerrit.wikimedia.org/r/839680 (owner: Jdlrobson)
[21:12:52] <wikibugs>	 (CR) CI reject: [V: -1] DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[21:13:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:13:37] <wikibugs>	 SRE, Infrastructure-Foundations: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (bd808) #striker and/or #toolhub may have things that are worth copying for you here. #striker especially has a [[https://gerrit.wikimedia.org/r/plugins/gitiles/labs/striker/+/refs/heads/master/contr...
[21:14:20] <wikibugs>	 (PS7) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[21:14:59] <wikibugs>	 (PS1) Andrew Bogott: Remove refs to cloudnet100[34] [puppet] - https://gerrit.wikimedia.org/r/839706 (https://phabricator.wikimedia.org/T319682)
[21:15:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:16:12] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove refs to cloudnet100[34] [puppet] - https://gerrit.wikimedia.org/r/839706 (https://phabricator.wikimedia.org/T319682) (owner: Andrew Bogott)
[21:18:30] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (Andrew) a:Cmjohnson
[21:19:02] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (Andrew) a:Cmjohnson
[21:20:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:24:56] <wikibugs>	 (PS4) Dduvall: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997)
[21:28:02] <wikibugs>	 (CR) Dduvall: "Did quite a bit of refactoring and incorporated your feedback. I hope I didn't bloat the patch too much with the extra type definitions, b" [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[22:08:45] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic1073 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:08:45] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2080 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:08:45] <icinga-wm>	 /Search%23Administration
[22:11:40] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (Volans) >>! In T319682#8294169, @Andrew wrote: > cc @Volans regarding the failure to wipe the drives. Feel free to investigate/rerun this if yo...
[22:13:13] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2075 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:13:13] <icinga-wm>	 /Search%23Administration
[22:13:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:15:03] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2073 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:15:03] <icinga-wm>	 /Search%23Administration
[22:15:49] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:49] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic1068 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:49] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic1057 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:15:51] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:15:51] <icinga-wm>	 /Search%23Administration
[22:23:53] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:23:53] <icinga-wm>	 /Search%23Administration
[22:23:57] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2086 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:23:57] <icinga-wm>	 /Search%23Administration
[22:23:57] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2052 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:23:57] <icinga-wm>	 /Search%23Administration
[22:26:37] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic1054 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:29:01] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic1081 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:29:03] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic1074 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:29:05] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2054 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:29:05] <icinga-wm>	 /Search%23Administration
[22:33:47] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic1083 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:33:49] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2083 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:33:49] <icinga-wm>	 /Search%23Administration
[22:33:51] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2076 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.
[22:33:51] <icinga-wm>	 /Search%23Administration
[22:34:19] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic1076 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:53:23] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:55:41] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:57:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[23:10:56] <wikibugs>	 (PS1) BryanDavis: Use explicit 'latest' tags on upstream base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100)
[23:47:53] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (phaultfinder)