[00:00:25] (03Merged) 10jenkins-bot: php74: add many TTF fonts [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435) (owner: 10BryanDavis) [00:13:44] (03CR) 10Ssingh: [C: 03+1] lower TTL for phabricator from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [00:14:09] (03CR) 10Ssingh: [C: 03+1] lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [00:23:00] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:35:05] !log reedy@deploy1002 Started deploy [integration/docroot@13687ed]: More minor updates [00:35:35] !log reedy@deploy1002 Finished deploy [integration/docroot@13687ed]: More minor updates (duration: 00m 30s) [00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:50:57] (03PS1) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [00:51:59] (03CR) 10CI reject: [V: 04-1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [00:55:13] (03CR) 10Jdlrobson: "Hey James, Reedy and Tyler" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [00:58:08] (03CR) 10Jdlrobson: Automate icon generation (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) 
[01:03:06] !log reedy@deploy1002 Started deploy [integration/docroot@5cd2243]: Minor fixes [01:03:18] !log reedy@deploy1002 Finished deploy [integration/docroot@5cd2243]: Minor fixes (duration: 00m 12s) [01:12:28] !log reedy@deploy1002 Started deploy [integration/docroot@dc380cb]: Update jQuery [01:12:39] !log reedy@deploy1002 Finished deploy [integration/docroot@dc380cb]: Update jQuery (duration: 00m 11s) [01:24:12] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:52:32] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:54:52] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:56:48] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, phab1004, releases1002, releases2002, stat1004, stat1005, stat1007, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [05:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:46:19] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy 
patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) a:03Joe [05:55:40] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance Issue: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10Joe) 05Open→03Resolved a:03Joe Tentatively resolving because we've moved past php 7.2 and we seem to have reverted the php 7.2-only st... [06:00:05] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T0600). [06:04:33] 10SRE, 10MediaWiki-Parser, 10serviceops-radar, 10Performance-Team (Radar): purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Joe) [06:11:42] (03PS3) 10Elukey: role::kafka::logging: final clean up after migrating to PKI [puppet] - 10https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) [06:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:24:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 22616 [06:24:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37465/console" [puppet] - 10https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [06:24:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 22616 [06:25:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering 
with action 'configure' for AS: 6079 [06:26:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 6079 [06:27:26] (03PS1) 10Giuseppe Lavagetto: Remove php 7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/839324 [06:30:44] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:46:27] (03PS2) 10Muehlenhoff: wmcs::kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838831 (https://phabricator.wikimedia.org/T308013) [06:54:16] (03CR) 10Muehlenhoff: "One comment inline, rest looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [06:54:27] (03CR) 10Muehlenhoff: [C: 03+2] wmcs::kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838831 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [06:55:00] (03CR) 10Majavah: [C: 03+1] wmcs::metricsinfra: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838832 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [06:57:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1012.eqiad.wmnet with OS bullseye [06:57:56] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1012.eqiad.wmnet with OS bullseye [06:58:40] (03PS3) 10Muehlenhoff: swift: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838830 (https://phabricator.wikimedia.org/T308013) [06:59:50] (03PS2) 10Muehlenhoff: bgpalerter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837070 [07:00:05] Amir1, apergos, and jnuche: It is that lovely time of the day again! 
You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T0700). [07:00:12] morning! there are no trainees signed up for the window and no deployments on the calendar for the window either. [07:05:43] (03CR) 10Muehlenhoff: [C: 03+2] swift: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838830 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:06:26] (03CR) 10Muehlenhoff: [C: 03+2] bgpalerter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837070 (owner: 10Muehlenhoff) [07:11:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1012.eqiad.wmnet with reason: host reimage [07:14:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1012.eqiad.wmnet with reason: host reimage [07:15:31] !log draining ganeti1005 T311687 [07:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:35] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [07:30:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1012.eqiad.wmnet with OS bullseye [07:30:37] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1012.eqiad.wmnet with OS bullseye completed: - ganeti1012 (**PASS**) - Downtimed on... 
[07:36:40] !log draining ganeti1026 T311687 [07:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:45] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [07:42:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet [07:47:12] (03PS1) 10Cathal Mooney: Depool esams in gdns prior to reboot of line card [dns] - 10https://gerrit.wikimedia.org/r/839396 (https://phabricator.wikimedia.org/T318783) [07:48:59] (03CR) 10Ayounsi: [C: 03+1] Depool esams in gdns prior to reboot of line card [dns] - 10https://gerrit.wikimedia.org/r/839396 (https://phabricator.wikimedia.org/T318783) (owner: 10Cathal Mooney) [07:49:22] (03CR) 10Cathal Mooney: [C: 03+2] Depool esams in gdns prior to reboot of line card [dns] - 10https://gerrit.wikimedia.org/r/839396 (https://phabricator.wikimedia.org/T318783) (owner: 10Cathal Mooney) [07:50:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet [07:50:11] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::logging: final clean up after migrating to PKI [puppet] - 10https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [07:50:17] !log De-pooling esams in advance of cr2-esams line card reboot [07:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:05] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Aklapper) @Arnoldokoth: This isn't resolved yet, see https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group [07:56:07] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Aklapper) @Arnoldokoth: This isn't resolved yet, see https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group [07:57:44] (03PS1) 10KartikMistry: ContentTranslation: Make Mongolian Wikipedia MT 
stricter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) [07:59:28] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10SLyngshede-WMF) a:03SLyngshede-WMF [08:00:08] !log delete /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.jks on kafka-logging1001 and restart (old puppet cert + settings deleted) [08:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:36] (03CR) 10KartikMistry: "If I've understand correctly, 89% is OK when task says by 10% stricter (default is 99%)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: 10KartikMistry) [08:01:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1012.eqiad.wmnet to cluster eqiad and group C [08:05:39] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [08:06:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the review! This will self-deploy on puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [08:06:34] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Remove ATS 8-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/838886 (owner: 10BCornwall) [08:07:12] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall!" 
[alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [08:09:03] !log kafka logging old cert cleanup - `cumin 'A:kafka-logging' 'rm -f /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.jks'` [08:09:03] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for gdal [puppet] - 10https://gerrit.wikimedia.org/r/838842 (owner: 10Muehlenhoff) [08:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I think we should be fine even if some exporter restarts" [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:10:02] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [08:10:29] !log restart kafka on kafka-logging1002 to reload the config (cleanup old super.users related to past keystore) [08:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:11] 10SRE, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (10MatthewVernon) [08:12:06] (03CR) 10Filippo Giunchedi: confd: export template status as Prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi) [08:12:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [08:12:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:15:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 
208.80.154.197, interfaces up: 244, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:20:41] ^^ Arelion circuit IC-314533, carrier maintenance it seems, ref PWIC223124. [08:20:51] I'm proceeding with reboot of cr2-esams line card [08:21:28] (03CR) 10Elukey: "Hi Ben! One thing that may be good to do is to split the Docker file into multiple ones, see what it has been done for istio for example. " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [08:21:34] !log disabling external BGP sessions on cr2-esams prior to line card reboot [08:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:29] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:24:22] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: line card reboot [08:24:29] (03CR) 10Elukey: "This operator image may go under the same spark namespace, if we had it, to find all docker images in one place (see comments on the spark" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [08:24:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: line card reboot [08:25:04] !log disabling OSPF on cr2-esams [08:25:17] (03CR) 10Jelto: [C: 03+2] gitlab: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838829 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:25:25] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1029 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/838840 (https://phabricator.wikimedia.org/T299459) (owner: 
10Muehlenhoff) [08:26:49] PROBLEM - Host lvs3006 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:49] PROBLEM - Host lvs3005 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:05] PROBLEM - Host lvs3007 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:09] PROBLEM - Host prometheus3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:37] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:26] topranks: expected!?!?!? [08:28:37] PROBLEM - Host dns3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:41] PROBLEM - Host bast3005 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:41] PROBLEM - Host durum3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:43] PROBLEM - Host cp3050 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:43] PROBLEM - Host cp3051 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:43] PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:43] PROBLEM - Host cp3062 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:43] PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:44] PROBLEM - Host cp3052 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:44] PROBLEM - Host cp3056 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:45] PROBLEM - Host cp3058 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:45] PROBLEM - Host cp3059 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:46] PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:46] PROBLEM - Host cp3063 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:47] PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:47] PROBLEM - Host cp3064 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:48] PROBLEM - Host ganeti3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:48] PROBLEM - Host cp3060 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:49] PROBLEM - Host ganeti3002 is DOWN: PING CRITICAL - Packet 
loss = 100% [08:28:49] PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:50] PROBLEM - Host dns3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:50] PROBLEM - Host doh3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:51] PROBLEM - Host durum3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:51] PROBLEM - Host ncredir3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:51] PROBLEM - Host doh3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:53] PROBLEM - Host ganeti3003 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:57] PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:57] * volans preparing depool patch [08:28:57] PROBLEM - Host install3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:59] No... some disruption is not unexpected but OSPF should converge quickly [08:29:13] Traffic from US getting to cr3-esams as expected but not getting next-hop / reply [08:29:37] PROBLEM - Host ncredir3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:37] PROBLEM - Host netflow3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:46] topranks: o/ I was about to ask, I have trouble ssh-ing to a bastion via init7 -> cr3-esams [08:30:02] ah it's already depooled, you got me for a sec [08:30:05] hoo: gettimeofday() says it's time for Wikibase client unexpectedUnconnectedPage page prop format conversion. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T0830) [08:30:05] (bast3005 then I tried bast1003) [08:30:05] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - [08:30:05] , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:30:23] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:30:31] (03PS1) 10Muehlenhoff: eventlogging: Update includes to current styleguide [puppet] - 10https://gerrit.wikimedia.org/r/839428 [08:30:37] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:37] PROBLEM - Host ripe-atlas-esams is DOWN: PING CRITICAL - Packet loss = 100% [08:30:38] PROBLEM - Host ripe-atlas-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:38] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:30:45] Something odd happening, connection to the OOB has dropped on me. [08:30:50] * topranks investigating [08:30:59] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:31:31] topranks: can I help? do you need an incident doc? do we need to do something about ns2.w.o? 
[08:32:45] RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 107.04 ms [08:32:45] RECOVERY - Host cp3055 is UP: PING OK - Packet loss = 0%, RTA = 107.13 ms [08:32:45] RECOVERY - Host cp3050 is UP: PING OK - Packet loss = 0%, RTA = 107.16 ms [08:32:45] RECOVERY - Host cp3059 is UP: PING OK - Packet loss = 0%, RTA = 107.03 ms [08:32:45] RECOVERY - Host cp3051 is UP: PING OK - Packet loss = 0%, RTA = 107.40 ms [08:32:46] RECOVERY - Host cp3064 is UP: PING OK - Packet loss = 0%, RTA = 107.04 ms [08:32:46] RECOVERY - Host cp3058 is UP: PING OK - Packet loss = 0%, RTA = 107.47 ms [08:32:47] RECOVERY - Host cp3056 is UP: PING OK - Packet loss = 0%, RTA = 106.98 ms [08:32:47] RECOVERY - Host cp3052 is UP: PING OK - Packet loss = 0%, RTA = 107.05 ms [08:32:48] RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 107.42 ms [08:32:48] RECOVERY - Host cp3060 is UP: PING OK - Packet loss = 0%, RTA = 107.07 ms [08:32:49] RECOVERY - Host cp3057 is UP: PING OK - Packet loss = 0%, RTA = 107.14 ms [08:32:49] RECOVERY - Host cp3063 is UP: PING OK - Packet loss = 0%, RTA = 107.07 ms [08:32:49] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:32:49] RECOVERY - Host lvs3007 is UP: PING OK - Packet loss = 0%, RTA = 107.07 ms [08:32:50] RECOVERY - Host cp3061 is UP: PING OK - Packet loss = 0%, RTA = 107.21 ms [08:32:51] RECOVERY - Host cp3065 is UP: PING OK - Packet loss = 0%, RTA = 107.06 ms [08:32:51] RECOVERY - Host cp3062 is UP: PING OK - Packet loss = 0%, RTA = 107.06 ms [08:32:52] RECOVERY - Host durum3002 is UP: PING OK - Packet loss = 0%, RTA = 107.44 ms [08:32:52] RECOVERY - Host ncredir3001 is UP: PING OK - Packet loss = 0%, RTA = 107.43 ms [08:32:53] RECOVERY - Host lvs3005 is UP: PING OK - 
Packet loss = 0%, RTA = 107.12 ms [08:32:53] RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 107.50 ms [08:32:54] RECOVERY - Host ping3002 is UP: PING WARNING - Packet loss = 80%, RTA = 862.40 ms [08:32:54] RECOVERY - Host dns3002 is UP: PING OK - Packet loss = 0%, RTA = 107.26 ms [08:32:55] RECOVERY - Host doh3001 is UP: PING OK - Packet loss = 0%, RTA = 107.41 ms [08:32:55] RECOVERY - Host doh3002 is UP: PING OK - Packet loss = 0%, RTA = 107.43 ms [08:32:55] RECOVERY - Host ganeti3003 is UP: PING OK - Packet loss = 0%, RTA = 107.35 ms [08:32:56] RECOVERY - Host lvs3006 is UP: PING OK - Packet loss = 0%, RTA = 107.64 ms [08:32:59] RECOVERY - Host ncredir3002 is UP: PING OK - Packet loss = 0%, RTA = 107.48 ms [08:33:01] RECOVERY - Host bast3005 is UP: PING OK - Packet loss = 0%, RTA = 107.48 ms [08:33:05] RECOVERY - Host ganeti3002 is UP: PING OK - Packet loss = 0%, RTA = 107.15 ms [08:33:05] RECOVERY - Host install3001 is UP: PING OK - Packet loss = 0%, RTA = 107.65 ms [08:33:09] RECOVERY - Host netflow3002 is UP: PING OK - Packet loss = 0%, RTA = 107.29 ms [08:33:13] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 19, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:33:19] RECOVERY - Host prometheus3001 is UP: PING OK - Packet loss = 0%, RTA = 107.35 ms [08:33:21] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 107.15 ms [08:33:29] RECOVERY - Host durum3001 is UP: PING OK - Packet loss = 0%, RTA = 107.74 ms [08:33:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:33:33] RECOVERY - Host ganeti3001 is UP: PING OK - Packet loss = 0%, RTA = 107.08 ms [08:34:02] there was a small increase in 5XX and NELs [08:34:16] https://grafana.wikimedia.org/goto/EfwM6U44z?orgId=1 [08:35:07] yep [08:35:11] PROBLEM - OSPF status 
on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:35:29] RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 107.15 ms [08:35:47] RECOVERY - Host ripe-atlas-esams is UP: PING OK - Packet loss = 0%, RTA = 107.27 ms [08:35:47] RECOVERY - Host ripe-atlas-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 107.44 ms [08:35:47] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 107.03 ms [08:41:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 245, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:03] volans: I should have done VRRP first which meant this took longer than it ought to flip over. [08:41:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Slst2020) 05Open→03Resolved a:03Slst2020 Thank you, closing now! [08:41:03] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:41:03] !log installing puma security updates [08:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:03] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 216 probes of 634 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:41:03] jynus: the interesting part is that I can't see that increase in logstash [08:41:03] https://logstash.wikimedia.org/app/dashboards#/view/ee6432c0-82a9-11eb-9d45-739221ba7fb6?_g=h@42b0d52&_a=h@c3f9414 [08:41:03] for NEL [08:48:30] as for 5xx... 
still looking [08:48:50] the one in the grafana home uses varnish_requests [08:48:57] with method!="PURGE", status=~"5.." [08:50:11] (03CR) 10Btullis: Add a spark-operator production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [08:50:15] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:50:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:51:28] most of the NEL errors are from one russian ISP, my guess is that they don't respect the DNS TTL, and were still going to esams [08:51:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:51:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:51:51] maybe a DNS MITM or the such [08:52:24] 5xx seems that we just got some 502s on esams itself [08:52:25] https://grafana-rw.wikimedia.org/d/000000464/varnish-aggregate-client-status-code?orgId=1&from=now-1h&to=now&var-site=codfw&var-site=drmrs&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST&viewPanel=2 [08:52:35] XioNoX: yes I'm inclined to agree, they are for IP address of upload-lb.esams.wikimedia.org., CNAME should have moved them to drmrs [08:52:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:52:42] (compare with just selecting esams, the others don't affect the 502 spike) but is small [08:52:43] (03PS2) 10JMeybohm: k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) [08:52:49] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : 
OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:53:15] volans: yes. makes sense in my head I think I will proceed with reboot and then set things back to normal. [08:53:44] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for three wikis (duration: 04m 03s) [08:53:48] topranks: ack [08:54:45] !log rebooting line card fpc 0 on cr2-esams (T318783) [08:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:49] T318783: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 [08:55:13] Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for ruwiktionary [08:55:26] (03PS1) 10Muehlenhoff: docker: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839447 [08:56:49] Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for specieswiki [08:56:51] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:58:37] (03PS1) 10Muehlenhoff: labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449 [08:58:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:58:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:58:55] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:59:11] (03CR) 10CI reject: [V: 04-1] labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449 (owner: 10Muehlenhoff) [08:59:23] <_joe_> !log uploaded new php 7.4 packages T318918 [08:59:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:00:19] !log Ran extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for ruwiktionary [09:00:28] !log Ran extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for specieswiki [09:01:06] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for cebwiki [09:01:27] (03PS2) 10Muehlenhoff: labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449 [09:03:40] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 84 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37467/console" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:04:21] T318918: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 [09:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:55] !log re-pooling esams after cr2-esams line card reboot [09:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:15] (03PS1) 10Cathal Mooney: Revert "Depool esams in gdns prior to reboot of line card" [dns] - 10https://gerrit.wikimedia.org/r/839022 [09:06:28] (03CR) 10Btullis: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [09:06:57] PROBLEM - Check systemd state on ganeti1029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:37] (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839454 [09:08:59] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) Reboot completed sucessfully, currently router not showing any alarms: ` root@re0.cr2-esams> show system alarms No alarms currently active `... [09:11:34] (03PS3) 10Muehlenhoff: prometheus: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) [09:12:13] (03CR) 10Hashar: [C: 04-2] "Gerrit 3.4.6 has been released and includes my patch to add a public getter \o/ I will get our instance upgraded via T319513." [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [09:13:09] (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839454 (owner: 10Hoo man) [09:14:00] (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839454 (owner: 10Hoo man) [09:15:56] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool esams in gdns prior to reboot of line card" [dns] - 10https://gerrit.wikimedia.org/r/839022 (owner: 10Cathal Mooney) [09:16:24] 10SRE, 10Data Engineering Planning, 10Data Pipelines, 10Foundational Technology Requests, 10User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10fgiunchedi) I have resumed work on this a little bit and produced a worked example using... 
[09:17:03] RECOVERY - Check systemd state on ganeti1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:25] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for nine wikis (duration: 03m 41s) [09:18:44] (03CR) 10JMeybohm: [V: 03+1 C: 04-1] "We still have buster hosts with kubernetes-client installed. I'll create the versioned components there as well and copy the packages so w" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:19:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:20:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:20:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:20:54] (03CR) 10Muehlenhoff: prometheus: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:20:56] (03CR) 10Muehlenhoff: [C: 03+2] prometheus: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838834 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:21:34] <_joe_> !log installed the upgraded php package to mw1414, T318918 [09:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:38] T318918: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 [09:21:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:21:50] (03PS3) 10JMeybohm: k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) [09:21:52] (03PS1) 10JMeybohm: aptrepo: Create versioned kubernets components for buster [puppet] - 
10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943) [09:22:03] (03PS1) 10Jelto: buildkit: add no_proxy for wmf domains [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) [09:22:05] PROBLEM - Check systemd state on mw2368 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:17] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for nlwiktionary, ruwiki, jawiki [09:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:28] (03PS2) 10JMeybohm: aptrepo: Create versioned kubernetes components for buster [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943) [09:22:30] (03PS4) 10JMeybohm: k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) [09:23:12] (03CR) 10Muehlenhoff: [C: 03+2] wmcs::metricsinfra: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/838832 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:23:57] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:24:07] RECOVERY - Check systemd state on mw2368 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:17] <_joe_> not sure what happened there tbh [09:27:17] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37468/console" [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [09:28:10] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for 
viwiki, metawiki, frwiktionary [09:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:47] RECOVERY - Check systemd state on mw2290 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:31:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) >>! In T319184#8288137, @cmooney wrote: > [..] > Anyway thought I'd mention just in case you weren't aware. Thanks, double checking this now.... [09:32:05] (03CR) 10JMeybohm: [C: 03+2] aptrepo: Create versioned kubernetes components for buster [puppet] - 10https://gerrit.wikimedia.org/r/839456 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:32:21] !log installing python-oslo.utils security updates [09:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:40] (03CR) 10Jelto: [V: 03+1] "Like discussed in yesterdays troubleshooting session." 
[puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [09:34:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet [09:35:37] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/839447 (owner: 10Muehlenhoff) [09:39:35] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:40:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 7 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37470/console" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:41:30] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1005.eqiad.wmnet [09:44:50] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "Looks like releases and contint don't use packages_from_future. Not ideal but fine for now. We will have to refactor the version selection" [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:48:03] (03CR) 10Jelto: [V: 03+1 C: 03+2] buildkit: add no_proxy for wmf domains [puppet] - 10https://gerrit.wikimedia.org/r/839457 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [09:48:29] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10SLyngshede-WMF) In spicerack we'll add a "skip_acked=False" to the wait_for_optimal and "acked" properties to HostStatus and HostsStatus datatypes. When skip_a... 
[09:51:25] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:52:35] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for itwiki, arzwiki, ptwiki [09:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:09] RECOVERY - Check systemd state on mw1426 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:45] (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839468 [09:56:11] RECOVERY - Check systemd state on mw1434 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:26] (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839468 (owner: 10Hoo man) [09:56:45] RECOVERY - Check systemd state on mw1446 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:59] RECOVERY - Check systemd state on mw2319 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:07] RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:11] (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839468 (owner: 10Hoo man) [09:57:19] !log installing glib2.0 security updates on buster [09:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] mvolz: #bothumor I � Unicode. 
All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1000). [10:00:48] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jmads out of all services on: 1213 hosts [10:01:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jmads out of all services on: 1213 hosts [10:02:00] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for all wikis (duration: 03m 39s) [10:02:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:03:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:03:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:03:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:05:11] RECOVERY - Disk space on moscovium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=moscovium&var-datasource=eqiad+prometheus/ops [10:05:38] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jmads out of all services on: 799 hosts [10:06:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jmads out of all services on: 799 hosts [10:06:47] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging NOkafor out of all services on: 799 hosts [10:07:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging NOkafor out of all services on: 799 hosts [10:07:23] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging NOkafor out of all services on: 1213 hosts [10:07:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging NOkafor out of all services on: 1213 hosts [10:11:52] !log Running 
extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all remaining wikis [10:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:07] !log installing ruby-rack security updates [10:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:47] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) Plan of action: General overview before/after. Red: deactivated/removed. Green: activated/added. {F35550079} We're... [10:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:23:38] (03PS1) 10Volans: sre.dns.wipe-cache: add sudo to the command [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) [10:26:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [10:30:43] !log restart kafka on kafka-logging1003 to reload the conifg (cleanup old super.users related to past keystore) [10:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [10:37:42] 10SRE, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident 
Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (10jbond) While implementing the the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/768723/31/modules/varnish/templates/upload-fr... [10:40:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [10:40:45] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [10:45:59] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/839428 (owner: 10Muehlenhoff) [10:47:01] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1005 is CRITICAL: CRITICAL: no netns defined? 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:49:23] (03PS5) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) [10:49:45] (03CR) 10Jbond: C:postgress::server: add replication slot support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [10:50:06] (03PS6) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) [10:51:39] <_joe_> !log installing the upgraded php package everywhere, T318918 [10:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:44] T318918: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 [10:52:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 8 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37471/console" [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [10:53:34] (03PS4) 10Jbond: O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) [10:53:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [10:57:56] (03PS1) 10Urbanecm: eswiki: Enable Growth mentorship for 25% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839485 (https://phabricator.wikimedia.org/T285235) [10:58:13] !log disable puppet temporarily to deploy a puppetdb change 814824 [10:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37472/console" [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [11:00:50] (03PS1) 10Vgutierrez: trafficserver: Allow partioning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) [11:01:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [11:02:51] (03PS2) 10Vgutierrez: trafficserver: Allow partitiooning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) [11:03:43] (03PS3) 10Vgutierrez: trafficserver: Allow partitiooning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) [11:04:43] (03PS1) 10Jbond: P:postgresql:: master: correct sql statment [puppet] - 10https://gerrit.wikimedia.org/r/839488 [11:04:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:postgresql:: master: correct sql statment [puppet] - 10https://gerrit.wikimedia.org/r/839488 (owner: 10Jbond) [11:05:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37473/console" [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [11:07:23] (03PS1) 10Jbond: P:postgresql::master: use correct unless statment [puppet] - 10https://gerrit.wikimedia.org/r/839489 [11:07:42] (03CR) 10Jbond: [C: 03+2] P:postgresql::master: use correct unless statment [puppet] - 10https://gerrit.wikimedia.org/r/839489 (owner: 10Jbond) [11:07:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:postgresql::master: use correct unless statment [puppet] - 10https://gerrit.wikimedia.org/r/839489 (owner: 
10Jbond) [11:08:13] (03PS4) 10Vgutierrez: trafficserver: Allow partitioning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) [11:08:15] (03PS1) 10Vgutierrez: trafficserver: Enable cache partitioning in cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/839490 (https://phabricator.wikimedia.org/T317748) [11:12:30] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) [11:13:10] (03CR) 10CI reject: [V: 04-1] openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez) [11:13:25] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) [11:14:04] (03CR) 10CI reject: [V: 04-1] openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez) [11:14:53] (03PS3) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) [11:15:32] (03CR) 10CI reject: [V: 04-1] openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez) [11:15:53] (03PS1) 10Jbond: P:puppetdb: correct slot name on master [puppet] - 10https://gerrit.wikimedia.org/r/839494 [11:16:17] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetdb: correct slot name on master [puppet] - 10https://gerrit.wikimedia.org/r/839494 (owner: 
10Jbond) [11:16:48] (03PS4) 10Arturo Borrero Gonzalez: openstack: neutron: refresh bridge ifupdown code to handle ordering [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) [11:18:51] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:21:19] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) 05Open→03Resolved [11:22:29] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1004.eqiad.wmnet [11:27:32] !log cold-reset the BMC on analytics1076 [11:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:40] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [11:27:45] !log switch puppetdb replication to use replications slots [11:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:08] !log enable puppet post deploy puppetdb change 814824 [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:57] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:29:13] (03PS1) 10Giuseppe Lavagetto: Stop assigning the PHP_ENGINE cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839499 (https://phabricator.wikimedia.org/T271736) [11:30:11] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) [11:30:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:postgress::server: add replication slot support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [11:31:30] 10Puppet, 
10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [11:31:34] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb postgress: Improve postgress standby server - https://phabricator.wikimedia.org/T313217 (10jbond) 05Open→03Resolved a:03jbond puppetdb has now been migrated to use replication slots [11:31:39] (03PS1) 10Matthias Mullie: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) [11:32:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:32:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1004.eqiad.wmnet [11:33:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez) [11:34:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/37474/" [puppet] - 10https://gerrit.wikimedia.org/r/839492 (https://phabricator.wikimedia.org/T319524) (owner: 10Arturo Borrero Gonzalez) [11:35:46] (03PS8) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [11:44:56] 10SRE, 10conftool: Add suopport to use different vsthrottle keys - https://phabricator.wikimedia.org/T319533 (10jbond) [11:45:03] 10SRE, 10conftool: Add suopport to use different vsthrottle keys - https://phabricator.wikimedia.org/T319533 (10jbond) p:05Triage→03Medium [11:55:15] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1005 is CRITICAL: CRITICAL: no netns defined? 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:28:31] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [12:32:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [12:32:04] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: [12:32:04] t per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:32:53] (03PS1) 10Vgutierrez: varnish: Add sessioncookie bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324) [12:32:54] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [12:34:53] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [12:36:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:39:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:40:38] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll 
restart of jvm daemons. [12:42:59] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [12:43:27] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [12:45:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1029.eqiad.wmnet [12:47:01] (03PS3) 10Hashar: Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) [12:47:37] (03CR) 10Jelto: [C: 04-1] "lookup for http_proxy fields returns empty string. Added some comments in-line." [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [12:52:27] ACKNOWLEDGEMENT - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is C [12:52:27] Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the lagacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:27] ACKNOWLEDGEMENT - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITIC [12:52:28] Get per article page views returned the 
unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:29] ACKNOWLEDGEMENT - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is C [12:52:30] Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:31] ACKNOWLEDGEMENT - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITIC [12:52:32] Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:55] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Lucas_Werkmeister_WMDE) 05Resolved→03Open As far as I can tell, this is done in production (thanks Joe!), but not yet in CI – a change I just...
[12:53:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero) [12:54:00] (03PS1) 10Hashar: Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) [12:54:21] (03CR) 10CI reject: [V: 04-1] Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [12:54:40] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1006.eqiad.wmnet [12:56:22] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) 05Open→03Resolved a:03elukey The kafka logging clusters have the new PKI configurati... 
[12:56:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1026.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage [12:56:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1026.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage [12:58:14] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [12:59:01] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [12:59:29] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [12:59:41] 10SRE, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Consider adding X-Analytics subfield for 'has a session cookie' - https://phabricator.wikimedia.org/T319324 (10Vgutierrez) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1300). Please do the needful. [13:00:05] stephanebisson and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:12] o/ [13:00:25] o/ [13:00:26] Hello [13:00:27] I can deploy today [13:00:29] hello [13:01:07] (03PS5) 10Urbanecm: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [13:01:16] if matthiasmullie is around, we could start with his patch. I need some time to get ready [13:01:21] okay [13:01:31] matthiasmullie: hi, are you around? [13:02:02] PROBLEM - Host ganeti1029 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:52] ^^ expected? [13:04:07] matthiasmullie: ping #2, are you around for your deployment? [13:04:10] hmm right, that's moritzm [13:04:11] urbanecm, ok we can do mine [13:04:15] okay [13:04:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [13:04:33] ganeti1029 is expired downtime, all is well [13:04:38] RECOVERY - Host ganeti1029 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:04:55] I'm glad Icinga concurs :-) [13:05:13] (03Merged) 10jenkins-bot: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [13:05:45] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:826882|Explicit config for Wikistories discovery module (T314582)]] [13:05:49] T314582: Make Wikistories configurable for public release - https://phabricator.wikimedia.org/T314582 [13:06:08] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudnet1006.eqiad.wmnet with OS bullseye [13:06:10] !log urbanecm@deploy1002 urbanecm and sbisson: Backport for [[gerrit:826882|Explicit config for Wikistories discovery module (T314582)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2001.codfw.wmnet [13:06:20] (03PS7) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [13:06:21] stephanebisson: can you check it at a debug server? [13:06:24] o/ [13:06:27] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [13:06:37] @urbanecm sorry for showing up late, missed notification :p [13:06:40] no worries! [13:06:41] urbanecm mwdebug1002? [13:06:51] stephanebisson: yup! [13:06:52] (03CR) 10BCornwall: ats: Alert on high connection/request count (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [13:07:14] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) just putting a note here. after looking at the [[ https://galaxy.ansible.com/dellemc/openm... [13:07:53] (03Merged) 10jenkins-bot: Merge tag 'v3.4.6' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839504 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [13:08:10] urbanecm looks good, you can sync [13:08:12] PROBLEM - configured eth on ganeti1029 is CRITICAL: public reporting no carrier.
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:08:16] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [13:08:17] stephanebisson: great, syncing [13:08:33] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [13:09:01] (03CR) 10Urbanecm: [C: 03+2] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [13:09:05] (03PS2) 10Urbanecm: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [13:09:09] (03CR) 10CI reject: [V: 04-1] ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [13:09:11] (03CR) 10Urbanecm: [C: 03+2] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [13:09:54] (03Merged) 10jenkins-bot: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [13:11:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:12:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [13:12:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:826882|Explicit config for Wikistories discovery module 
(T314582)]] (duration: 06m 37s) [13:12:26] T314582: Make Wikistories configurable for public release - https://phabricator.wikimedia.org/T314582 [13:12:37] stephanebisson: your patch's live! [13:12:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839500 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [13:12:46] urbanecm thank you! [13:12:50] no problem! [13:12:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:07] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:839500|Show thumbnails on Special:Search for NS_FILE + PageImages (T306883)]] [13:13:10] T306883: [L] Searchers see thumbnails next to search results on the special:search page - https://phabricator.wikimedia.org/T306883 [13:13:30] !log urbanecm@deploy1002 urbanecm and mlitn: Backport for [[gerrit:839500|Show thumbnails on Special:Search for NS_FILE + PageImages (T306883)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:13:30] (03CR) 10Ssingh: [C: 03+1] "Verified the config and the volume.config output file." [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [13:13:54] matthiasmullie: can you check at mwdebug1002 please? [13:13:58] (03CR) 10Ssingh: [C: 03+1] trafficserver: Enable cache partitioning in cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/839490 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [13:14:12] urbanecm: LGTM! [13:14:20] that was quick, syncing! [13:14:28] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:14:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:14:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) I don't think it's true to say the VRRP is over VXLAN here, the VRRP... [13:15:53] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Allow partitioning the cache storage in several volumes [puppet] - 10https://gerrit.wikimedia.org/r/839486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [13:15:58] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [13:16:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1006.eqiad.wmnet [13:16:41] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [13:17:25] @urbanecm thanks! 
[13:17:56] !log partition ats-be cache in cp6008 - T317748 [13:17:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:00] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [13:18:14] (03CR) 10Hashar: [C: 03+2] Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [13:18:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:839500|Show thumbnails on Special:Search for NS_FILE + PageImages (T306883)]] (duration: 05m 12s) [13:18:23] T306883: [L] Searchers see thumbnails next to search results on the special:search page - https://phabricator.wikimedia.org/T306883 [13:18:36] (03Merged) 10jenkins-bot: Update Gerrit to v3.4.6 [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839515 (https://phabricator.wikimedia.org/T319513) (owner: 10Hashar) [13:18:48] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Enable cache partitioning in cp6008 [puppet] - 10https://gerrit.wikimedia.org/r/839490 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [13:18:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:18:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:11] matthiasmullie: should be live! [13:19:16] RECOVERY - configured eth on ganeti1029 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:19:27] anything else? 
[13:19:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [13:19:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:19:59] (03CR) 10Hnowlan: [C: 04-1] Update the logic to run test coverage (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [13:20:16] !log draining ganeti1014 T311687 [13:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:20] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [13:20:53] !log UTC afternoon backport window done [13:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:58] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:25:10] (03PS1) 10Muehlenhoff: Make ganeti1030 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/839521 (https://phabricator.wikimedia.org/T299459) [13:36:36] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) cableid c220756659 fpc2 - fpc8. [13:41:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [13:42:07] \o/ [13:46:08] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) Both clusters are running PKI and today I have also ran the following clean up steps: 1) removed the old puppet ce... 
[13:47:01] (03PS1) 10Hnowlan: Add missing prod dependencies [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/839548 (https://phabricator.wikimedia.org/T233196) [13:48:19] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1007.eqiad.wmnet [13:56:23] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Remove all debian version if-guarding [puppet] - 10https://gerrit.wikimedia.org/r/839435 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:00:19] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 3 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:00:26] (03PS1) 10Clément Goubert: Add build instructions in debian/README [debs/helm3] - 10https://gerrit.wikimedia.org/r/839550 [14:00:32] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10hashar) James has made the necessary CI updates and I have deployed them. [14:01:08] (03CR) 10CI reject: [V: 04-1] Add build instructions in debian/README [debs/helm3] - 10https://gerrit.wikimedia.org/r/839550 (owner: 10Clément Goubert) [14:01:15] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [14:03:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:03:52] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts aqs1007.eqiad.wmnet [14:04:21] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Joe) 05Open→03Resolved Sorry @LucasWerkmeister I assumed this task was about updating production. 
Re-resolving then :) [14:04:41] (03PS1) 10Hashar: Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) [14:04:57] (03CR) 10Hashar: [C: 03+2] Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) (owner: 10Hashar) [14:05:21] (03Merged) 10jenkins-bot: Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) (owner: 10Hashar) [14:06:22] I am going to upgrade Gerrit from 3.4.5 to 3.4.6 [14:07:46] !log updating HAProxy to version 2.4.19 in ulsfo [14:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:09] (03PS1) 10Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/839554 [14:08:32] !log hashar@deploy1002 Started deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit2002 [14:08:33] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Lucas_Werkmeister_WMDE) The change for T316923 is passing in CI now (currently going through test-and-submit), so I think this is indeed done. Th... 
[14:08:43] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit2002 (duration: 00m 10s) [14:08:44] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [14:10:02] (03CR) 10CI reject: [V: 04-1] Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/839554 (owner: 10Clément Goubert) [14:11:31] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10aborrero) [14:12:24] (03CR) 10Jelto: [C: 03+1] "lgtm, see one in-line comment" [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [14:12:34] !log Upgrading primary Gerrit # T319513 [14:12:38] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [14:12:38] !log hashar@deploy1002 Started deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit1001 [14:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:39] T319513: Upgrade Gerrit to 3.4.6 - https://phabricator.wikimedia.org/T319513 [14:12:46] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@132ac68]: Gerrit to 3.4.6 on gerrit1001 (duration: 00m 08s) [14:13:55] !log move asw2-c-eqiad<->cr1 link to new 40G link - T313385 [14:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:27] <_joe_> hashar: gerrit is still down FWIW [14:15:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:41] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [14:15:50] PROBLEM - 
Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:36] !log Gerrit upgraded from 3.4.5 to 3.4.6 # T319513 [14:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:52] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:18:02] (03PS1) 10Jcrespo: Revert "dbbackups: Test mariadb 10.6 on a (currently passive) backup source" [puppet] - 10https://gerrit.wikimedia.org/r/839566 (https://phabricator.wikimedia.org/T318062) [14:18:26] Invalid cookie header: "set-cookie: WMF-Last-Access=06-Oct-2022;Path=/;HttpOnly;secure;Expires=Mon, 07 Nov 2022 12:00:00 GMT". Invalid 'expires' attribute: Mon, 07 Nov 2022 12:00:00 GMT [14:18:29] fun :) [14:18:44] looks like that cookie is set for all of wikimedia.org and ends up hitting Gerrit as well [14:19:13] hashar: what's issuing the error? 
[14:19:25] that cookie is set by varnish [14:19:31] the Jetty server in Gerrit [14:19:38] (03PS2) 10Jcrespo: Revert "dbbackups: Test mariadb 10.6 on a (currently passive) backup source" [puppet] - 10https://gerrit.wikimedia.org/r/839566 (https://phabricator.wikimedia.org/T318062) [14:19:42] that's a bogus client messing with you [14:19:42] (03PS1) 10Elukey: admin_ng: set proper TLS egress origination settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839587 [14:19:44] (03PS1) 10Elukey: ml-services: move eventgate config to TLS egress origination [deployment-charts] - 10https://gerrit.wikimedia.org/r/839588 [14:19:56] a client should send "Cookie" rather than "Set-Cookie" [14:19:58] yeah it looks harmless, we had it before the Gerrit upgrade [14:20:18] Set-Cookie is meant to be used by a server, not a UA [14:20:53] (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Test mariadb 10.6 on a (currently passive) backup source" [puppet] - 10https://gerrit.wikimedia.org/r/839566 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [14:20:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:22:38] vgutierrez: oh nice. I have no idea where it comes from though, maybe I will dig into it later :) It is a single user so far so probably not a concern in any way [14:22:41] thanks!
[14:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:25:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:21] hashar: the CDN sets WMF-Last-Access here, https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/analytics.inc.vcl.erb#L55-L62 [14:25:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:26:04] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) So, we have a need to move on this pretty quickly, as we have 16 new cache hosts in ulsfo pending installs on this, and then 16 more in eqsin righ... 
[14:26:29] (03CR) 10Elukey: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [14:27:03] hashar: and by RFC 6265 https://httpwg.org/specs/rfc6265.html#sane-set-cookie is intended as a server -> client header [14:28:02] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudnet1006.eqiad.wmnet with OS bullseye [14:29:20] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) I'll take care of "Create a buster-based 4.19+5.10 boot image " tomorrow. [14:29:47] vgutierrez: I think Gerrit is internally confused somehow cause I see that message for ssh commands or clients doing a `git push` over ssh [14:29:56] vgutierrez: thanks for the refs :] [14:30:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:12] hashar: that's weird [14:30:21] !log moving eqiad row C vrrp mastership to cr1-eqiad [14:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:52] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) Ok yeah I see what is going on. Cloudnet1005 is running VXLAN over U... [14:31:10] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) >>! 
In T319067#8290850, @MoritzMuehlenhoff wrote: > I'll take care of "Create a buster-based 4.19+5.10 boot image " tomorrow. Thank you! [14:31:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:56] (03PS1) 10MVernon: swift: restore ms-be1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/839591 (https://phabricator.wikimedia.org/T307667) [14:32:56] (03CR) 10Volans: [C: 03+2] sre.dns.wipe-cache: add sudo to the command [cookbooks] - 10https://gerrit.wikimedia.org/r/839474 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [14:34:49] vgutierrez: yeah turns out I already filed a task for that https://phabricator.wikimedia.org/T273605 [14:56:13] !log eqiad front edge depooled in DNS [14:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:24] takes 10 minutes or so to take full effect anyways [14:56:30] as usual [14:56:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:42] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:56:45] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:50] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:56:50] no recovery yet on the status page graphs, https://grafana.wikimedia.org/d/3u6RLsL7k/status-page?orgId=1&from=now-1h&to=now [14:56:57] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:57:04] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:57:22] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [14:57:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:58:00] we could also depool that side of A/A at the mediawiki level [14:58:03] (ProbeDown) resolved: (6) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:05] ats-be 5xx are going back to normal [14:58:14] https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?orgId=1&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&viewPanel=14 [14:58:17] otherwise e.g. 
drmrs+esams traffic are still hitting mw in eqiad too [14:58:19] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:26] but if it's resolving, no point [14:58:31] (03CR) 10BCornwall: [C: 03+2] prometheus: Remove ATS 8-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/838886 (owner: 10BCornwall) [14:58:31] bblack: yes but unless we failover writes will still go all to eqiad [14:58:33] looks to be resolving [14:58:34] I'm wondering why it took so long to recover [14:58:44] volans: yeah but we could save the reads! :) [14:59:01] but looks like row D lost connectivity during the interface move, while it shouldn't have [14:59:08] not nice [14:59:12] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [14:59:23] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:00:52] graphs looks mostly at the recovered values [15:01:19] (03CR) 10BCornwall: [C: 03+2] prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [15:01:19] 
10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10Papaul) 05Open→03Resolved disk replaced [15:01:32] yeah, looks like full recovery [15:01:38] (03PS2) 10BCornwall: prometheus: Add records for ATS percent usage [puppet] - 10https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815) [15:01:42] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [15:01:54] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1008.eqiad.wmnet [15:01:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:06] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: restore ms-be1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/839591 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [15:02:25] we will continue in a future window [15:02:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:02:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:03:30] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:03:35] XioNoX: so we're stable for now? [15:03:58] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [15:04:30] bblack: yeah everything is 100% back to normal on the network side [15:04:42] ok, any objection to reverting the dns depool? [15:04:48] no objection [15:04:54] +1 [15:05:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [15:05:19] (03PS1) 10BBlack: Revert "depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/839567 [15:05:28] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/839567 (owner: 10BBlack) [15:05:34] (03CR) 10Elukey: [C: 03+2] admin_ng: set proper TLS egress origination settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839587 (owner: 10Elukey) [15:06:18] the tldr, is that disabling an interface caused traffic to be blackholed instead of failing over to the other interface, I believe things would have converged eventually or even did converge before the rollback [15:06:34] I'll write an incident report [15:07:01] blackholes are the worst for fast failover :( [15:07:17] there was a dbproxy failover [15:07:32] 2 actually [15:07:32] (03CR) 10BCornwall: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [15:07:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - 
https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:07:56] jynus: ah? which ones? [15:07:59] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [15:08:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:08:20] dbproxy1016 and dbproxy1017, not sure what service they are and what they point to [15:08:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:08:31] the funny thing is that this change is much less risky than the router upgrades we did the previous weeks :) [15:08:39] (03CR) 10Jforrester: scap/dsh: remove parsoid service, replaced by parsoid-php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [15:09:00] jinxer-wm: looks like they're both in row D so that makes sense [15:09:05] er jynus ^ [15:09:16] it's all jinxer-wm's fault [15:09:27] db1159 is considered down [15:09:43] jinxer-wm: still? [15:09:45] er! [15:09:46] that is m3 [15:09:49] jynus: [15:09:51] (phabricator) [15:10:01] not sure if it's the active one [15:10:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:10:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1008.eqiad.wmnet [15:10:46] I think not, dbproxy1020 was active, dbproxy1016 was passive [15:11:05] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts aqs1009.eqiad.wmnet [15:11:39] checking now dbproxy1017 [15:12:35] anything we can do to help? [15:12:51] by chance also passive! 
:-D [15:12:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:12:57] yay [15:13:11] if dbproxy1021, m5 db would have been down/read only [15:13:55] XioNoX: if confirmed no more changes affecting that, I will reload the proxy config (it doesn't reconnect to the original dbs to avoid flapping) [15:14:07] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:14:13] (03PS1) 10Volans: mypy: remove upper limit [software/spicerack] - 10https://gerrit.wikimedia.org/r/839602 [15:15:08] jynus: yeah, everything is back to normal [15:15:23] ok, will log and reload config on those proxies [15:16:58] !log reload haproxy config on dbproxy1016, dbproxy1017 [15:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) Row C got moved to the new linecards with no issues, but moving cr1<->row D caused an outage. As row C cleanup, @Jclark-ctr can you rem... 
[15:17:25] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [15:18:03] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:18:03] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:19:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:19:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1009.eqiad.wmnet [15:21:09] (03CR) 10MVernon: [C: 03+2] swift: restore ms-be1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/839591 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [15:21:18] https://grafana.wikimedia.org/goto/nnetPwV4k?orgId=1 [15:21:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:22:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:23:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow SRE to send annotated and signed tags [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar) [15:24:58] 10Puppet, 10Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (10dancy) [15:26:54] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) > With this method d-i will only see the two SSD disks and as such will have no way... 
[15:27:53] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder) [15:28:03] 10Puppet, 10Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (10dancy) [15:28:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bullseye [15:28:41] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [15:29:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) @aborrero thanks. Reading briefly through the docs I have a better u... 
[15:31:43] (03PS1) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) [15:32:20] (03PS2) 10Elukey: ml-services: move eventgate config to TLS egress origination [deployment-charts] - 10https://gerrit.wikimedia.org/r/839588 [15:32:22] (03PS1) 10Elukey: admin_ng: fix eventgate's egress TLS origin config on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839626 [15:32:35] (03CR) 10CI reject: [V: 04-1] Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [15:35:47] (03PS2) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) [15:35:56] (03PS1) 10Volans: sre.hosts.reimage: support different installers [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) [15:36:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero) >>! In T319539#8291916, @cmooney wrote: > I gather the hypervisor ho... [15:38:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10Volans) > * Add support for it (it being whatever it takes to switch to 5.10) to the reimage cookbook stuff @BBlack the above patch should have all that... 
[15:38:48] (03CR) 10Elukey: [C: 03+2] admin_ng: fix eventgate's egress TLS origin config on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/839626 (owner: 10Elukey) [15:39:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37477/console" [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [15:41:47] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:42:00] :_) [15:42:09] * volans still here although not paged for 12 minutes [15:42:22] vgutierrez: happy oncall [15:42:26] * jhathaway here as well [15:43:22] quite some spikes https://librenms.wikimedia.org/graphs/device=140/type=device_bits/from=1664984587/legend=yes/popup_title=Device+Traffic/ [15:43:29] reaching the 10G [15:43:31] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1003/1004: make them spare servers [puppet] - 10https://gerrit.wikimedia.org/r/839628 (https://phabricator.wikimedia.org/T319682) [15:44:26] that's Arelion [15:44:37] https://librenms.wikimedia.org/device/device=140/tab=port/port=16840/ [15:44:38] (03CR) 10Andrew Bogott: [C: 03+1] cloudnet1003/1004: make them spare servers [puppet] - 10https://gerrit.wikimedia.org/r/839628 (https://phabricator.wikimedia.org/T319682) (owner: 10Arturo Borrero Gonzalez) [15:44:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:45:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:45:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) > But we do have keepalived running on cloudgw servers. So we may wan... [15:45:46] volans: also being discussed in #wikimedia-sre [15:45:55] (03CR) 10Elukey: [C: 03+2] ml-services: move eventgate config to TLS egress origination [deployment-charts] - 10https://gerrit.wikimedia.org/r/839588 (owner: 10Elukey) [15:46:47] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:47:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [15:47:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1003/1004: make them spare servers [puppet] - 10https://gerrit.wikimedia.org/r/839628 (https://phabricator.wikimedia.org/T319682) (owner: 10Arturo Borrero Gonzalez) [15:47:50] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye [15:48:01] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1004.eqiad.wmnet with OS bullseye [15:49:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[15:51:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:51:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:52:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:52:58] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:53:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:54:55] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10akosiaris) Hi! >>! In T301505#8240830, @Novem_Linguae wrote: > In general, shouldn't phabricator tickets be one ticket = one cause? This one seems like i... [15:56:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:57:53] !log Applying explicit BFD mode configuration to cr4-ulsfo for Anycast BGP groups. [15:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:57] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:21] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:07] (03CR) 10AOkoth: [C: 03+1] lower TTL for phabricator from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [16:05:19] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) This has been completed smoothly! I deleted the following VC cables from Netbox: 0315 0316 0317 0318 0320 Please... [16:06:56] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) 05Open→03Resolved a:03ayounsi Sub-task completed successfully nothing more to do here. [16:09:15] (03PS1) 10Cathal Mooney: Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - 10https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) [16:09:16] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Arnoldokoth) Ooh, sorry I missed that step. I have added you to the wmf-nda group as well. Thanks @Aklapper [16:10:10] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Arnoldokoth) I have added to the wmf-nda group as well. 
Thanks @Aklapper [16:18:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [16:21:00] (03PS2) 10JMeybohm: Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) [16:21:33] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:26:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [16:27:42] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [16:34:02] (03CR) 10JMeybohm: Add a spark-operator production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:36:33] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:36:55] (03CR) 10Btullis: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:43:38] (03CR) 10Vlad.shapik: Update the logic to run test coverage (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [16:45:55] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks 
from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) Also looks like the optic or fiber needs to be replaced, error rate is high: https://librenms.wikimedia.org/device/device=162/tab=port/p... [16:50:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) Diff if the above patch is merged (running from my laptop with updated template): ` Changes for 8 devices: ['c... [16:53:16] (03CR) 10Vlad.shapik: [C: 03+1] Add missing prod dependencies [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/839548 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:57:08] (03PS3) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) [16:57:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder) [17:00:03] (03CR) 10Btullis: "Note that the Cassandra 3 cluster is still using a role called aqs_next - which is why it's safe to delete the aqs role. I will rename the" [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [17:00:05] bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1700) [17:08:27] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:03] (03PS1) 10Ssingh: Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - 10https://gerrit.wikimedia.org/r/839571 [17:12:43] (03CR) 10CI reject: [V: 04-1] Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - 10https://gerrit.wikimedia.org/r/839571 (owner: 10Ssingh) [17:14:08] (03PS2) 10Ssingh: Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - 10https://gerrit.wikimedia.org/r/839571 [17:15:12] (03CR) 10Btullis: Add a spark-operator production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [17:16:17] (03CR) 10Ssingh: [C: 03+2] Revert "bird: temporarily disable validate_cmd for bird.conf" [puppet] - 10https://gerrit.wikimedia.org/r/839571 (owner: 10Ssingh) [17:16:44] (03CR) 10Btullis: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [17:22:49] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:27:20] (03CR) 10Dzahn: [C: 03+1] Remove gerrit2001 from deployment targets [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/839551 (https://phabricator.wikimedia.org/T243027) (owner: 10Hashar) [17:28:58] (03CR) 10Dzahn: [C: 03+2] lower TTL for phabricator from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:29:44] (03CR) 10Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 (031 
comment) [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:29:48] (03CR) 10Slyngshede: [C: 03+1] "Tested on an M1 with Python 3.10 and looks good:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/839602 (owner: 10Volans) [17:30:16] (03CR) 10Dzahn: [C: 03+2] lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:30:19] (03PS2) 10Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) [17:31:23] (03CR) 10FNegri: [C: 03+1] "LGTM!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838835 (https://phabricator.wikimedia.org/T309786) (owner: 10David Caro) [17:31:52] (03CR) 10Ssingh: [C: 03+1] "Thanks for working on this and fixing it in the configuration!" [homer/public] - 10https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: 10Cathal Mooney) [17:32:08] (03CR) 10Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:32:39] (03PS2) 10Dzahn: lower TTL for phabricator from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319) [17:36:22] (03PS1) 10Dzahn: lower TTL for gitlab-replicas from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/839665 (https://phabricator.wikimedia.org/T315319) [17:36:41] (03CR) 10Dzahn: [C: 03+2] lower TTL for gerrit,gerrit-replica from 600 to 300 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:37:43] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:38:47] (03CR) 
10Ssingh: [C: 03+1] lower TTL for gitlab-replicas from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/839665 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:42:04] (03CR) 10Dzahn: [C: 03+2] lower TTL for gitlab-replicas from 600 to 300 [dns] - 10https://gerrit.wikimedia.org/r/839665 (https://phabricator.wikimedia.org/T315319) (owner: 10Dzahn) [17:45:29] (03CR) 10Volans: [C: 03+2] mypy: remove upper limit [software/spicerack] - 10https://gerrit.wikimedia.org/r/839602 (owner: 10Volans) [17:45:51] 10SRE, 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Dzahn) a:03Jgreen [17:46:09] 10SRE, 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Dzahn) 05Open→03In progress [17:46:40] 10SRE, 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Dzahn) p:05Unbreak!→03High [17:47:53] 10SRE, 10SRE-Access-Requests: Please add eigyan to Restricted Group - https://phabricator.wikimedia.org/T318983 (10Dzahn) [17:49:43] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10Dzahn) [17:49:57] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10Dzahn) p:05Triage→03Medium [17:50:53] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10Dzahn) @Arnoldokoth This is existing shell user `essexigyan` but an additional group. 
[17:52:21] (03Merged) 10jenkins-bot: mypy: remove upper limit [software/spicerack] - 10https://gerrit.wikimedia.org/r/839602 (owner: 10Volans) [17:54:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10Arnoldokoth) Hey @AnnWF Kindly sign this https://phabricator.wikimedia.org/L3 Will also need approval from @Ottomata / @odimitrijevic and Dylan Kozlowski (I can't seem... [17:55:35] (03PS8) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [17:58:41] 10SRE, 10SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (10Dzahn) > Originally had I asked Simulo to file a new NDA after their transition to a volunteer role, unfortunately this volunteer onboarding isn't as simple as I had hoped. @... [18:00:05] ^demon and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T1800). nyaa~ [18:00:31] 10SRE, 10SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (10Dzahn) a:03awight Could you get the approval from a manager of some type? Meanwhile Katie can reach out directly to @Simulo (@Simulo, she will need your email address, you c... 
[18:04:25] (03PS9) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [18:05:12] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10Dzahn) p:05Triage→03Medium a:03Devnull [18:07:16] 10SRE, 10SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (10Dzahn) p:05Triage→03Medium [18:08:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Dzahn) 05Open→03In progress [18:09:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10Dzahn) 05Open→03In progress [18:09:39] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10Arnoldokoth) [18:09:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10Dzahn) a:03AnnWF [18:10:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10Arnoldokoth) 05Open→03In progress [18:12:19] 10SRE, 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Papaul) 05In progress→03Resolved This was fixed [18:13:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1061-production-search-omega-eqiad is showing memory pressure in the young pool - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:14:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Arnoldokoth) a:03karapayneWMDE [18:22:06] (03CR) 10Eevans: [C: 03+1] Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [18:24:12] (03PS1) 10Jdlrobson: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) [18:29:02] (03Abandoned) 10Ryan Kemper: [wip] logstash: remove old files [puppet] - 10https://gerrit.wikimedia.org/r/838255 (owner: 10Ryan Kemper) [18:29:06] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10Dzahn) [[ https://en.wikipedia.org/wiki/Janus | Janus ]] because of the 2 faces and it's what you get when you search for "Greek god of identity" and this is managing identities. 
[18:29:56] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet1003.eqiad.wmnet [18:30:44] (03PS1) 10AOkoth: admin: add hshaikh and ptiwary to private-data users [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) [18:31:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10Arnoldokoth) p:05Triage→03Medium [18:35:29] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [18:39:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1003.eqiad.wmnet [18:42:01] (03CR) 10Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [18:42:48] (03PS1) 10Ryan Kemper: elastic: replace 2 codfw masters to be decom'd [puppet] - 10https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) [18:43:07] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:44:03] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: 10Ryan Kemper) [18:44:10] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37478/console" [puppet] - 10https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: 10Ryan Kemper) [18:44:42] (03CR) 10Bking: [C: 03+1] elastic: replace 2 codfw masters to be decom'd [puppet] - 10https://gerrit.wikimedia.org/r/839668 
(https://phabricator.wikimedia.org/T313431) (owner: 10Ryan Kemper) [18:44:48] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:45:01] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01926 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:45:39] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:46:57] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: replace 2 codfw masters to be decom'd [puppet] - 10https://gerrit.wikimedia.org/r/839668 (https://phabricator.wikimedia.org/T313431) (owner: 10Ryan Kemper) [18:47:01] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:47:07] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:47:07] (03CR) 10Dzahn: "if they really need shell access then this patch looks good to me. but the ticket said "might need" and that seemed a little weak. maybe a" [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [18:47:51] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10RobH) dns4003 appears to be pushed fully into service (thanks @ssingh!) 
With that now seeming all green in icinga & confirmed with @BBlack , I'll move ahead and take down/decom dns4002 next tim... [18:47:57] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:21] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:35] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Jclark-ctr) Can this be changed at any time? I will work on netbox updates when not in data center [18:50:14] !log gehel@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on elastic2061.codfw.wmnet with reason: restarting for config reload - T313431 [18:50:18] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [18:50:29] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on elastic2061.codfw.wmnet with reason: restarting for config reload - T313431 [18:50:37] !log gehel@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on elastic2084.codfw.wmnet with reason: restarting for config reload - T313431 [18:50:45] PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:47] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:48] PROBLEM - nova-compute proc minimum on 
cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:48] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:49] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:49] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:51] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:52] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:04] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on elastic2084.codfw.wmnet with reason: restarting for config reload - T313431 [18:51:20] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:41] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:42] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:47] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:55] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:57] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:58] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:58] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:59] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:52:22] !log gehel@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on elastic[2025,2031].codfw.wmnet with reason: restarting for config reload - T313431 [18:52:27] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:52:39] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on elastic[2025,2031].codfw.wmnet with reason: restarting for config reload - T313431 [18:52:49] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:11] RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:13] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:14] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:14] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:15] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:17] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:17] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS 
OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:18] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:45] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:05] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:06] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:11] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:19] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:21] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:22] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:23] RECOVERY - nova-compute proc 
minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:24] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:55:14] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [19:00:25] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002963 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:00:50] sorry about that noise! 
I think things are all recovered/recovering now [19:01:40] andrewbogott: thanks [19:03:20] !log 'bking@elastic restarted elastic2025, 2031, 2061, 2084 T313431 [19:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:24] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [19:07:21] (03CR) 10Jdlrobson: Automate icon generation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [19:15:04] !log train 1.40.0-wmf.4 (T314193) no current blockers, rolling train to all wikis [19:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:08] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [19:15:48] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839672 (https://phabricator.wikimedia.org/T314193) [19:15:52] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839672 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [19:16:16] (03PS1) 10Bking: elastic: raise master-eligibles from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/839673 (https://phabricator.wikimedia.org/T313431) [19:16:41] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839672 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [19:18:33] (03PS2) 10Bking: elastic: raise master-eligibles from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/839673 (https://phabricator.wikimedia.org/T313431) [19:20:25] (03Abandoned) 10Bking: elastic: raise master-eligibles from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/839673 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:20:59] !log brennen@deploy1002 rebuilt and synchronized wikiversions 
files: all wikis to 1.40.0-wmf.4 refs T314193 [19:21:03] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [19:21:24] (03CR) 10Bking: [C: 03+1] elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:21:51] (03PS4) 10Bking: elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:22:06] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839675 (https://phabricator.wikimedia.org/T314193) [19:22:08] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839675 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [19:22:16] ...eh, rolling this back to group1 and filing some tickets. [19:23:07] (03CR) 10Bking: [C: 03+2] elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:23:51] The spike in "This Title instance does not represent a proper page, but merely a link target."? 
[19:24:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:24:17] (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839675 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [19:24:35] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 154 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:25:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:25:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:25:25] James_F: yeah, also just noticed a bunch of `Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array` [19:25:41] But not new with the train? [19:26:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:26:19] Or maybe new on .4 but found on group1. [19:26:31] Fun times. [19:27:09] The invalid titles might be T292552 ? [19:27:09] T292552: Rename articles and users to prepare for PHP 7.3 unicode changes - https://phabricator.wikimedia.org/T292552 [19:27:14] (That's not been run yet.) [19:27:32] But I don't know of anything being intentionally merged that expected that to have been done. [19:28:15] I would have guessed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/828553 for those proper page errors [19:28:32] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.3 refs T314193 [19:28:34] Oh, hmm, could well be. 
[19:28:36] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [19:28:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: T313431 [19:28:57] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [19:29:00] zabe: Good find. [19:29:05] i filed T319798; input welcome there [19:29:05] T319798: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target. - https://phabricator.wikimedia.org/T319798 [19:29:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: T313431 [19:29:56] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:30:04] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:31:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:32:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:32:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:33:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:34:11] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@cbdc509]: (no justification provided) [19:34:25] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@cbdc509]: (no justification provided) (duration: 00m 14s) [19:36:59] blocking on T314193 as well. 
[19:37:00] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [19:37:31] 10SRE, 10Data Engineering Planning, 10Data Pipelines, 10Foundational Technology Requests, 10User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10Ottomata) Cool! Is `this.ip.geoip_asn` built into benthos or did you provide it somehow? [19:38:31] (er, T319799) [19:38:31] T319799: TypeError: Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array, object given - https://phabricator.wikimedia.org/T319799 [19:39:51] (03PS2) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [19:39:53] (03PS1) 10Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) [19:40:37] (03CR) 10CI reject: [V: 04-1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [19:40:49] (03CR) 10CI reject: [V: 04-1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [19:41:12] !log deployed airflow to fix projectview_hourly_dag [19:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:25] (03PS1) 10Samtar: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) [19:46:04] (03PS1) 10Jdlrobson: Flag when projects are missing wordmarks or icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 [19:47:10] (03CR) 10CI reject: [V: 04-1] Flag when projects are 
missing wordmarks or icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 (owner: 10Jdlrobson) [19:50:44] !log killed Oozie projectview-hourly job [19:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:07] !log Started airflow projectview_hourly_dag [19:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:29] (03PS1) 10Jdlrobson: ReadingLists on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) [19:51:45] (03CR) 10CI reject: [V: 04-1] ReadingLists on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson) [19:53:28] (03PS1) 10Samtar: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) [19:57:15] (03PS2) 10Jdlrobson: ReadingLists on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) [20:00:04] brennen and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221006T2000). [20:00:04] Jdlrobson, TheresNoTime, chlod, and NovemLinguae: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] * TheresNoTime can deploy ^^ [20:00:35] * urbanecm waves too [20:01:22] oh hey urbanecm, can you quickly double-check that my idea to backport https://gerrit.wikimedia.org/r/c/839575/ (for .3 and .4) is okay? [20:02:06] Jdlrobson: you around? 
Going to start with https://gerrit.wikimedia.org/r/c/839572/ :) [20:02:12] present [20:02:13] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:02:22] TheresNoTime: at first sight, sgtm! [20:02:23] TheresNoTime: sounds good! thanks [20:02:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) (owner: 10Jdlrobson) [20:02:54] (ty urbanecm) [20:03:02] no problem [20:03:25] TheresNoTime: heyo! Can we steal deployment of a patch for training purposes :) [20:03:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:03:47] thcipriani: sure! https://gerrit.wikimedia.org/r/c/839684/ is up next if you want that one? [20:04:02] sure, thank you <3 [20:04:23] my second change is beta cluster only thcipriani if you wanted to try your tool again [20:04:48] Jdlrobson: just double checking, https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/838899 doesn't need to be backported to fix T319396 in production? [20:04:49] T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396 [20:05:00] thcipriani: the current one merging is a mw core, so you've got ~15 minutes if you want me to cancel? [20:05:03] urbanecm: correct [20:05:10] okay, great! [20:05:18] s/cancel/let that merge [20:05:38] cool, thank you! 
[20:05:43] !log samtar@deploy1002 backport aborted: (duration: 03m 13s) [20:05:47] okie doke, merging the beta one [20:06:01] ack, all yours [20:06:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson) [20:07:44] (03Merged) 10jenkins-bot: ReadingLists on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839684 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson) [20:10:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:56] TheresNoTime: Jdlrobson all done! Made a good demo. Want me to do any more? [20:14:23] thcipriani: you can pick up https://gerrit.wikimedia.org/r/c/mediawiki/core/+/839572 if you want? [20:14:26] almost merged [20:14:33] (though you'll miss the +2ing part) [20:14:53] yeh just waiting on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/839572 and that's me done thanks [20:15:03] TheresNoTime: sure [20:16:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:46] TheresNoTime: feel free to +2 your own changes now, so you're not waiting forever [20:20:42] (03CR) 10Samtar: [C: 03+2] "deploy" [extensions/PageTriage] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:20:49] (03CR) 10Samtar: [C: 03+2] "deploy" [extensions/PageTriage] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:22:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:04] (03Merged) 10jenkins-bot: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) (owner: 10Jdlrobson) [20:22:49] thcipriani: good call, thanks - ^ has merged now :) [20:24:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839572 (https://phabricator.wikimedia.org/T319396) (owner: 10Jdlrobson) [20:24:44] (03PS1) 10Hashar: gerrit: use 2 threads to replicate to GitHub [puppet] - 10https://gerrit.wikimedia.org/r/839694 [20:24:54] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:839572|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] [20:24:55] oh, it +2s again -- TIL [20:24:59] T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396 [20:25:09] thcipriani: is the train blocked? I just noticed eswiki where I need to test is on wm4 [20:25:16] wmf3 rather [20:25:18] !log thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:839572|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:25:18] than wmf4 [20:25:33] (Just wondering if I need to backport this to wmf3 as well) [20:25:43] yeah wikipedias are still on wmf.3: https://versions.toolforge.org/ [20:26:05] Is it likely to stay that way until Monday? [20:26:21] If so I guess I need to backport this to wmf3 as well (sorry) [20:26:43] (03CR) 10Hashar: "I have made the replica to use 4 threads in May with I172557bfbca4cf5bb8321cecafc7bc84f60abc5d / T307137." [puppet] - 10https://gerrit.wikimedia.org/r/839694 (owner: 10Hashar) [20:26:47] I believe that's being worked on, but it is getting late in the day. 
It'll probably be fixed by Monday, but I'm never 100% [20:26:53] (03Merged) 10jenkins-bot: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:26:57] (03PS1) 10Jdlrobson: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) [20:27:00] okay ill add this to the deployment calendar ^ [20:27:12] TheresNoTime: feel free to do yours first [20:27:16] Jdlrobson: any way to check this on non wikipedia wikis? [20:27:23] probably... [20:27:27] im looking at group 1 wikis now [20:27:33] it's live on mwdebug on group0/1 now :) [20:27:47] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet1003.eqiad.wmnet [20:27:50] (I'll wait to hear) [20:27:52] yep i can test on euwiki [20:28:01] cool :) [20:28:35] any of the debug servers thcipriani ? [20:28:51] (03Merged) 10jenkins-bot: Replace promise handling when AfD'ing pages [extensions/PageTriage] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:28:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:06] Jdlrobson: yep, all of them have it [20:29:22] Fix confirmed on itwiki [20:29:26] feel free to sync! [20:30:19] okay wmf3 change is on the calendar now. 
Let me know when it's a good time [20:30:22] thanks Jdlrobson [20:30:29] going live [20:31:03] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:23] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:33:50] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:33:51] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1003.eqiad.wmnet [20:34:45] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:839572|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] (duration: 09m 51s) [20:34:48] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10MoritzMuehlenhoff) >>! In T319409#8292457, @Dzahn wrote: > [[ https://en.wikipedia.org/wiki/Janus | Janus ]] because of the 2 faces and it's what you get when you search for "Greek god of identity" and... [20:34:49] T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396 [20:34:58] Jdlrobson: ^ should be live [20:35:21] yep! [20:35:21] TheresNoTime: please feel free to sync your changes [20:35:28] thcipriani: thanks :) [20:35:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:35:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_check_icinga_contacts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:45] I'll finish Jdlrobson 's patch after that [20:35:51] sounds good! 
[20:35:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:36:36] !log samtar@deploy1002 Backport cancelled. [20:36:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/839575 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:37:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839576 (https://phabricator.wikimedia.org/T238025) (owner: 10Samtar) [20:37:14] !log samtar@deploy1002 Started scap: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] [20:37:18] T238025: Page Curation fails to create AFD page - https://phabricator.wikimedia.org/T238025 [20:37:37] !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:37:51] (testing) [20:37:52] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder) [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:39:14] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet1004.eqiad.wmnet [20:40:44] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:41:15] (syncing) [20:41:24] (03PS3) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [20:41:35] (03PS4) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [20:42:33] (03CR) 10CI reject: [V: 04-1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:44:28] (03CR) 10Thcipriani: [C: 03+2] Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) (owner: 10Jdlrobson) [20:44:36] * thcipriani gets that cooking [20:45:11] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] (duration: 07m 56s) [20:45:13] (03CR) 10Legoktm: "Conceptually +1 to this, though I think we should be consistent across PHP versions if we're adding packages, so I would've -1'd this as w" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435) (owner: 10BryanDavis) [20:45:15] T238025: Page Curation fails to create AFD page - https://phabricator.wikimedia.org/T238025 [20:45:48] thcipriani: all yours :) [20:46:20] thanks TheresNoTime [20:47:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:47:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:55:46] (03CR) 10Xcollazo: [C: 03+1] "Change looks ok to me (cursory check though as I'm 
unfamiliar with codebase)." [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) (owner: 10Ottomata) [20:57:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10KFrancis) >>! In T308013#8282942, @jbond wrote: > @QChris thanks for the contribution and reaching out. > >>>! In T308013#8282636, @QChris wrote: >> While I fully support... [20:57:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:58:21] (03PS5) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [20:58:23] (03PS2) 10Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) [20:58:25] (03PS2) 10Jdlrobson: Flag when projects are missing wordmarks or icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 [20:58:27] (03PS1) 10Jdlrobson: DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) [20:58:49] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:59:26] (03Merged) 10jenkins-bot: Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) (owner: 10Jdlrobson) [20:59:28] (03CR) 10CI reject: [V: 04-1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:59:41] (03CR) 10CI reject: [V: 04-1] Move wordmarks and taglines from 
InitialiseSettings.php to yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:59:52] (03CR) 10CI reject: [V: 04-1] DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:59:59] (03CR) 10CI reject: [V: 04-1] Flag when projects are missing wordmarks or icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 (owner: 10Jdlrobson) [21:00:53] \o/ merged [21:01:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/839577 (https://phabricator.wikimedia.org/T319396) (owner: 10Jdlrobson) [21:01:40] yeyyya [21:01:51] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:839577|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] [21:01:56] T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396 [21:02:14] !log thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:839577|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:02:43] Jdlrobson: alright, ^ should be on any of the mwdebug servers, check please [21:02:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:02:57] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1004.eqiad.wmnet [21:03:24] thcipriani: looking [21:03:51] yep that did it! let's sync [21:03:57] great! 
going [21:08:00] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:839577|Skin: Map namespaces to associated pages inside runOnSkinTemplateNavigationHooks (T319396)]] (duration: 06m 08s) [21:08:04] T319396: Either newcomer homepage or userpage/talk page are not displayed on mobile - https://phabricator.wikimedia.org/T319396 [21:08:20] ^ Jdlrobson all sync'd [21:09:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:09:30] thcipriani: thanks! [21:09:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:09:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:11:21] (03PS6) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [21:11:23] (03PS3) 10Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) [21:11:25] (03PS3) 10Jdlrobson: Flag when projects are missing wordmarks or icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 [21:11:27] (03PS2) 10Jdlrobson: DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) [21:11:33] (03CR) 10Jdlrobson: [C: 04-1] "I still need to handle redirects in this one (using symlinks ln -s)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [21:12:07] (03CR) 10CI reject: [V: 04-1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [21:12:20] (03CR) 10CI reject: [V: 04-1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [21:12:50] (03CR) 10CI reject: [V: 04-1] Flag when projects are missing wordmarks or icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 (owner: 10Jdlrobson) [21:12:52] (03CR) 10CI reject: [V: 04-1] DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [21:13:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:37] 10SRE, 10Infrastructure-Foundations: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (10bd808) #striker and/or #toolhub may have things that are worth copying for you here. #striker especially has a [[https://gerrit.wikimedia.org/r/plugins/gitiles/labs/striker/+/refs/heads/master/contr... [21:14:20] (03PS7) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [21:14:59] (03PS1) 10Andrew Bogott: Remove refs to cloudnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/839706 (https://phabricator.wikimedia.org/T319682) [21:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:16:12] (03CR) 10Andrew Bogott: [C: 03+2] Remove refs to cloudnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/839706 (https://phabricator.wikimedia.org/T319682) (owner: 10Andrew Bogott) [21:18:30] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (10Andrew) 
a:03Cmjohnson [21:19:02] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (10Andrew) a:03Cmjohnson [21:20:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:24:56] (03PS4) 10Dduvall: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) [21:28:02] (03CR) 10Dduvall: "Did quite a bit of refactoring and incorporated your feedback. I hope I didn't bloat the patch too much with the extra type definitions, b" [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [22:08:45] PROBLEM - ElasticSearch setting check - 9600 on elastic1073 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:08:45] PROBLEM - ElasticSearch setting check - 9600 on elastic2080 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. 
[22:08:45] /Search%23Administration [22:11:40] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (10Volans) >>! In T319682#8294169, @Andrew wrote: > cc @Volans regarding the failure to wipe the drives. Feel free to investigate/rerun this if yo... [22:13:13] PROBLEM - ElasticSearch setting check - 9600 on elastic2075 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. [22:13:13] /Search%23Administration [22:13:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:15:03] PROBLEM - ElasticSearch setting check - 9400 on elastic2073 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. 
[22:15:03] /Search%23Administration [22:15:49] PROBLEM - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:15:49] PROBLEM - ElasticSearch setting check - 9400 on elastic1068 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:15:49] PROBLEM - ElasticSearch setting check - 9400 on elastic1057 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:15:51] PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. 
[22:15:51] /Search%23Administration [22:23:53] PROBLEM - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. [22:23:53] /Search%23Administration [22:23:57] PROBLEM - ElasticSearch setting check - 9400 on elastic2086 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. [22:23:57] /Search%23Administration [22:23:57] PROBLEM - ElasticSearch setting check - 9400 on elastic2052 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. 
[22:23:57] /Search%23Administration [22:26:37] PROBLEM - ElasticSearch setting check - 9200 on elastic1054 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:01] PROBLEM - ElasticSearch setting check - 9200 on elastic1081 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:03] PROBLEM - ElasticSearch setting check - 9200 on elastic1074 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:05] PROBLEM - ElasticSearch setting check - 9600 on elastic2054 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. 
[22:29:05] /Search%23Administration [22:33:47] PROBLEM - ElasticSearch setting check - 9600 on elastic1083 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:33:49] PROBLEM - ElasticSearch setting check - 9600 on elastic2083 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. [22:33:49] /Search%23Administration [22:33:51] PROBLEM - ElasticSearch setting check - 9600 on elastic2076 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] does not match [elastic2042.codfw.wmnet:9300, elastic2061.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300, elastic2084.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia. 
[22:33:51] /Search%23Administration [22:34:19] PROBLEM - ElasticSearch setting check - 9400 on elastic1076 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:53:23] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:55:41] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:57:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [23:10:56] (03PS1) 10BryanDavis: Use explicit 'latest' tags on upstream base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) [23:47:53] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder)