[00:36:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:27] jouncebot: nowandnext
[00:56:27] No deployments scheduled for the next 2 hour(s) and 3 minute(s)
[00:56:27] In 2 hour(s) and 3 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0300)
[00:56:58] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890173 (https://phabricator.wikimedia.org/T330015) (owner: Urbanecm)
[00:57:33] (Merged) jenkins-bot: cswikibooks: Enable visualeditor for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/890173 (https://phabricator.wikimedia.org/T330015) (owner: Urbanecm)
[00:57:50] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:890173|cswikibooks: Enable visualeditor for all users (T330015)]]
[00:57:54] T330015: Enable VisualEditor by default on cs.wikibooks.org - https://phabricator.wikimedia.org/T330015
[00:59:25] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:890173|cswikibooks: Enable visualeditor for all users (T330015)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[01:06:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:890173|cswikibooks: Enable visualeditor for all users (T330015)]] (duration: 08m 47s)
[01:06:42] T330015: Enable VisualEditor by default on cs.wikibooks.org - https://phabricator.wikimedia.org/T330015
[01:07:34] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (phaultfinder)
[01:48:46] (CR) Legoktm: [C: +1] Remove wgLinterSubmitterWhitelist [mediawiki-config] -
https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) (owner: Arlolra)
[01:49:02] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:53:03] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:58:22] (PS1) Legoktm: varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787)
[02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:02] (PS1) Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022)
[02:30:11] (CR) Andrew Bogott: "PCC results:" [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0300)
[03:07:50] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.24 [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890352 (https://phabricator.wikimedia.org/T325587)
[03:07:56] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.24 [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890352
(https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[03:23:00] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.24 [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890352 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0400)
[04:05:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:18] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[05:05:02] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[05:53:46] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[05:57:48] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[06:09:20] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0700)
[07:00:04] kormat,
marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0700).
[07:45:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:45:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:46:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:46:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:34] !log Staging the new Junos version on the codfw row B switches - T327991
[07:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:39] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[07:54:09] (PS2) KartikMistry: Section Translation: Fix language code for Cantonese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865)
[07:57:30] (Emergency syslog message) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[07:58:15] unexpected but seems link impactless ^
[07:59:28] (PS3) Muehlenhoff: sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661)
[08:00:04] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0800).
[08:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:45] * kart_ is here
[08:00:56] I'll go ahead with self-deploy..
[08:02:30] (Emergency syslog message) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[08:03:37] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865) (owner: KartikMistry)
[08:04:17] (Merged) jenkins-bot: Section Translation: Fix language code for Cantonese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865) (owner: KartikMistry)
[08:04:48] !log kartik@deploy1002 Started scap: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]]
[08:04:52] T304865: Enable Content and Section Translation for Cantonese Wikipedia - https://phabricator.wikimedia.org/T304865
[08:04:59] (CR) Alexandros Kosiaris: admin_ng: Update wikikube-codfw settings to k8s 1.23 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[08:05:04] (CR) Alexandros Kosiaris: [C: +1] admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[08:07:46] (CR) Muehlenhoff: [C: +2] sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661) (owner: Muehlenhoff)
[08:09:22] (CR)
Alexandros Kosiaris: [C: +1] "+1, but 1 minor comment regarding the order of things." [puppet] - https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[08:09:33] !log kartik@deploy1002 kartik: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:21:24] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]] (duration: 16m 36s)
[08:21:29] T304865: Enable Content and Section Translation for Cantonese Wikipedia - https://phabricator.wikimedia.org/T304865
[08:23:09] (CR) Slyngshede: P:idm configure production IDM (5 comments) [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:23:16] (PS33) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[08:23:25] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/890772 (https://phabricator.wikimedia.org/T325587)
[08:23:27] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/890772 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[08:24:00] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/890772 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[08:24:22] !log hashar@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.24 refs T325587
[08:24:26] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[08:29:41] (CR) Slyngshede: [C: +2] P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:36:20] (CR) Phedenskog:
[C: +1] "I prepared a test we shoot asap when this is merged, and then I'll clean it up when you tell me you are done Nicholas." [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: Nray)
[08:43:32] I kind of forgot about the backport window :(
[08:49:59] !log installing clamav security updates
[08:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:15] (CR) Klausman: [C: +2] ml-services: outlink model upgrade debian and python [deployment-charts] - https://gerrit.wikimedia.org/r/890471 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[08:56:53] (PS4) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[08:57:19] (CR) CI reject: [V: -1] netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[08:58:02] (PS5) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[08:58:28] (CR) CI reject: [V: -1] netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[09:00:05] jayme: That opportune time is upon us again. Time for a Kubernetes upgrade wikikube codfw deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900).
[09:00:05] hashar and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900).
[09:01:07] (Merged) jenkins-bot: ml-services: outlink model upgrade debian and python [deployment-charts] - https://gerrit.wikimedia.org/r/890471 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[09:01:30] PROBLEM - Check that envoy is running on idm2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:06:19] (PS1) Muehlenhoff: Restore two old entries (and mark as absented) [puppet] - https://gerrit.wikimedia.org/r/890774
[09:08:03] (CR) Muehlenhoff: [C: +2] Restore two old entries (and mark as absented) [puppet] - https://gerrit.wikimedia.org/r/890774 (owner: Muehlenhoff)
[09:08:15] (PS1) Elukey: knative-serving: add network policies for domain-mapping [deployment-charts] - https://gerrit.wikimedia.org/r/890775 (https://phabricator.wikimedia.org/T327767)
[09:08:27] (CR) Muehlenhoff: admin: remove users mmarble/marble (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890497 (owner: Jbond)
[09:10:20] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.24 refs T325587 (duration: 45m 58s)
[09:10:28] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[09:12:08] !log update thirdparty/haproxy26 to version 2.6.9 for bullseye and buster (apt.wm.o)
[09:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:39] !log hashar@deploy1002 Pruned MediaWiki: 1.40.0-wmf.22 (duration: 02m 16s)
[09:13:37] !log jayme@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: maintenance
[09:14:17] !log testing HAProxy 2.6.9 in cp4052 and cp4044
[09:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:44] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:46] (CR) JMeybohm: [V: +1] wikikube: Update cluster settings for k8s 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[09:16:52] (PS6) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[09:18:46] (CR) Elukey: [C: +2] knative-serving: add network policies for domain-mapping [deployment-charts] - https://gerrit.wikimedia.org/r/890775 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:20:26] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:53] (CR) Klausman: [C: +1] knative-serving: add network policies for domain-mapping [deployment-charts] - https://gerrit.wikimedia.org/r/890775 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:22:51] SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (Volans) @Papaul ideally we should find what's the characteristic that determines the change (iDRAC version?, BIOS version?, Dell GEN?) and automatically detect that ins...
[09:24:10] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet
[09:24:27] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=prometheus2005.codfw.wmnet
[09:24:31] (PS7) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[09:24:58] ACKNOWLEDGEMENT - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service Slyngshede Setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:02] ACKNOWLEDGEMENT - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service Slyngshede Setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:02] ACKNOWLEDGEMENT - Check that envoy is running on idm2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed Slyngshede Setup https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:26:36] SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Clement_Goubert) Open→Resolved a:Clement_Goubert dry-run looks good, resolving
[09:26:48] SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[09:27:40] (CR) Muehlenhoff: [C: +1] "Nice work!"
[puppet] - https://gerrit.wikimedia.org/r/879421 (owner: Majavah)
[09:31:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in codfw: maintenance
[09:32:42] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39750/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[09:34:14] (CR) Filippo Giunchedi: [C: +1] "Just a nit inline, LGTM though! nicely done" [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[09:34:27] (CR) Nicolas Fraison: [C: +2] presto: add last 5 nodes to prod cluster [puppet] - https://gerrit.wikimedia.org/r/889995 (https://phabricator.wikimedia.org/T329525) (owner: Nicolas Fraison)
[09:35:59] (CR) Jelto: "comment in line" [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[09:36:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:38:30] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:39:36] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:40:05] SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (JMeybohm) Global depool of a/a services from codfw is done.
[09:40:23] (CR) Jelto: [V: +1 C: +2] gitlab: allow rsync between replicas [puppet] - https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: Jelto)
[09:44:27] (PS1) EoghanGaffney: Change the active gitlab replica host to be the eqiad instance [puppet] - https://gerrit.wikimedia.org/r/890779 (https://phabricator.wikimedia.org/T329930)
[09:44:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:46:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:48:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:48:54] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route depool 2 services in codfw: T329664
[09:48:59] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[09:49:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:53:58] !log jayme@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool 2 services in codfw: T329664
[09:54:00] (CR) Elukey: [C: +1] "Looks good to me for a test. Adding Moritz since this may eventually be refactored into a standard recipe."
[puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[09:54:01] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:54:02] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[09:58:04] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:58:44] SRE, Znuny: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (MoritzMuehlenhoff)
[09:59:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:59:46] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:01:30] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:02:52] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:04:00] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:04:28] (PS1) Elukey: knative-serving: fix some bugs in the chart [deployment-charts] - https://gerrit.wikimedia.org/r/890780 (https://phabricator.wikimedia.org/T327767)
[10:08:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is
not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[10:09:09] expected ^ will silence them
[10:11:32] (CR) Klausman: [C: +1] knative-serving: fix some bugs in the chart [deployment-charts] - https://gerrit.wikimedia.org/r/890780 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[10:11:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[10:13:01] (PS44) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:13:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:13:53] (PS2) Majavah: alerts: Allow customizing the git repository info [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716)
[10:13:55] (PS3) Majavah: P:toolforge::prometheus: deploy alert rules from GitLab [puppet] - https://gerrit.wikimedia.org/r/890490 (https://phabricator.wikimedia.org/T284860)
[10:14:07] (CR) Majavah: alerts: Allow customizing the git repository info (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[10:15:03] (CR) Jelto: [C: +1] "lgtm, can be merged
during switchover tomorrow" [puppet] - https://gerrit.wikimedia.org/r/890779 (https://phabricator.wikimedia.org/T329930) (owner: EoghanGaffney)
[10:16:22] (PS1) Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/890353 (https://phabricator.wikimedia.org/T330134)
[10:16:41] (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39751/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[10:18:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:20:46] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:22:12] (PS45) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:23:22] (CR) Elukey: [C: +2] knative-serving: fix some bugs in the chart [deployment-charts] - https://gerrit.wikimedia.org/r/890780 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[10:24:09] (CR)
CI reject: [V: -1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[10:24:18] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:25:23] SRE-tools, Infrastructure-Foundations: Decide which cookbooks using icinga_hosts.wait_for_optimal() should use skip_acked=True - https://phabricator.wikimedia.org/T330136 (Volans) p:Triage→Medium
[10:26:18] (PS46) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:26:51] (CR) Hashar: [C: -1] systemd timers: Add the 'After' requirement to the timer module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:51] (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39752/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[10:29:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:29:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330134
[10:30:02] T330134: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T330134
[10:30:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330134
[10:30:52] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2165 with weight 0 T330134', diff saved to https://phabricator.wikimedia.org/P44696 and previous config saved to /var/cache/conftool/dbconfig/20230221-103053-ladsgroup.json
[10:31:56] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:23] (PS1) Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - https://gerrit.wikimedia.org/r/890782
[10:33:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:34:53] (PS47) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:36:50] (PS1) Elukey: knative-serving: extend delay for domain mapping webhook [deployment-charts] - https://gerrit.wikimedia.org/r/890783 (https://phabricator.wikimedia.org/T327767)
[10:37:13] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:37:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:38:33] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39753/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [10:39:00] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert) [10:39:10] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:40:10] (03PS1) 10EoghanGaffney: Update DNS to switch gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) [10:43:50] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1003.eqiad.wmnet with OS bullseye [10:44:35] (03CR) 10Klausman: [C: 03+1] knative-serving: extend delay for domain mapping webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/890783 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:44:44] (03PS2) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 [10:44:51] (03CR) 10Clément Goubert: sre.discovery.datacenter: Logging improvements (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert) [10:46:24] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 23 hosts with reason: Reinitialize wikikube codfw with k8s 1.23 [10:46:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 23 hosts with reason: Reinitialize wikikube codfw with k8s 1.23 [10:48:11] jouncebot: nowandnext [10:48:11] For the next 5 hour(s) and 11 minute(s): Kubernetes upgrade wikikube codfw (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900) 
[10:48:11] For the next 0 hour(s) and 11 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900) [10:48:11] In 0 hour(s) and 11 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1100) [10:49:21] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:49:38] (03CR) 10Nicolas Fraison: netboot: create dedicated partman recipe for presto workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [10:49:40] (03CR) 10Nicolas Fraison: [C: 03+2] netboot: create dedicated partman recipe for presto workers [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [10:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:50:57] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1001.eqiad.wmnet with OS bullseye [10:51:25] (03CR) 10Jelto: [C: 03+1] "lgtm, can be merged during switchover tomorrow" [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney) [10:53:22] (03PS2) 10Ladsgroup: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/890353 (https://phabricator.wikimedia.org/T330134) (owner: 10Gerrit maintenance bot) [10:53:26] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/890353 (https://phabricator.wikimedia.org/T330134) (owner: 10Gerrit maintenance bot) [10:54:35] !log Starting 
s8 codfw failover from db2161 to db2165 - T330134 [10:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:38] T330134: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T330134 [10:54:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:55:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2165 to s8 primary T330134', diff saved to https://phabricator.wikimedia.org/P44697 and previous config saved to /var/cache/conftool/dbconfig/20230221-105503-ladsgroup.json [10:55:30] !log jayme@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664 [10:55:34] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 [10:55:34] (03PS3) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 [10:56:14] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1003.eqiad.wmnet with reason: host reimage [10:56:32] (03PS4) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 [10:57:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2161 T330134', diff saved to https://phabricator.wikimedia.org/P44698 and previous config saved to /var/cache/conftool/dbconfig/20230221-105714-ladsgroup.json [10:57:42] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] wikikube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [10:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44699 and 
previous config saved to /var/cache/conftool/dbconfig/20230221-105823-root.json [10:59:32] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:59:42] !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd2004.codfw.wmnet with OS bullseye [10:59:51] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd2005.codfw.wmnet with OS bullseye [11:00:04] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd2006.codfw.wmnet with OS bullseye [11:00:04] jayme: How many deployers does it take to do Kubernetes upgrade wikikube codfw deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900). [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1100) [11:01:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1003.eqiad.wmnet with reason: host reimage [11:01:06] Alerts arising from kubernetes / wikikube in codfw is me reimaging (jynus, godog) [11:01:21] yeah, I've seen them already [11:01:29] we were aware [11:01:34] (03PS16) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [11:01:39] ok, cool [11:01:43] but please ping if maintenance finishes/codfw is repooled [11:01:57] to be more on top of it in case something is bad when it shouldn't [11:02:26] sure. 
but repool won't happen before switch maintenance is done [11:04:29] (03CR) 10Elukey: [C: 03+2] knative-serving: extend delay for domain mapping webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/890783 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:05:51] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert) [11:06:33] (03PS1) 10Ladsgroup: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) [11:06:45] (JobUnavailable) firing: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:07:12] (03CR) 10CI reject: [V: 04-1] Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [11:07:14] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:08:15] (03PS2) 10Ladsgroup: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) [11:09:51] (03CR) 10David Caro: [C: 03+2] node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:10:35] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond) [11:11:24] !log jayme@cumin1001 START - Cookbook
sre.hosts.downtime for 2:00:00 on kubetcd2006.codfw.wmnet with reason: host reimage [11:11:28] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd2005.codfw.wmnet with reason: host reimage [11:12:40] !log rolling upgrade to HAproxy 2.6.9 on ulsfo [11:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44700 and previous config saved to /var/cache/conftool/dbconfig/20230221-111328-root.json [11:13:45] (03PS1) 10Jbond: puppetmaster: offline 2003 for switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890792 (https://phabricator.wikimedia.org/T327991) [11:13:53] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2006.codfw.wmnet with reason: host reimage [11:14:52] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) [11:15:32] (03CR) 10JMeybohm: [C: 03+1] sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:15:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:21] (03CR) 10Jbond: [C: 03+2] puppetmaster: offline 2003 for switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890792 (https://phabricator.wikimedia.org/T327991) (owner: 10Jbond) [11:16:22] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2005.codfw.wmnet with reason: host reimage 
[11:16:31] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd2004.codfw.wmnet with reason: host reimage [11:16:35] (03CR) 10CI reject: [V: 04-1] sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:16:42] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond) [11:16:56] (03PS1) 10Elukey: ml-services: set proper EventGate stream values [deployment-charts] - 10https://gerrit.wikimedia.org/r/890795 (https://phabricator.wikimedia.org/T327767) [11:16:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1003.eqiad.wmnet with OS bullseye [11:17:17] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1 NOOP 21): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39754/console" [puppet] - 10https://gerrit.wikimedia.org/r/890792 (https://phabricator.wikimedia.org/T327991) (owner: 10Jbond) [11:18:13] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi) [11:18:36] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) [11:19:09] (03PS2) 10Elukey: sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) [11:19:39] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2004.codfw.wmnet with reason: host reimage [11:21:47] (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 
(https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:23:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [11:23:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [11:24:23] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:24:41] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: T329664 [11:24:45] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 [11:25:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:25:15] I've aborted the cookbook on purpose [11:25:25] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:25:25] !log jayme@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664 [11:25:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [11:25:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:26:00] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:26:17] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: Allow customizing the git repository info [puppet] - 10https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [11:26:22] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubetcd2006.codfw.wmnet with OS bullseye [11:26:29] (03PS1) 10Slyngshede: data.yaml add sgimeno to deployment group. [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) [11:26:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) Approval to deployment group is required from @thcipriani according to the data.yaml info. [11:27:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd2005.codfw.wmnet with OS bullseye [11:28:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44701 and previous config saved to /var/cache/conftool/dbconfig/20230221-112833-root.json [11:28:53] (03CR) 10Klausman: [C: 03+1] ml-services: set proper EventGate stream values [deployment-charts] - 10https://gerrit.wikimedia.org/r/890795 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:30:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [11:30:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/890439 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:30:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:30:58] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST 
certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:31:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:32:08] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd2004.codfw.wmnet with OS bullseye [11:32:37] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:32:42] (03CR) 10Elukey: [C: 03+2] ml-services: set proper EventGate stream values [deployment-charts] - 10https://gerrit.wikimedia.org/r/890795 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:33:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:34:03] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:34:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:34:35] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubemaster2001.codfw.wmnet with OS bullseye [11:35:16] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[11:38:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:39:00] (03CR) 10JMeybohm: admin_ng: Update wikikube-codfw settings to k8s 1.23 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [11:39:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:18] AS64602 BGP errors is me as well - T329664 [11:40:18] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 [11:40:58] 10SRE, 10Znuny, 10serviceops-collab: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (10MoritzMuehlenhoff) [11:41:21] (03PS1) 10Muehlenhoff: clamd.conf: Remove some config entries [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) [11:43:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44702 and previous config saved to /var/cache/conftool/dbconfig/20230221-114338-root.json [11:45:47] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [11:46:34] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster2001.codfw.wmnet with reason: host reimage [11:47:29] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:49:05] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on kubemaster2001.codfw.wmnet with reason: host reimage [11:49:38] (03PS1) 10Jbond: systemd::timer::job: update email subject and body [puppet] - 10https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) [11:49:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [11:50:46] (03PS15) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [11:51:06] (03PS3) 10Clément Goubert: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - 10https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) [11:51:18] (03PS48) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [11:51:29] (03PS1) 10Slyngshede: C:idm::deployment use Redis password [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) [11:51:46] (03CR) 10Jbond: [C: 04-1] "change no longer required, see comments" [puppet] - 10https://gerrit.wikimedia.org/r/890497 (owner: 10Jbond) [11:51:48] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert) [11:51:57] (03Abandoned) 10Jbond: admin: remove users mmarble/marble [puppet] - 10https://gerrit.wikimedia.org/r/890497 (owner: 10Jbond) [11:52:06] (03PS5) 10Jbond: admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [11:52:16] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1001.eqiad.wmnet with reason: host reimage [11:52:39] (03PS9) 
10Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 [11:55:22] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1001.eqiad.wmnet with reason: host reimage [11:56:10] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39755/console" [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) (owner: 10Slyngshede) [11:56:59] (03CR) 10Volans: "FYI all those records have a 1H TTL, depending on how quicker you want the failover to happen and depending if you plan to run the sre.dns" [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney) [11:57:02] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39756/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:57:22] jayme: can I do a couple of mw deploys or you prefer I'd do later? [11:57:32] (nothing urgent) [11:57:59] Amir1: You can try now if you like. 
Deploying to wikikube codfw will fail, though [11:58:27] probably with very strange errors as the first control-plane is about to come up again [11:58:28] hmm, I'd wait then, it's not anything important, ping me once done please [11:58:34] ack [11:58:43] (03PS1) 10Alexandros Kosiaris: codfw: Add new WikiKube IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890802 (https://phabricator.wikimedia.org/T326617) [12:00:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (once the group owner approval is in)" [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) (owner: 10Slyngshede) [12:00:15] (03PS1) 10Alexandros Kosiaris: WikiKube codfw: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617) [12:00:54] (03PS2) 10Slyngshede: C:idm::deployment use Redis password [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) [12:02:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:02:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39757/console" [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) (owner: 10Slyngshede) [12:02:51] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:02:56] (03PS1) 10Alexandros Kosiaris: WikiKube eqiad: Add the new larger IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890804 (https://phabricator.wikimedia.org/T326617) [12:02:58] (03PS1) 10Alexandros Kosiaris: WikiKube eqiad: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890805 (https://phabricator.wikimedia.org/T326617) [12:03:07] (03CR) 10Slyngshede: [V: 03+1 C: 
03+2] C:idm::deployment use Redis password [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) (owner: 10Slyngshede) [12:03:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:04:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] codfw: Add new WikiKube IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890802 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [12:04:34] (03CR) 10Jbond: [C: 04-1] "Its hard to tell from the task the error this is attempting to fix, however it doesn't seem like the correct way forward. perhaps the iss" [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott) [12:04:59] (03Merged) 10jenkins-bot: codfw: Add new WikiKube IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890802 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [12:05:27] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:05:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (one typo inline)" [puppet] - 10https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) (owner: 10Jbond) [12:05:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubemaster2001.codfw.wmnet with OS bullseye [12:05:56] (03PS1) 10Slyngshede: C:idm::deployment missing comma [puppet] - 10https://gerrit.wikimedia.org/r/890806 [12:06:00] (03CR) 10Jbond: [C: 03+2] admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [12:06:09] (03CR) 10Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert) [12:06:13] (03CR) 10Clément Goubert: [C: 03+2] 
sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert) [12:06:35] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubemaster2002.codfw.wmnet with OS bullseye [12:06:35] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment missing comma [puppet] - 10https://gerrit.wikimedia.org/r/890806 (owner: 10Slyngshede) [12:08:02] (03Merged) 10jenkins-bot: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert) [12:08:31] (03PS16) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:08:47] (03PS2) 10Jbond: systemd::timer::job: update email subject and body [puppet] - 10https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) [12:09:02] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) (owner: 10Jbond) [12:09:06] (03CR) 10Ayounsi: WikiKube codfw: Remove the old IP space (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [12:09:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] systemd::timer::job: update email subject and body [puppet] - 10https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) (owner: 10Jbond) [12:09:51] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:10:07] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received: 
/v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v1/ [12:10:07] nguage}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given p [12:10:07] timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v1/mt/{from}/{to} (Machine translate an HTML fragment using TestClient.) timed out before a response was received: /v1/mt/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient.) timed out before a response was received: /v1/list/pai [12:10:07] /{to} (Get the tools between two language pairs) timed out before a response was received: /v1/list/languagepairs (Get all the language pairs) timed out before a response was received: /v1/list/{tool} (Get the MT tool between two language pairs) timed out before a response was received: /v1/list/{tool}/{from}/{to} (Get the MT tool between two language pairs) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlangua [12:10:07] le} (Translate enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an H [12:10:08] ment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Su [12:10:08] urce sections to translate) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /_info/name (retrieve service name) timed out before a response was received: /_info/version (retrieve service version) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:10:18] (ProbeDown) firing: (4) Service eventgate-main:4492 has failed probes (http_eventgate-main_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:25] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codf [12:10:25] kubernetes2008.codfw.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes2010.codfw.wmnet, kubernetes2013. 
[12:10:25] net, kubernetes2020.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes20 https://wikitech.wikimedia.org/wiki/PyBal [12:10:43] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes2010.codfw.wmnet, [12:10:43] tes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2011.codfw.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes2007.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2006. [12:10:43] net, kubernetes2012.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes20 https://wikitech.wikimedia.org/wiki/PyBal [12:10:45] no impact, right? 
[12:10:53] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:11:00] we just got some paging issues [12:11:06] jynus: shouldn't, it's depooled [12:11:09] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fc1c0221b70, Connection to wikifeeds.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Wikifeeds [12:11:10] jayme: ^ [12:11:12] ok, acking [12:11:13] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:11:14] here too, checking [12:11:29] godog: help me double check on monitoring 0 impact [12:11:45] for sure jynus [12:11:55] it should be 0 impact indeed, DC has been depooled [12:11:57] (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [12:12:07] yeah can confirm, I'm not seeing any impact so far [12:12:08] I acked on splunk [12:12:09] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:12:18] (ProbeDown) firing: (15) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:32] (03CR) 10TheDJ: [C: 03+1] varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [12:12:35] !log add 
10.194.128.0/18 to kubernetes-ipv4 prefix-list for codfw. T326617 [12:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:39] T326617: Decide on new Pod and Sevice IPv4 ranges for wikikube clusters - https://phabricator.wikimedia.org/T326617 [12:12:43] indeed traffic is flatlined at 0 [12:13:07] ok to ack/silence the alerts that are paging ? [12:13:23] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response w [12:13:23] ved: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [12:13:31] PROBLEM - eventgate-logging-external LVS codfw on eventgate-logging-external.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fa304688cf8, Connection to eventgate-logging-external.svc.codfw.wmnet timed out. 
(connect timeout=15)): /?spec https://wikitec [12:13:31] dia.org/wiki/Event_Platform/EventGate [12:13:31] godog: yes [12:13:37] godog: at least ack them or they are gonna page everybody [12:13:48] (03Merged) 10jenkins-bot: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [12:13:57] akosiaris: ack (ha ha) thank you [12:14:13] vgutierrez: yeah the VO page has been ack'd already by jynus, I was referring to alerts.w.o [12:14:16] lol [12:14:25] I think I need a beer already after that joke [12:14:42] (03PS4) 10Clément Goubert: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - 10https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) [12:14:47] PROBLEM - eventgate-analytics-external LVS codfw on eventgate-analytics-external.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7ff350b51c50, Connection to eventgate-analytics-external.svc.codfw.wmnet timed out. 
(connect timeout=15)): /?spec https://w [12:14:47] wikimedia.org/wiki/Event_Platform/EventGate [12:15:03] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:15:08] haha you are welcome [12:15:11] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/data/css/mobile/pagelib (Get CSS bundle from wikimedia-page-library) timed out before a response was received: /{domain}/v1/data/c [12:15:11] e/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: / [12:15:11] /v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve [12:15:11] ge via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:15:18] (ProbeDown) firing: (7) Service eventgate-main:4492 has failed probes (http_eventgate-main_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:57] I will keep an eye on phab too, in case some deployer finds issues or something [12:16:05] vgutierrez: Well it's 5PM in Bhutan [12:16:31] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster2002.codfw.wmnet with reason: host reimage [12:16:43] sorry for the noise :/ [12:16:50] (JobUnavailable) firing: (11) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:17:02] (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - 10https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) (owner: 10Clément Goubert) [12:17:07] jayme: no worries, that's what we're here for [12:17:27] (ProbeDown) firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:39] not sure if I should silence all probedown alerts from codfw as services are depooled anyways. wdyt godog? 
[12:18:10] jayme: mmhh interesting, I'm not sure offhand [12:18:23] happy to discuss later though, I was in the middle of lunch [12:18:27] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 554962312 and 181 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:18:37] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:18:46] unfortunately there is no clever way to filter kubernetes services I guess.. [12:19:00] indeed [12:19:06] ok going back to lunch, ttyl [12:19:10] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster2002.codfw.wmnet with reason: host reimage [12:19:11] oh well, yeah. Get back to lunch then :) not urgent obviously [12:19:17] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - 10https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) (owner: 10Clément Goubert) [12:19:33] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: /{format}/ (mass-energy equivalence (svg)) timed out before a response wa [12:19:33] ed: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [12:19:44] incident autoresolved [12:20:05] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: 
POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 716064 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:20:36] although there are still some failed probes [12:21:45] (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:21:49] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured ar [12:21:49] r April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with aggregated=true)) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [12:21:49] icles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News [12:21:49] ) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed
out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [12:21:59] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: reload-acme-chief-backend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:23] (03PS1) 10Slyngshede: C:idm::deployment missing comma in Ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/890807 [12:23:11] PROBLEM - eventgate-analytics LVS codfw on eventgate-analytics.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [12:23:17] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:23:21] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/data/css/mobile/pagelib (Get CSS bundle from wikimedia-page-library) timed out before a response was received: /{domain}/v1/data/i [12:23:21] (Get i18n strings for the Page Content Service) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was [12:23:21] d: /{domain}/v1/page/media-list/{title} (Get media list from test 
page) timed out before a response was received: /{domain}/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get st [12:23:21] talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:23:23] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info/name (retrieve service name) timed out before a response was received: /_info/home (redirect to the home page) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{t [12:23:23] ormat}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond bad request for an unsupported format) timed out before a response was received https://wikitech.wikimedia.or [12:23:23] roton [12:23:32] hmmm acme-chief is screaming? 
[12:23:37] * vgutierrez checking [12:25:39] PROBLEM - Check unit status of reload-acme-chief-backend on acmechief1001 is CRITICAL: CRITICAL: Status of the systemd unit reload-acme-chief-backend https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:27] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: /{format}/ (mass-energy equivalence (svg)) timed out before a response wa [12:26:27] ed: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [12:26:45] (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:27:09] vgutierrez: let me know if I can help in any way [12:27:19] jynus: -sre [12:28:15] PROBLEM - eventgate-main LVS codfw on eventgate-main.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wiki [12:28:15] g/wiki/Event_Platform/EventGate [12:29:39] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:29:57] PROBLEM - eventgate-logging-external LVS codfw on eventgate-logging-external.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was receiv [12:29:57] ://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [12:30:06] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39758/console" [puppet] - 10https://gerrit.wikimedia.org/r/890807 (owner: 10Slyngshede) [12:30:51] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:42] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment missing comma in Ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/890807 (owner: 10Slyngshede) [12:31:45] (JobUnavailable) firing: (13) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:34:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubemaster2002.codfw.wmnet with OS bullseye [12:34:32] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: T329664 [12:34:37] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 [12:35:42] !log jayme@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664 [12:36:27] RECOVERY - Check unit status of 
reload-acme-chief-backend on acmechief1001 is OK: OK: Status of the systemd unit reload-acme-chief-backend https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:36:42] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2009.codfw.wmnet with OS bullseye [12:36:45] (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:36:57] (ProbeDown) firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:35] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f64db51fba8, Connection to mathoid.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Mathoid [12:38:45] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f480741ccc0, Connection to sessionstore.svc.codfw.wmnet timed out. 
(connect timeout=15)): /openapi https://www.mediawiki.org/wiki/Kask [12:39:59] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2005.codfw.wmnet with OS bullseye [12:40:00] (03CR) 10Muehlenhoff: C:idm::deployment missing comma in Ferm rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890807 (owner: 10Slyngshede) [12:40:03] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2006.codfw.wmnet with OS bullseye [12:40:07] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2016.codfw.wmnet with OS bullseye [12:40:10] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2015.codfw.wmnet with OS bullseye [12:41:37] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fd8f3e5aba8, Connection to termbox.svc.codfw.wmnet timed out. 
(connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:41:40] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1001.eqiad.wmnet with OS bullseye [12:41:45] (JobUnavailable) firing: (13) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:41:57] (ProbeDown) firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:36] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2010.codfw.wmnet with OS bullseye [12:43:43] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2020.codfw.wmnet with OS bullseye [12:43:52] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2023.codfw.wmnet with OS bullseye [12:44:31] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: /{format}/ (mass-energy equivalence (svg)) timed out before a response wa [12:44:31] ed: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [12:46:45] (JobUnavailable) firing: (14) Reduced availability 
for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:47:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:49:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2007.codfw.wmnet with OS bullseye [12:50:37] (03PS2) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) [12:50:45] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2008.codfw.wmnet with OS bullseye [12:50:46] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1002.eqiad.wmnet with OS bullseye [12:50:48] (03CR) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [12:50:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:51:15] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2013.codfw.wmnet with OS bullseye [12:51:28] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2005.codfw.wmnet with reason: host reimage [12:51:34] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2016.codfw.wmnet with reason: host reimage [12:51:39] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2015.codfw.wmnet with reason: host reimage [12:51:39] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2006.codfw.wmnet with reason: host reimage [12:51:45] 
(JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:03] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7ff7c7c32b70, Connection to mathoid.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Mathoid [12:52:34] (03CR) 10Muehlenhoff: netboot: create dedicated partman recipe for presto workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [12:53:19] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2011.codfw.wmnet with OS bullseye [12:54:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2005.codfw.wmnet with reason: host reimage [12:54:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2012.codfw.wmnet with OS bullseye [12:54:33] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f93a3f66cc0, Connection to termbox.svc.codfw.wmnet timed out. 
(connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:55:35] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2014.codfw.wmnet with OS bullseye [12:56:41] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2006.codfw.wmnet with reason: host reimage [12:56:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2016.codfw.wmnet with reason: host reimage [12:56:45] (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:57:32] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2022.codfw.wmnet with OS bullseye [12:57:59] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2024.codfw.wmnet with OS bullseye [12:58:08] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage [12:58:57] (03PS1) 10Nicolas Fraison: presto.coordinator: reduce max heap size of coordinator [puppet] - 10https://gerrit.wikimedia.org/r/890810 [12:58:59] (03PS8) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [12:59:05] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage [12:59:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2015.codfw.wmnet with reason: host reimage [12:59:07] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2010.codfw.wmnet with reason: host reimage [12:59:15] (03CR) 
10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [13:01:23] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage [13:01:45] (JobUnavailable) firing: (15) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:01:57] (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage [13:03:29] (03PS1) 10Ayounsi: Use port 2222 for management router ssh [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) [13:04:32] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage [13:04:39] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2007.codfw.wmnet with reason: host reimage [13:06:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2008.codfw.wmnet with reason: host reimage [13:06:20] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage [13:06:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2010.codfw.wmnet with reason: host reimage [13:06:36] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
kubernetes2013.codfw.wmnet with reason: host reimage [13:06:45] (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:06:54] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10JMeybohm) This might be caused by {T330048} [13:08:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2005.codfw.wmnet with OS bullseye [13:09:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage [13:09:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2007.codfw.wmnet with reason: host reimage [13:09:30] (03PS2) 10Ayounsi: Management routers: move ssh port to 2222 [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) [13:10:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage [13:10:14] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:10:40] PROBLEM - confd service on kubernetes2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.135: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:10:48] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:11:00] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage [13:11:42] PROBLEM - dhclient process on kubernetes2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.135: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:11:45] (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:11:45] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage [13:12:09] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2016.codfw.wmnet with OS bullseye [13:12:28] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage [13:12:31] (03Abandoned) 10Ayounsi: Management routers: move ssh port to 2222 [homer/public] - 10https://gerrit.wikimedia.org/r/785274 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [13:12:38] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:13:02] PROBLEM - puppet last run on kubernetes2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.135: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:13:17] (03CR) 10Ayounsi: "Note that this is the 
end result, the change will be done manually on the devices and verified afterwards." [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [13:13:22] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2009.codfw.wmnet with OS bullseye [13:13:25] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage [13:14:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage [13:14:19] RECOVERY - confd service on kubernetes2020 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:14:45] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2009.codfw.wmnet with OS bullseye [13:15:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2006.codfw.wmnet with OS bullseye [13:16:12] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2008.codfw.wmnet with reason: host reimage [13:16:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage [13:16:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2015.codfw.wmnet with OS bullseye [13:16:45] (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:17:03] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2023.codfw.wmnet with OS bullseye [13:18:36] !log jayme@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage [13:18:48] PROBLEM - Host kubernetes2020 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:12] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage [13:19:20] RECOVERY - Host kubernetes2020 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [13:20:30] PROBLEM - DPKG on kubernetes2011 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.109: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:21:00] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage [13:21:14] RECOVERY - dhclient process on kubernetes2020 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:21:14] RECOVERY - puppet last run on kubernetes2020 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:21:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2020.codfw.wmnet with OS bullseye [13:21:45] (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:22:07] ACKNOWLEDGEMENT - MD RAID on kubernetes2012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.110. 
Check system logs on 10.192.32.110 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T330150 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:22:11] 10SRE, 10ops-codfw: Degraded RAID on kubernetes2012 - https://phabricator.wikimedia.org/T330150 (10ops-monitoring-bot) [13:22:46] false positive ^ could be useful to tune the script [13:23:09] I will add that ticket to infra team [13:23:34] PROBLEM - Check for large files in client bucket on kubernetes2008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.197: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [13:24:17] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1002.eqiad.wmnet with OS bullseye [13:24:42] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo) [13:24:54] RECOVERY - Check for large files in client bucket on kubernetes2008 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [13:25:22] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2010.codfw.wmnet with OS bullseye [13:25:34] (03PS1) 10Slyngshede: C:idm::deployment escape password and use 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/890814 [13:25:48] PROBLEM - dhclient process on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:26:14] PROBLEM - confd service on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:26:33] !log elukey@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host kubernetes2007.codfw.wmnet with OS bullseye [13:26:33] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2024.codfw.wmnet with OS bullseye [13:26:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39759/console" [puppet] - 10https://gerrit.wikimedia.org/r/890814 (owner: 10Slyngshede) [13:26:45] (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:27:05] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment escape password and use 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/890814 (owner: 10Slyngshede) [13:27:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 (owner: 10Clément Goubert) [13:27:30] PROBLEM - puppet last run on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:27:33] (03PS1) 10Muehlenhoff: Set ferm access for redis [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) [13:27:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [13:28:02] PROBLEM - MD RAID on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:28:11] 10SRE, 10SRE-tools, 
10Infrastructure-Foundations: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo) This looks like it could be avoided with some extra check, maybe? I added @jbond and @Volans as I think they were involved in th... [13:28:31] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2013.codfw.wmnet with OS bullseye [13:29:17] 10SRE-tools, 10Infrastructure-Foundations, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo) [13:29:18] RECOVERY - confd service on kubernetes2014 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:30:21] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2009.codfw.wmnet with reason: host reimage [13:31:29] (03CR) 10Slyngshede: [C: 03+2] Set ferm access for redis [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [13:31:40] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [13:31:44] (03CR) 10Slyngshede: [C: 03+2] Set ferm access for redis [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [13:31:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2012.codfw.wmnet with OS bullseye [13:31:57] (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:24] I've silenced the jobunavailable for swagger checks in codfw [13:32:49] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 
10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [13:33:11] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2008.codfw.wmnet with OS bullseye [13:33:25] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2009.codfw.wmnet with reason: host reimage [13:33:52] 10SRE-tools, 10Infrastructure-Foundations, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo) Adding @JMeybohm in case it was just a fluke (reimage taking more time than usual). [13:34:12] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:34] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2022.codfw.wmnet with OS bullseye [13:35:14] PROBLEM - Host kubernetes2014 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:44] RECOVERY - Host kubernetes2014 is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms [13:36:12] 10SRE-tools, 10Infrastructure-Foundations, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10JMeybohm) /cc @elukey this is one of "yours" :) [13:36:48] RECOVERY - DPKG on kubernetes2011 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:36:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2011.codfw.wmnet with OS bullseye [13:37:14] (03CR) 10Volans: "couple of comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [13:37:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - 
https://phabricator.wikimedia.org/T277677 (10elukey) In the meantime we have created two cookbook: * sre.k8s.upgrade-cluster.py * sre.k8s.wipe-cluster.py [13:37:44] RECOVERY - puppet last run on kubernetes2014 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:38:10] RECOVERY - MD RAID on kubernetes2014 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:38:38] RECOVERY - dhclient process on kubernetes2014 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:38:56] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2014.codfw.wmnet with OS bullseye [13:38:58] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 215 hosts with reason: codfw row B upgrade [13:39:26] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [13:41:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 215 hosts with reason: codfw row B upgrade [13:41:31] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:00:00 on 215 host(s... 
[13:41:57] (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:30] (03PS1) 10Muehlenhoff: Fix condition for including haveged [puppet] - 10https://gerrit.wikimedia.org/r/890816 [13:44:36] (03Merged) 10jenkins-bot: admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [13:45:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10fgiunchedi) Following up for silences, especially the ones paging in production (`ProbeDown`). * ProbeDown: the most e... [13:46:33] (03CR) 10Nicolas Fraison: [C: 03+1] "LGTM, would be better to have also a review from @Ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [13:48:28] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10JArguello-WMF) 05Open→03Resolved [13:48:34] !log stop kafka on kafka-logging[2002,2004].codfw.wmnet - T327991 [13:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:39] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [13:49:35] (03PS1) 10Ladsgroup: auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 [13:49:44] (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 
(owner: 10Ladsgroup) [13:49:48] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:49:51] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:49:55] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:49:58] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:50:02] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:50:18] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:50:22] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:50:29] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:50:39] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:51:28] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:51:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2009.codfw.wmnet with OS bullseye [13:52:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:53:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:54:10] !log depooling elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080].codfw.wmnet for switch maintenance - T327991 [13:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:14] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [13:54:14] !log depool doh2002 - T327991 [13:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:54] !log depooling wcqs2001.codfw.wmnet for switch maintenance - T327991 [13:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:22] !log depooling wdqs[2005,2007,2010].codfw.wmnet for switch maintenance - T327991 [13:55:24] 
10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Vgutierrez) [13:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:55:49] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:55:57] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:56:00] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:56:34] ryankemper, inflatador: see the 3 depool above ^^^ (and please check that they have been repooled at some point) [13:56:57] (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:57:58] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:58:17] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:58:29] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:58:32] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:58:35] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:58:40] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:58:42] (03PS1) 10Vgutierrez: admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/890822 [13:58:47] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [13:58:56] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:05] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[13:59:06] (03CR) 10Ayounsi: [C: 03+1] admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/890822 (owner: 10Vgutierrez) [13:59:11] (03CR) 10Jcrespo: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/890822 (owner: 10Vgutierrez) [13:59:22] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:59:28] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:59:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [13:59:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:59:45] (03CR) 10Vgutierrez: [C: 03+2] admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/890822 (owner: 10Vgutierrez) [13:59:48] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:59:57] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:00:05] jayme: Your horoscope predicts another unfortunate Kubernetes upgrade wikikube codfw deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900). [14:00:05] Deploy window codfw row B switches upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400) [14:00:05] !log depooling codfw - T327991 [14:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:10] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [14:00:27] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:00:28] my ssh is really slow, I need to downtime s2 in codfw [14:00:40] (ack no patches for deploy) [14:00:57] sheesh, four simultaneously active windows in the deployments calendar [14:01:00] jouncebot: now [14:01:00] For the next 1 hour(s) and 59 minute(s): Kubernetes upgrade wikikube codfw (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900) [14:01:00] For the next 1 hour(s) and 59 minute(s): codfw row B switches upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400) [14:01:00] For the next 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400) [14:01:00] For the next 0 hour(s) and 59 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400) [14:01:00] TheresNoTime: it should be cancelled anyway, as there is going to be hw maintenance [14:01:08] * Lucas_WMDE not deploying [14:02:30] jynus: Do you know how to do what Amir1 needs? 
[14:02:32] sudo cookbook sre.hosts.downtime --hours 1 -r "codfw maint (T327991)" 'A:db-section-s2' [14:02:40] ok I'll do it [14:02:44] someone needs to do this [14:02:45] thanks [14:02:48] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: codfw maint (T327991) [14:02:51] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [14:03:01] Amir1: running [14:03:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: codfw maint (T327991) [14:03:13] thanks [14:03:36] np <3 [14:04:35] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:06:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991) [14:06:40] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [14:06:50] finally logged in to cumin [14:06:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991) [14:06:57] (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 27 hosts with reason: codfw maint (T327991) [14:07:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 27 
hosts with reason: codfw maint (T327991) [14:07:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [14:11:25] (03CR) 10Ladsgroup: "recheck" [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup) [14:11:57] (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:12:57] (03PS1) 10Elukey: Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664) [14:13:50] I've silenced the probedown for known-down k8s services ^ [14:15:23] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 [14:15:25] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) [14:15:27] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 [14:15:29] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) [14:15:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:15:51] (03PS1) 10Slyngshede: C:idm::deployment Redis binding and ports [puppet] - 10https://gerrit.wikimedia.org/r/890829 [14:15:58] (03PS2) 10Elukey: Add 
kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664) [14:16:24] log asw-b-codfw> request system reboot all-members - T327991 [14:16:31] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [14:17:09] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond) [14:17:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39760/console" [puppet] - 10https://gerrit.wikimedia.org/r/890829 (owner: 10Slyngshede) [14:17:20] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [14:17:22] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [14:17:24] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond) [14:17:37] going down in < 1 min [14:17:56] ack [14:19:25] (03CR) 10Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott) [14:20:11] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) [14:20:13] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 [14:20:15] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if 
supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) [14:20:46] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:20:47] (03PS5) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 [14:21:11] looking good [14:21:26] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) @Volans thanks one thing i know for sure is that the R650 and R750 are Dell 15th generation and the other like R440 and R740 are 14th generation. I will check... [14:21:30] (virtual-chassis crash) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [14:21:57] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:02] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond) [14:22:04] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [14:22:11] Amir1: not a big deal, but es replicas complaining, may need manual restart later [14:22:26] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 5 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:22:35] (restart of replication, not service or host) [14:22:38] PROBLEM - Kafka Broker Under Replicated Partitions on 
kafka-logging2003 is CRITICAL: 79 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [14:22:51] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [14:23:00] jynus: oh thanks [14:23:06] (03PS1) 10Elukey: role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664) [14:23:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 129, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:23:17] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:23:18] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:23:22] PROBLEM - MariaDB Replica IO: es4 on es2022 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2021.codfw.wmnet (110 Connection timed out) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:23:26] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:23:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2001 is CRITICAL: 79 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [14:23:37] I should downtime it [14:23:47] I wonder if that will page [14:23:58] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2006.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:00] in any case, not a big deal [14:24:00] Amir1: same command as earlier but with section-es4 ? 
[14:24:17] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:24:20] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1083 threshold =0.2 breach: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1664, active_shards: 3889, relocating_shards: 0, initializing_shards: 142, unassigned_shards: 941, delayed_unassigned_shards: 0, number [14:24:20] ing_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.21802091713596 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:24:21] (03PS1) 10Elukey: conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664) [14:24:21] yeah but I restart my pc :D [14:24:26] lol k [14:24:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: codfw maint (T327991) [14:24:43] done & [14:24:44] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [14:24:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: codfw maint (T327991) [14:24:45] claime: ah, it doesn't page because it is not primary, so should be ok [14:24:53] a'ight [14:24:55] although not sure if that makes sense anymore [14:24:58] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:25:03] (KubernetesCalicoDown) firing: ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:25:16] RECOVERY - eventgate-logging-external LVS codfw on eventgate-logging-external.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f910182eb00: Failed to establish a new connection: [Errno 113] No route to host): /robots.txt https://wikitech.wikimedia.org/wiki/Ev [14:25:16] form/EventGate [14:25:31] (03CR) 10Ladsgroup: "recheck" [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup) [14:25:33] (03PS1) 10Elukey: Add kubernetes202[3,4] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) [14:25:50] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1083 threshold =0.2 breach: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1664, active_shards: 3889, relocating_shards: 0, initializing_shards: 142, unassigned_shards: 941, delayed_unassigned_shards: 0 [14:25:50] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.21802091713596 Brian_King T327991 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:25:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/890829 
(owner: 10Slyngshede) [14:25:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:26:45] (JobUnavailable) firing: (15) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:57] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:58] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert) [14:27:01] dbproxies will also need a reload, they failed over [14:27:17] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 [14:27:22] ACKNOWLEDGEMENT - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service,wmf_auto_restart_airflow-webserver@search.service Brian_King T327970 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:22] ACKNOWLEDGEMENT - Checks that the airflow database for airflow search is working properly on 
an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed Brian_King T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [14:27:23] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Brian_King T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [14:27:40] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [14:27:47] I'll take care of it [14:28:06] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1664, active_shards: 4030, relocating_shards: 0, initializing_shards: 140, unassigned_shards: 802, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, n [14:28:06] _in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.05390185036202 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:28:08] PROBLEM - configured eth on lvs2010 is CRITICAL: ens3f0np0 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:28:49] (03Merged) 10jenkins-bot: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert) [14:28:58] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:29:09] !log installing NSS security updates [14:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:36] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:29:54] (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [14:29:58] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [14:30:00] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:30:05] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:30:35] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:30:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:30:58] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:30:59] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) [14:31:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:31:05] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 [14:31:10] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) [14:31:23] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment Redis binding and ports [puppet] - 10https://gerrit.wikimedia.org/r/890829 (owner: 10Slyngshede) [14:31:26] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [14:31:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:31:32] (03CR) 10JMeybohm: [C: 03+1] Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:31:45] (JobUnavailable) firing: 
(116) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:32:31] here [14:32:42] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [14:32:49] brett: see -sre, all good [14:32:49] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond) [14:32:57] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [14:32:58] RECOVERY - MariaDB Replica IO: es4 on es2022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:33:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:33:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:33:19] ah, thanks godog [14:33:28] yw [14:33:58] (KubernetesCalicoDown) resolved: (10) ml-serve-ctrl2001.codfw.wmnet:9091 is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:34:01] (03CR) 10JMeybohm: [C: 03+1] role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:34:04] (03CR) 10Andrew Bogott: "For additional context, this is the patch that "doesn't work". If there's an obvious flaw in that I would definitely prefer to fix that! " [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott) [14:34:08] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01925 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:34:12] (03CR) 10JMeybohm: [C: 03+1] conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:34:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:34:54] (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [14:34:58] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:34:58] (KubernetesCalicoDown) resolved: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not 
running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:35:00] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:19] (ProbeDown) resolved: (36) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:35] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:35:38] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:49] (03CR) 10JMeybohm: [C: 03+1] "IPs match" [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:36:01] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=prometheus2005.codfw.wmnet [14:36:10] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2002.codfw.wmnet [14:36:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:36:30] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:30] (virtual-chassis crash) resolved: Device 
asw-b-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [14:36:42] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:45] (JobUnavailable) resolved: (116) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:16] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ayounsi) Upgrade went smoothly, less than 15min hard downtime here too. [14:37:54] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi) [14:38:00] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:00] RECOVERY - configured eth on lvs2010 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:38:31] (Device rebooted) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [14:38:40] jynus: I'm not seeing any host with broken replication/lagged/not-responding in https://orchestrator.wikimedia.org/web/clusters; besides reloading haproxy, anything I should do? 
[14:39:05] no, they caught up replication this time well, I think [14:39:17] I think last time one host didn't fully recover automatically [14:40:01] on backup side, I need to restart es5 backup [14:40:11] (03CR) 10Ayounsi: [C: 03+1] Add kubernetes202[3,4] to its k8s_neighbors list (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:42:29] (03PS2) 10Elukey: Add kubernetes202[3,4] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) [14:42:44] (03CR) 10Elukey: Add kubernetes202[3,4] to its k8s_neighbors list (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:43:07] (03CR) 10Elukey: [C: 03+2] Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:43:19] 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Ladsgroup) [14:43:36] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009872 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:43:52] (03PS1) 10Jbond: Revert "puppetmaster: offline 2003 for switch upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/890846 [14:44:07] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppetmaster: offline 2003 for switch upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/890846 (owner: 10Jbond) [14:44:32] (03CR) 10Elukey: [C: 03+2] Add kubernetes202[3,4] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [14:44:47] (03PS2) 10Elukey: role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 
10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664) [14:45:17] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond) [14:45:28] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:30] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:30] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:37] 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Ladsgroup) It's a backup source so it doesn't even need depooling, just downtime, maybe gracefully shutting down mysql but that's not even that important. cc @jcrespo [14:47:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [14:47:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [14:50:09] (03PS1) 10Elukey: admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767) [14:50:35] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster 
logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:50:52] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [14:51:13] (03PS1) 10JMeybohm: restbase: Update kubernetes ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/890838 (https://phabricator.wikimedia.org/T326617) [14:51:24] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi) [14:53:40] 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10jcrespo) @Papaul you can ping me directly, as Manuel is off these days for backup sources. If only network is going to be down for this server for a small amount of time, just go ahead at any time. If it is going to be... 
[14:54:03] (03CR) 10Elukey: [C: 03+1] restbase: Update kubernetes ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/890838 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [14:54:50] (03CR) 10JMeybohm: [C: 03+2] restbase: Update kubernetes ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/890838 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [14:57:38] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:58:31] (Device rebooted) resolved: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [15:00:40] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:00:40] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:00:40] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:00:44] 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10jcrespo) I checked db2099 and saw some stats about packet errors when using the link to saturation: {F36864017} But compared to, eg db2098, those seem expected when running at full bandwidth: {F36864019} [15:01:40] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:01:40] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:01:40] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.292 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:02:26] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:03:46] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:04:48] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f5537e8db70: Failed to establish a new connection: [Errno 111] Connection refused): /?spec https://wikitech.wikimedia.org/wiki/Mathoid [15:05:58] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jcrespo) I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime. 
[15:06:28] (03CR) 10Elukey: [C: 03+2] role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [15:06:32] (03PS2) 10Elukey: conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664) [15:07:33] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:08:20] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [15:10:16] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:10:30] (03CR) 10Elukey: [C: 03+2] conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey) [15:11:12] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:12:00] RECOVERY - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid [15:13:02] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:13:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:13:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:13:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [15:16:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 
db2111.codfw.wmnet with reason: Maintenance [15:16:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:16:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:16:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:16:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:17:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:17:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:17:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:17:26] (03PS1) 10Ayounsi: Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/890847 [15:17:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:17:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:18:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:18:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:18:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:19:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance 
[15:19:22] 10SRE, 10Infrastructure Security, 10observability, 10SRE Observability (FY2022/2023-Q3): Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 (10lmata) [15:19:28] (03CR) 10Vgutierrez: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/890847 (owner: 10Ayounsi) [15:19:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:19:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:20:42] RECOVERY - eventgate-analytics-external LVS codfw on eventgate-analytics-external.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [15:22:24] is codfw still fully depooled (for mw I mean)? [15:22:43] If yes, it'd make some of my schema changes much simpler [15:23:02] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris) [15:23:12] (03PS1) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 [15:23:20] (03CR) 10CI reject: [V: 04-1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett) [15:23:24] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris) [15:23:48] RECOVERY - eventgate-analytics LVS codfw on eventgate-analytics.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [15:23:52] Amir1: yes, it's DNS depooled [15:24:02] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Proton [15:24:02] awesome [15:24:11] time to clean up some drifts [15:24:25] Amir1: just tell us if we need to hold on repooling [15:25:07] (03PS2) 10Elukey: admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767) [15:25:07] thanks. I need an hour at most [15:25:30] (03PS1) 10Andrew Bogott: profile::cloudceph::client::rbd_backy: Fix reduce() syntax [puppet] - 10https://gerrit.wikimedia.org/r/890840 [15:26:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [15:26:52] RECOVERY - eventgate-main LVS codfw on eventgate-main.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [15:26:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [15:27:22] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:20] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:29:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: T329664 [15:29:46] T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 [15:32:44] (03PS1) 10Clément Goubert: sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 [15:33:08] (03PS1) 10Jbond: systemd::timer: update services to onshot and set RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/890843 [15:34:40] (03CR) 10Jbond: [C: 04-1] systemd timers: Add the 'After' requirement to the timer module 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott) [15:35:21] claime: I'm done for the offline maint [15:35:57] Amir1: ack [15:36:22] (03PS2) 10Jbond: systemd::timer: update services to onshot and set RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/890843 [15:37:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 (owner: 10Clément Goubert) [15:37:07] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 (owner: 10Clément Goubert) [15:38:33] (03PS2) 10Andrew Bogott: profile::cloudceph::client::rbd_backy: Fix reduce() syntax [puppet] - 10https://gerrit.wikimedia.org/r/890840 [15:38:47] (03Merged) 10jenkins-bot: sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 (owner: 10Clément Goubert) [15:38:55] (03PS2) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 [15:39:42] (03CR) 10CI reject: [V: 04-1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett) [15:39:49] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:14] (03PS3) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 [15:47:21] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:23] 10SRE, 
10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (10Papaul) 05Open→03Resolved Replaced PEM0 everything looks good now . {F36864090} [15:48:53] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:01] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond) [15:49:12] 10SRE, 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10lmata) [15:50:07] (03PS1) 10Clément Goubert: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 [15:50:32] 10SRE, 10Maps, 10Observability-Metrics, 10observability, and 3 others: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10lmata) [15:51:10] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto) [15:52:00] (03CR) 10CI reject: [V: 04-1] sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert) [15:53:23] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)3 le 153.5 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [15:54:38] (03PS3) 10Elukey: admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767) [15:54:57] RECOVERY - Kafka MirrorMaker 
main-codfw_to_main-eqiad average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)3 le 149.4 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [15:55:37] (03PS1) 10David Caro: [cloudceph.client.rbd_backy] Fix wrong reduce call [puppet] - 10https://gerrit.wikimedia.org/r/890845 [15:57:15] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39763/console" [puppet] - 10https://gerrit.wikimedia.org/r/890845 (owner: 10David Caro) [15:58:01] 10SRE, 10Traffic, 10IPv6: Start a pure IPv6 web site for wikimedia services - https://phabricator.wikimedia.org/T330020 (10BCornwall) a:03BCornwall [15:58:19] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service,send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:24] (03PS2) 10Clément Goubert: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 [15:59:58] (03CR) 10Elukey: [C: 03+2] admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:00:03] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond) [16:00:09] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 [16:00:21] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) [16:00:23] 
(03PS4) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 [16:00:25] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) [16:00:33] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) 05Open→03Resolved Perfect! Thanks so much for your magic hands and making this a reality, @Sbenchagra. [16:00:37] (03CR) 10Atieno: [C: 04-1] "Maybe we should add a unit test to check that the dpi is now dynamic but defaults to 150" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) (owner: 10Vlad.shapik) [16:00:52] (03CR) 10Jbond: "this and the others in the chain should be ready for review now" [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [16:01:16] (03CR) 10Atieno: [C: 04-1] Add the ability to specify the default DPI value for PDF files (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) (owner: 10Vlad.shapik) [16:01:31] RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:02:27] RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:02:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:02:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:03:35] (03Abandoned) 10Andrew Bogott: profile::cloudceph::client::rbd_backy: Fix reduce() syntax [puppet] - 10https://gerrit.wikimedia.org/r/890840 (owner: 10Andrew Bogott) [16:05:00] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[16:05:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:06:35] (03CR) 10Krinkle: [C: 03+1] Add static "Cleopatra" page to facilitate synthetic testing of 885362 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: 10Nray) [16:06:57] !log imported libxml2 2.9.4+dfsg1-7+deb10u5+icu67+wmf1 to component/icu67 for buster-wikimedia T329491 [16:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:01] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [16:07:02] XioNoX: yes [16:07:11] RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.23 ms [16:07:26] papaul: any idea what could have caused that? [16:07:33] should be coming back up it was 190 now 180 [16:08:46] XioNoX: i think it was last week pdu maintenance since we took the main mgmt switch went down that is the only thing i can think of right now [16:09:07] ok [16:09:24] (03CR) 10Ladsgroup: [C: 03+2] "Manuel is out, tested with job_cmd and it works like a charm" [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup) [16:09:30] (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup) [16:10:27] !log rebooting mgmt switch in rack a5 [16:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:31] RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:12:47] Amir1: had a little flashback to the pre-Redis jobqueue at WMF. 
https://codesearch.wmcloud.org/core/?q=job_cmd [16:13:10] !log rebooting mgmt switch in rack a7 [16:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:17] (03PS1) 10Jbond: redfish: ensure versions are parsed as packging.version.Version instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 [16:13:54] 10SRE, 10serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) PHP build depends on libxml2, which itself also uses ICU by default. I have patched it to build without ICU for the component/icu67 component, it falls back to iconv internally. [16:14:13] Krinkle: thankfully we don't use mysql as a hammer for everything anymore. The tables are in production but somehow the schema drifted and since they were empty, it didn't make sense to make them depool etc. so I wrote this code :D [16:14:30] oh it actually is that schema? [16:14:41] I thought it was something unrelated that used that same field name [16:14:46] even better :D [16:15:12] I see, you use it as a test case. 
[16:15:14] Nice [16:15:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [16:15:31] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1003.eqiad.wmnet with OS bullseye [16:15:33] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.20 ms [16:15:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [16:15:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T328255)', diff saved to https://phabricator.wikimedia.org/P44704 and previous config saved to /var/cache/conftool/dbconfig/20230221-161552-ladsgroup.json [16:15:57] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [16:16:13] (part of T328255 before the switchover) [16:16:38] (03CR) 10Ladsgroup: [C: 03+2] "please" [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup) [16:17:27] !log rebooting mgmt switch in rack b1 [16:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:11] (03Merged) 10jenkins-bot: auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup) [16:19:18] you see, being nice to jenkins works [16:19:41] RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.81 ms [16:20:17] !log rebooting mgmt switch in rack b3 [16:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:25] (03CR) 10Jbond: [C: 03+1] "lgtm FYI we no longer have an stretch vms in production so this is just around for any slacking cloud stretch hosts" [puppet] - 10https://gerrit.wikimedia.org/r/890816 (owner: 10Muehlenhoff) [16:21:01] (03PS3) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 
(https://phabricator.wikimedia.org/T325153) [16:21:21] (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [16:22:13] RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:22:24] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) You are welcome! I am curious @BCornwall, why did it take more than two years for this task to be completed? [16:22:31] !log rebooting mgmt switch in rack c3 [16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:31] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [16:24:39] Amir1: https://bash.toolforge.org/quip/wyzKdIYBtR_B8fLxgTWk [16:24:42] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [16:26:39] (03CR) 10Muehlenhoff: Fix condition for including haveged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890816 (owner: 10Muehlenhoff) [16:27:13] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: asw-a-codfw management interface unreachable - https://phabricator.wikimedia.org/T330048 (10Papaul) 05Open→03Resolved Rebooting the mgmt switch fix the issue [16:27:38] :D [16:35:54] (03PS9) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [16:36:14] (03CR) 10Jbond: "thanks see responses inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [16:40:05] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.34 ms [16:40:29] looks like I forgot 
to promote group0 wikis [16:41:00] I am doing it now [16:41:24] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890878 (https://phabricator.wikimedia.org/T325587) [16:41:26] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890878 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [16:42:01] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890878 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [16:45:02] 10SRE, 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10andrea.denisse) [16:45:03] (03PS6) 10Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) [16:47:14] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10RLazarus) We decided we'll put these into service after the upcoming DC switchover, so we'll make a plan at the March 6 serviceops meeting. 
[16:47:32] (03PS3) 10Clément Goubert: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 [16:48:06] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1003.eqiad.wmnet with reason: host reimage [16:48:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert) [16:49:57] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: T327991 - None [16:50:00] 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Papaul) 05Open→03Resolved a:03Papaul it has been an hour now no more errors on the interface `` Input errors: Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Bucket drops: 0, [16:50:03] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [16:50:34] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert) [16:51:07] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1003.eqiad.wmnet with reason: host reimage [16:52:35] (03Merged) 10jenkins-bot: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert) [16:53:01] 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10jcrespo) Thank you! 
[16:53:19] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [16:53:54] scap is still going on [16:54:16] (03CR) 10Vgutierrez: "tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [16:55:25] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond) [16:56:54] (03PS7) 10Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) [16:57:17] (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [16:57:29] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter status all services in codfw: None - None [16:57:31] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in codfw: None - None [16:59:10] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 [16:59:16] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw [16:59:23] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=codfw [16:59:31] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs,name=codfw [17:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:04] cwhite: May I have your attention please! Grafana 9 Upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700) [17:00:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [17:00:18] (03CR) 10BBlack: [C: 03+1] "This looks right to me!" [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [17:00:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [17:01:09] somehow `scap` is blocked on deploying to codfw Kubernetes namespaces `mw-api-int` and `mw-web` :-\ [17:01:30] (03PS1) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881 [17:01:35] probably because it's trying to schedule too many replicas [17:02:13] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 (owner: 10Jbond) [17:02:14] at least the helm3 upgrade command has a 600s timeout [17:02:17] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [17:02:19] so I guess that will eventually fail [17:02:57] yeah [17:03:24] (03CR) 10Andrew Bogott: [C: 03+2] [cloudceph.client.rbd_backy] Fix wrong reduce call [puppet] - 10https://gerrit.wikimedia.org/r/890845 (owner: 10David Caro) [17:03:24] So what happened was we had to scale back mw-* 
deployments during the upgrade because some nodes couldn't be reimaged, because a mgmt switch is dead [17:03:39] That was done manually and not reflected in the helmfiles for the services [17:03:58] scap uses the helmfiles, and is currently trying to schedule way too many pods [17:04:12] can we update the helmfiles? [17:04:16] I'll go make a CR [17:04:18] yes [17:04:23] but they won't get picked up rn [17:04:34] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: T327991 - None [17:04:36] isn't there a timer updating them every minute or so? [17:04:38] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [17:04:53] hashar: Yes, but I don't think helmfile picks up changes *during* an applt [17:04:54] apply [17:05:00] ah yeah [17:05:09] well I can always cancel `scap` and start again [17:05:19] * hashar orders another demi [17:05:40] ok, done. codfw wikikube cluster repooled [17:05:40] 17:05:24 K8s deployment to stage production failed: K8s deployment had the following errors: [17:05:40] codfw: Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. [17:06:21] claime: maybe akosiaris change above is related? 
it says it is repooling stuff [17:06:28] so maybe there is no need to mess with the helmfiles [17:07:11] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2023.codfw.wmnet [17:07:21] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2024.codfw.wmnet [17:07:58] !log elukey@puppetmaster1001 conftool action : set/weight=10; selector: name=kubernetes2024.codfw.wmnet [17:07:59] hashar: we do, because we lost capacity [17:08:02] !log elukey@puppetmaster1001 conftool action : set/weight=10; selector: name=kubernetes2023.codfw.wmnet [17:08:30] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:09:00] (03CR) 10Physikerwelt: "Please remember that this needs to be deployed with the related restbase change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [17:09:12] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1003.eqiad.wmnet with OS bullseye [17:09:46] (03Abandoned) 10BCornwall: varnish: Check upload.wm.o for analytics cookies [puppet] - 10https://gerrit.wikimedia.org/r/889846 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [17:10:16] (03CR) 10Physikerwelt: [C: 04-1] "Abandon, please. Outdated." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890350 (owner: 10PipelineBot) [17:10:44] (03PS1) 10Clément Goubert: mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) [17:10:46] (03CR) 10Physikerwelt: [C: 04-1] "Abandon, please. Outdated." [deployment-charts] - 10https://gerrit.wikimedia.org/r/885482 (owner: 10PipelineBot) [17:11:01] (03CR) 10Physikerwelt: [C: 04-1] "Abandon, please. Outdated." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/872978 (owner: 10PipelineBot) [17:13:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert) [17:13:33] (03Abandoned) 10Hashar: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890350 (owner: 10PipelineBot) [17:13:36] (03Abandoned) 10Hashar: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/872978 (owner: 10PipelineBot) [17:13:38] (03Abandoned) 10Hashar: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/885482 (owner: 10PipelineBot) [17:14:21] I have manually killed the helm3 upgrade command [17:14:30] hashar: ok [17:14:45] that was for `mw-api-ext-deploy-codfw.config` [17:14:50] hashar: I'm waiting on CI for the scale, then I'll go ahead and apply that manually [17:14:52] a couple others failed earlier [17:15:00] mw-debug and mw-web [17:15:06] sorry for the mess [17:15:06] :D [17:15:09] Not your fault [17:15:23] We should have immediately transcribed the manual action in code [17:15:28] (03CR) 10Vgutierrez: [C: 03+2] varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [17:15:38] fun thing, if you `kill` the `helm3 upgrade` nothing happens, I had to `kill -9` it [17:16:14] (03PS1) 10Elukey: admin_ng: fix knative settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890883 (https://phabricator.wikimedia.org/T327767) [17:16:22] yes... 
[17:16:24] fpm restarting [17:16:33] Because if you kill it it tries to rollback iirc [17:16:58] ahh [17:17:29] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.24 refs T325587 [17:17:34] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587 [17:17:35] pfiou [17:17:42] (03PS3) 10JHathaway: Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) [17:17:48] (03PS1) 10Cwhite: profile: Re-enable grafana db sync post 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890849 (https://phabricator.wikimedia.org/T317887) [17:18:29] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [17:18:31] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert) [17:18:52] group0 looks good so far [17:19:10] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) Good question. I fear I'm not equipped to give an authoritative answer, but generally low priority combined with ownership doubts (who o... 
[17:20:02] (03PS1) 10Andrea Denisse: rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T329131) [17:20:06] (03PS5) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) [17:20:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] CI runner: skip helm library charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/888275 (owner: 10JHathaway) [17:21:10] (03CR) 10JHathaway: [C: 03+2] CI runner: skip helm library charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/888275 (owner: 10JHathaway) [17:22:54] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) Thank you @BCornwall! Same, please flag any tickets that need my attention. Three months ago, I started managing the [[ https://wikimed... 
[17:23:40] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [17:23:50] (03Merged) 10jenkins-bot: mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert) [17:24:02] group 0 looks good, I am calling it a day [17:25:17] (03CR) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond) [17:25:38] (03PS5) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (https://phabricator.wikimedia.org/T328593) [17:25:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:25:45] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:25:56] (03PS5) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) [17:25:59] !log Grafana 9x upgrade in production complete T317887 [17:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:03] T317887: Upgrade to Grafana 9 - https://phabricator.wikimedia.org/T317887 [17:26:23] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [17:27:39] (HelmReleaseBadStatus) firing: (3) Helm release mw-api-ext/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:27:43] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync [17:27:48] !log cgoubert@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: sync [17:28:17] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [17:28:21] (03CR) 10Jbond: [C: 03+2] redfish: ensure versions are parsed as packging.version.Version instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 (owner: 10Jbond) [17:29:22] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/890476 (owner: 10Jbond) [17:29:52] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/890477 (owner: 10Jbond) [17:29:58] (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:30:18] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:30:36] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:31:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:31:09] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:31:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:31:25] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:31:35] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:31:45] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:32:05] (03Merged) 10jenkins-bot: redfish: ensure versions are parsed as packging.version.Version instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 (owner: 10Jbond) [17:32:39] (HelmReleaseBadStatus) firing: (3) Helm release mw-api-ext/main on k8s@codfw in state pending-rollback - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:33:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [17:33:13] (03PS4) 10JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) [17:33:35] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2186'] [17:33:43] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db2186'] [17:33:52] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2186'] [17:34:15] (03CR) 10JHathaway: [C: 03+2] Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [17:34:17] (03CR) 10JHathaway: [V: 03+2 C: 03+2] Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [17:34:55] (03CR) 10JHathaway: [V: 03+2 C: 03+2] "I think most of the concerns have been addressed, so going ahead with the merge" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [17:35:13] (03CR) 10JHathaway: [C: 03+2] Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [17:35:17] (03CR) 10JHathaway: [V: 03+2 C: 03+2] Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 
10JHathaway) [17:36:23] (03CR) 10JHathaway: [C: 03+2] Add jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway) [17:36:30] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:36:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:37:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [17:37:32] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [17:39:04] (03CR) 10Elukey: [C: 03+2] admin_ng: fix knative settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890883 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [17:41:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:41:55] (03Merged) 10jenkins-bot: Add jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway) [17:42:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [17:42:39] (HelmReleaseBadStatus) resolved: (3) Helm release mw-api-ext/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:44:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2186'] [17:45:26] (03CR) 10Nray: "Thank you! 
Planning to deploy this today during the UTC late backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: 10Nray) [17:45:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [17:45:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [17:47:09] (03CR) 10Dzahn: "ah, right, others already use underscores too, thanks, I'll do that" [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [17:48:30] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:50:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bullseye [17:50:07] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye [17:50:51] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:51:00] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [17:52:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:52:29] (03PS2) 10Dzahn: site: differentiate between both serviceops teams for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/890014 [17:52:58] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:53:07] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:53:39] (03CR) 10Ssingh: [C: 03+2] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/890847 (owner: 10Ayounsi) [17:53:43] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:54:01] !log run authdns-update for Gerrit: 890847. 
repooling codfw [17:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:12] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:55:56] (03PS4) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 [17:57:48] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327991 [17:57:49] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors [17:57:52] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [17:57:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors [17:58:54] jouncebot: nowandnext [17:58:54] For the next 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700) [17:58:54] For the next 0 hour(s) and 1 minute(s): Grafana 9 Upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700) [17:58:54] In 0 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1800) [17:59:49] 10SRE, 10DNS, 10Traffic, 10Chinese-Sites: Let all requests from mainland China will be processed to codfw/esams/drmrs - https://phabricator.wikimedia.org/T330024 (10BCornwall) 05Open→03Declined Hi, @I. Thank you for reporting and for your detailed descriptions. The team's limited capacity prevents the... [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1800) [18:00:09] 10SRE, 10Traffic, 10IPv6: Start a pure IPv6 web site for wikimedia services - https://phabricator.wikimedia.org/T330020 (10BCornwall) 05Open→03Declined p:05Triage→03Lowest Hi, @I. Thank you for reporting and for your detailed descriptions. 
The team's limited capacity prevents the maintenance work re... [18:02:18] (03PS1) 10Hnowlan: api-gateway: add REST gateway LUA CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T329049) [18:02:20] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (owner: 10JHathaway) [18:02:52] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327991 [18:02:56] T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 [18:03:11] (03PS2) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [18:03:14] (03CR) 10CI reject: [V: 04-1] api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [18:04:07] (03CR) 10CI reject: [V: 04-1] api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [18:06:37] (03CR) 10Dzahn: [C: 03+1] gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: 10Jelto) [18:08:55] (03CR) 10Andrew Bogott: [C: 03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond) [18:09:37] (03CR) 10Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott) [18:09:50] (03Abandoned) 10Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott) [18:10:58] (03PS3) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [18:11:25] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10BCornwall) p:05Triage→03Lowest a:03BCornwall [18:18:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10phaultfinder) [18:21:54] !log dancy@deploy1002 Installing scap version "4.38.0" for 564 hosts [18:23:23] (03CR) 10Dzahn: "+1 removed based on "no consensus comment" on ticket, which is from 2016 though" [puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [18:23:55] (03PS3) 10BCornwall: Remove FLoC headers [puppet] - 10https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) [18:26:10] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10BCornwall) a:05BCornwall→03Legoktm [18:26:26] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10BCornwall) 05Open→03In progress [18:30:42] (03CR) 10Krinkle: "Might be worth covering by a test. We're going to seriously rely on this, i.e. 
avoid "oh this seems old, lets remove it" or accidental bre" [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [18:36:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:37:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:38:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2186.codfw.wmnet with OS bullseye [18:38:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:38:23] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye executed with errors: - db2186 (**FAIL... [18:38:36] (03CR) 10BCornwall: [V: 03+2 C: 03+2] "PCC is happy again (with the same caveat of the one unrelated fail):" [puppet] - 10https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) (owner: 10BCornwall) [18:42:06] (03CR) 10Cwhite: [C: 03+2] profile: Re-enable grafana db sync post 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890849 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [18:43:15] (03PS1) 10Slyngshede: idm.wikimedia.org CNAME to idm1001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/890891 [18:44:36] (03CR) 10Slyngshede: "Required to complete production setup, and enable OIDC." 
[dns] - 10https://gerrit.wikimedia.org/r/890891 (owner: 10Slyngshede) [18:46:29] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01382 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:52:29] er [18:53:06] sukhe: seems like all cp* hosts, maybe more [18:53:18] mutante: yeah definitely more than one thing happening here! [18:53:45] it's reload-vcl-failed-frontend [18:54:08] brett: ^ [18:54:13] seems like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/19b9dcb8264cb3fc57e5160d54856ea30d9f897e is failing [18:54:17] https://puppetboard.wikimedia.org/report/cp5031.eqsin.wmnet/3ba20e8655ef419095dc627d4bd28bef357fb054 [18:54:20] looking [18:54:48] returns: ('/etc/varnish/wikimedia_upload-frontend.vcl' Line 1043 [18:55:06] Undefined sub https_deliver_permissionspolicy [18:55:52] yep [18:56:00] brett: modules/varnish/templates/wikimedia-frontend.vcl.erb, line 1124 [18:56:03] 1124 call https_deliver_permissionspolicy; [18:56:11] we should probably remove this? [18:56:55] cwhite: apologies but https://puppetboard.wikimedia.org/report/grafana1002.eqiad.wmnet/68b1c08326f83f7501a26875e12b44cbe144cbcb ? [18:56:58] is this expected? [18:57:35] so yeah, there is the cp hosts failure which is the bigger issue [18:57:37] let me patch it [18:59:24] (03PS1) 10Ssingh: varnish: update wikimedia-frontend.vcl.erb for 19b9dcb8264 [puppet] - 10https://gerrit.wikimedia.org/r/890895 (https://phabricator.wikimedia.org/T312823) [19:00:05] hashar and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1900). 
[19:01:24] mutante: thanks for the debug above [19:01:25] 10SRE, 10API Platform, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10JArguello-WMF) [19:01:30] will wait for brett to review the patch once [19:03:18] sukhe: that failure on grafana1002 has been there for 10 hours. Possibly some manual action with the grafana-grizzly repo? [19:04:12] yep, just flagged it in case it was missed [19:05:24] sukhe: thanks for being on top of it :)) [19:05:28] (03CR) 10Ssingh: [C: 03+2] varnish: update wikimedia-frontend.vcl.erb for 19b9dcb8264 [puppet] - 10https://gerrit.wikimedia.org/r/890895 (https://phabricator.wikimedia.org/T312823) (owner: 10Ssingh) [19:06:56] cwhite: sorry for the confusion, that wasn't your change! the grafana thing was just a coincidence :) [19:08:36] ok, the cp errors should clear up, forcing a puppet run on all cp [19:14:14] ok some other failed runs here [19:14:15] https://puppetboard.wikimedia.org/nodes?status=failed [19:14:19] but the cp ones have cleared up [19:17:03] (03PS2) 10Andrea Denisse: rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) [19:36:25] (03CR) 10MSantos: [C: 03+1] api-gateway: add rest gateway configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/890012 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [19:38:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) [19:40:07] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10BCornwall) @bblack, @Vgutierrez: Is it reasonable to put this header into Varnish itself as per 
https://gerrit.wikimedia.org/r/c/890512? Seems sound... [19:42:36] (03PS1) 10Ladsgroup: mariadb: Update grants to use wikiuser@10.% only [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) [20:00:20] (03PS5) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 [20:04:10] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Dzahn) @Sbenchagra and @BCornwall Thank you soooo much for resolving this. It's great to see long-standing tickets closed. @Sbenchagra regarding w... [20:04:20] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10colewhite) [20:05:16] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (owner: 10JHathaway) [20:10:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2186'] [20:10:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2186'] [20:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44705 and previous config saved to /var/cache/conftool/dbconfig/20230221-201308-root.json [20:26:21] (03PS1) 10Dzahn: doc: fix hostname used in http::blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/890903 (https://phabricator.wikimedia.org/T327973) [20:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44706 and previous config saved to /var/cache/conftool/dbconfig/20230221-202813-root.json [20:32:01] (DatasourceNoData) resolved: (2) - 
https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:35:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bullseye [20:35:07] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye [20:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44707 and previous config saved to /var/cache/conftool/dbconfig/20230221-204317-root.json [20:44:37] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@5edcd7b]: Test deployment of search airflow dags [20:45:45] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@5edcd7b]: Test deployment of search airflow dags (duration: 01m 08s) [20:53:01] (03PS2) 10Dzahn: doc: fix hostname used in http::blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/890903 (https://phabricator.wikimedia.org/T327973) [20:54:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [20:57:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:58:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [20:58:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44708 and previous config saved to /var/cache/conftool/dbconfig/20230221-205822-root.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T2100) [21:00:04] RoanKattouw, nray, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] I can deploy, but I note RoanKattouw you have some patches? Did you want to self-serve and do the others in the queue? [21:00:17] Yes I'd be happy to [21:00:26] I have been remiss doing deployments lately [21:00:32] okay ^^ all yours! [21:00:42] o/ [21:03:46] nray: I will start with your patch [21:04:08] (03PS9) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [21:04:24] (03CR) 10Jbond: "updated, not sure what to do about the remaining pylint issues" [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [21:07:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: 10Nray) [21:07:51] (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [21:07:59] (03Merged) 10jenkins-bot: Add static "Cleopatra" page to facilitate synthetic testing of 885362 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: 10Nray) [21:08:29] !log catrope@deploy1002 Started scap: Backport for [[gerrit:890509|Add static "Cleopatra" page to facilitate synthetic testing of 885362 (T326147 T293303)]] [21:08:36] T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [21:08:36] T293303: Mobile Wikipedia displays blurry thumbnails on hi-res devices - https://phabricator.wikimedia.org/T293303 [21:10:23] !log 
catrope@deploy1002 catrope and nray: Backport for [[gerrit:890509|Add static "Cleopatra" page to facilitate synthetic testing of 885362 (T326147 T293303)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:10:36] nray: Your patch is on the debug servers, please test [21:11:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2186.codfw.wmnet with OS bullseye [21:12:03] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye executed with errors: - db2186 (**FAIL... [21:12:05] RoanKattouw: thank you [21:12:45] (03CR) 10BCornwall: [C: 03+1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett) [21:12:50] @RoanKattouw looks good! [21:14:48] arlolra: Are you here for your backport deployment (Remove wgLinterSubmitterWhitelist)? [21:15:14] yes [21:15:29] This just cleans up the var, it is unused [21:16:07] Is it unused in production currently? 
The change removing action=record-lint in the Linter extension was only just merged a few hours ago, but maybe I'm missing something [21:16:52] https://github.com/wikimedia/mediawiki/blob/master/includes/parser/Parsoid/Config/DataAccess.php#L434-L451 [21:17:16] Parsoid calls the hook directly, the api action stopped with Parsoid/JS [21:17:25] so, yes, unused in production [21:17:32] OK [21:18:07] I'll deploy it as soon as Nick's patch finishes deploying [21:18:34] thanks [21:18:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [21:18:58] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:890509|Add static "Cleopatra" page to facilitate synthetic testing of 885362 (T326147 T293303)]] (duration: 10m 28s) [21:19:05] T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [21:19:06] T293303: Mobile Wikipedia displays blurry thumbnails on hi-res devices - https://phabricator.wikimedia.org/T293303 [21:19:16] (03PS2) 10Catrope: Remove wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) (owner: 10Arlolra) [21:19:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) (owner: 10Arlolra) [21:20:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:21:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:21:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T328255)', diff saved to https://phabricator.wikimedia.org/P44709 and previous config saved to /var/cache/conftool/dbconfig/20230221-212123-ladsgroup.json [21:21:28] T328255: Clean up core schema drifts in 
codfw - https://phabricator.wikimedia.org/T328255 [21:22:57] thank you for your help @RoanKattouw ! [21:23:36] (03PS1) 10Ebernhardson: Deploy analytics-refinery to search airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/890906 (https://phabricator.wikimedia.org/T329870) [21:24:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) [21:24:32] !log catrope@deploy1002 Started scap: Backport for [[gerrit:890496|Remove wgLinterSubmitterWhitelist (T329992)]] [21:24:36] T329992: Remove Linter API action=record-lint - https://phabricator.wikimedia.org/T329992 [21:25:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T328255)', diff saved to https://phabricator.wikimedia.org/P44710 and previous config saved to /var/cache/conftool/dbconfig/20230221-212503-ladsgroup.json [21:25:08] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.00592 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:26:13] !log catrope@deploy1002 arlolra and catrope: Backport for [[gerrit:890496|Remove wgLinterSubmitterWhitelist (T329992)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:29:26] (03PS2) 10Aklapper: redirects.dat: Provide acme-chief/TLS SNI list support in compile_redirects() [puppet] - 10https://gerrit.wikimedia.org/r/514477 (https://phabricator.wikimedia.org/T225096) (owner: 10Vgutierrez) [21:29:58] (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:30:37] RoanKattouw: I think you can continue [21:30:44] Yes, I hit continue [21:35:08] !log catrope@deploy1002 Finished scap: Backport for 
[[gerrit:890496|Remove wgLinterSubmitterWhitelist (T329992)]] (duration: 10m 36s) [21:35:13] T329992: Remove Linter API action=record-lint - https://phabricator.wikimedia.org/T329992 [21:36:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [21:36:19] (03PS3) 10Catrope: Add VueTest to extension-list, add config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) [21:36:23] (03CR) 10TrainBranchBot: "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [21:36:38] (03CR) 10Catrope: [C: 03+2] Add VueTest to extension-list, add config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [21:36:55] RoanKattouw: can confirm that linting is still working https://en.wikipedia.org/wiki/Special:LintErrors/obsolete-tag?namespace=&titlecategorysearch=User%3AArlolra%2Fsandbox&exactmatch=1 [21:36:59] Thanks! [21:37:01] Great! 
[21:37:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:37:26] (03Merged) 10jenkins-bot: Add VueTest to extension-list, add config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [21:37:50] !log catrope@deploy1002 Started scap: Backport for [[gerrit:884361|Add VueTest to extension-list, add config var (T315621)]] [21:37:54] T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621 [21:38:26] (03PS2) 10Aklapper: add sretools.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: 10CDanis) [21:39:23] 10SRE, 10Security-Team, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (10Aklapper) [21:40:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P44711 and previous config saved to /var/cache/conftool/dbconfig/20230221-214009-ladsgroup.json [21:44:17] ger [21:45:46] (03PS1) 10Zabe: Add mobile domain for ombuds.wm.o [dns] - 10https://gerrit.wikimedia.org/r/890908 [21:46:17] 10SRE, 10noc.wikimedia.org, 10serviceops: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (10Aklapper) [21:47:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:48:51] ^ somehow cxserver but perfect example for "alert because no data" [21:54:25] 10SRE, 10Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (10Aklapper) [21:55:14] 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Aklapper) 
[21:55:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P44712 and previous config saved to /var/cache/conftool/dbconfig/20230221-215515-ladsgroup.json [22:01:16] !log catrope@deploy1002 catrope: Backport for [[gerrit:884361|Add VueTest to extension-list, add config var (T315621)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:01:20] T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621 [22:02:51] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2001.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:08:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [22:09:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2001.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:10:08] 10SRE: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (10Aklapper) [22:10:15] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2002.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:10:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T328255)', diff saved to https://phabricator.wikimedia.org/P44713 and previous config saved to /var/cache/conftool/dbconfig/20230221-221021-ladsgroup.json [22:10:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:10:26] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:10:28] 10SRE, 10Patch-Needs-Improvement, 10User-jbond: Restrict GIDs for system users to 499 as the upper 
boundary - https://phabricator.wikimedia.org/T235162 (10Aklapper) [22:10:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44714 and previous config saved to /var/cache/conftool/dbconfig/20230221-221042-ladsgroup.json [22:14:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10IPv6, 10User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10Aklapper) [22:14:39] 10Puppet, 10Infrastructure-Foundations, 10Patch-Needs-Improvement, 10User-jbond: Refactor puppet-merge - https://phabricator.wikimedia.org/T254249 (10Aklapper) [22:14:58] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:884361|Add VueTest to extension-list, add config var (T315621)]] (duration: 37m 07s) [22:15:02] T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621 [22:15:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44715 and previous config saved to /var/cache/conftool/dbconfig/20230221-221529-ladsgroup.json [22:15:34] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:15:41] (03PS5) 10Aklapper: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) (owner: 10Daniel Kinzler) [22:15:54] (03PS12) 10Aklapper: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: 10Hnowlan) [22:16:26] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2002.codfw.wmnet: 
Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:16:32] (03CR) 10CI reject: [V: 04-1] api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) (owner: 10Daniel Kinzler) [22:16:46] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:16:47] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2003.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:18:00] (03PS3) 10Catrope: Enable VueTest in labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) [22:18:06] (03CR) 10Catrope: [C: 03+2] Enable VueTest in labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [22:18:44] (03Merged) 10jenkins-bot: Enable VueTest in labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [22:22:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2003.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:24:21] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1001.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:25:03] (03PS6) 10Aklapper: [WIP] Start migrating pybal to python3 [debs/pybal] - 10https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: 10Ladsgroup) [22:25:40] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-Needs-Improvement: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Aklapper) [22:26:12] (03CR) 10CI reject: [V: 04-1] [WIP] Start migrating pybal to python3 [debs/pybal] - 10https://gerrit.wikimedia.org/r/644041 
(https://phabricator.wikimedia.org/T200319) (owner: 10Ladsgroup) [22:30:31] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1001.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:30:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44716 and previous config saved to /var/cache/conftool/dbconfig/20230221-223036-ladsgroup.json [22:31:27] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1002.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:36:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:37:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1002.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:37:55] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1003.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:38:40] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [22:43:33] !log removing 15 files for legal compliance [22:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1003.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:45:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to 
https://phabricator.wikimedia.org/P44717 and previous config saved to /var/cache/conftool/dbconfig/20230221-224542-ladsgroup.json [22:49:17] 10SRE, 10Privacy Engineering, 10Traffic: Remove obsolete "Permissions-Policy: interest-cohort" header - https://phabricator.wikimedia.org/T312823 (10BCornwall) 05In progress→03Resolved Thanks @ssingh for that followup patch ._. [22:51:40] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/890903/39770/" [puppet] - 10https://gerrit.wikimedia.org/r/890903 (https://phabricator.wikimedia.org/T327973) (owner: 10Dzahn) [22:58:38] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [23:00:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44718 and previous config saved to /var/cache/conftool/dbconfig/20230221-230048-ladsgroup.json [23:00:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:00:53] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [23:01:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44719 and previous config saved to /var/cache/conftool/dbconfig/20230221-230109-ladsgroup.json [23:04:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44720 and previous config saved to /var/cache/conftool/dbconfig/20230221-230454-ladsgroup.json [23:09:17] !log removing 5 files for legal compliance [23:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:44] 10SRE, 10Traffic, 
10User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (10BCornwall) AFAICT we aren't packaging auditd ourselves. It might be easiest to just notify a trigger to re-start the stupid service after install since it looks like Debian isn... [23:15:22] 10SRE, 10Traffic, 10Patch-For-Review: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10BCornwall) 05Open→03Stalled [23:17:09] (03PS1) 10Dzahn: ci::firewall: allow monitoring hosts to check httpd on contint* [puppet] - 10https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) [23:17:40] (03PS2) 10Dzahn: ci::firewall: allow monitoring hosts to check http on contint* [puppet] - 10https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) [23:20:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44721 and previous config saved to /var/cache/conftool/dbconfig/20230221-232000-ladsgroup.json [23:20:15] (03CR) 10Dzahn: [V: 04-1] "sigh, the old discussion about DNS names or IPs in firewall rules strikes again. 
parameter 'monitoring_hosts' index 1 expects a match fo" [puppet] - 10https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [23:23:15] (03PS3) 10Dzahn: ci::firewall: allow monitoring hosts to check http on contint* [puppet] - 10https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) [23:25:30] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/890919/39772/" [puppet] - 10https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [23:25:58] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "this is to fix ": dial tcp 208.80.153.15:443: connect: connection refused"" errors in logstash" [puppet] - 10https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [23:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44722 and previous config saved to /var/cache/conftool/dbconfig/20230221-233506-ladsgroup.json [23:43:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [23:45:28] 10SRE: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10Dzahn) [23:45:35] 10SRE: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10Dzahn) {F36864393} [23:45:54] 10SRE, 10observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (10Dzahn) [23:46:57] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44723 and previous config saved to /var/cache/conftool/dbconfig/20230221-235012-ladsgroup.json [23:50:18] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [23:51:57] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:55:20] (03PS1) 10Dzahn: ci::firewall: allow http monitoring from prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) [23:55:23] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:58:44] (03CR) 10Dzahn: [V: 04-1] "oh my.. back to DNS names, opposite from previous comment:" [puppet] - 10https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn)