[00:39:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:39:33] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921321
[00:39:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921321 (owner: TrainBranchBot)
[00:43:29] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:56:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921321 (owner: TrainBranchBot)
[00:58:15] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:03:35] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337276 (phaultfinder)
[02:00:06] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0200)
[02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:04] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.10 [core] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/921322 (https://phabricator.wikimedia.org/T330216)
[02:08:11] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.10 [core] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/921322 (https://phabricator.wikimedia.org/T330216) (owner: TrainBranchBot)
[02:25:36] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.10 [core] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/921322 (https://phabricator.wikimedia.org/T330216) (owner: TrainBranchBot)
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:02] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0300)
[03:01:25] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/922187 (https://phabricator.wikimedia.org/T330216)
[03:01:27] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/922187 (https://phabricator.wikimedia.org/T330216) (owner: TrainBranchBot)
[03:02:12] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/922187 (https://phabricator.wikimedia.org/T330216) (owner: TrainBranchBot)
[03:02:42] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.10 refs T330216
[03:02:46] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216
[03:33:35] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:38:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:51:46] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.10 refs T330216 (duration: 49m 04s)
[03:51:51] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216
[03:54:05] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.8 (duration: 02m 17s)
[04:09:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:09:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:11:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:12:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49992 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:43:50] (PS1) Naif212: Revert "Enable VE on new wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/921565
[04:43:52] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/921565 (owner: Naif212)
[05:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:18:58] (PS1) Marostegui: Revert "db2124: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922386
[05:19:44] (CR) Marostegui: [C: +2] Revert "db2124: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922386 (owner: Marostegui)
[05:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48463 and previous config saved to /var/cache/conftool/dbconfig/20230523-052014-root.json
[05:33:21] (PS8) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - https://gerrit.wikimedia.org/r/905243
[05:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48464 and previous config saved to /var/cache/conftool/dbconfig/20230523-053519-root.json
[05:45:21] (PS2) KartikMistry: cxserver: Remove Flores MT service [deployment-charts] - https://gerrit.wikimedia.org/r/922064 (https://phabricator.wikimedia.org/T331505)
[05:46:25] (PS1) Marostegui: db-production: Disable es4 writes [mediawiki-config] - https://gerrit.wikimedia.org/r/922376 (https://phabricator.wikimedia.org/T337283)
[05:47:54] * kart_ updating cxserver
[05:48:43] (CR) KartikMistry: [C: +2] cxserver: Remove Flores MT service [deployment-charts] - https://gerrit.wikimedia.org/r/922064 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry)
[05:49:29] (Merged) jenkins-bot: cxserver: Remove Flores MT service [deployment-charts] - https://gerrit.wikimedia.org/r/922064 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry)
[05:50:02] kart_: let me know when I can push mediawiki :)
[05:50:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48465 and previous config saved to /var/cache/conftool/dbconfig/20230523-055024-root.json
[05:56:23] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:56:47] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:56:58] marostegui: sorry, was looking into few things. This would be quick.
[05:59:03] no rush!
[05:59:07] just let me know when done :)
[05:59:45] Sure!
[06:00:00] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0600)
[06:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0600). Please do the needful.
[06:00:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:02:25] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:03:01] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:03:20] marostegui: Done.
[06:03:47] SRE, CX-deployments, MinT, Language-Team (Language-2023-April-June): Remove Flores key from production - https://phabricator.wikimedia.org/T337284 (KartikMistry)
[06:04:44] !log cxserver: Remove Flores MT service (T331505)
[06:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:48] T331505: Self hosted machine translation service - https://phabricator.wikimedia.org/T331505
[06:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48466 and previous config saved to /var/cache/conftool/dbconfig/20230523-060528-root.json
[06:16:18] kart_: thanks!
[06:16:23] (CR) Marostegui: [C: +2] db-production: Disable es4 writes [mediawiki-config] - https://gerrit.wikimedia.org/r/922376 (https://phabricator.wikimedia.org/T337283) (owner: Marostegui)
[06:17:09] (Merged) jenkins-bot: db-production: Disable es4 writes [mediawiki-config] - https://gerrit.wikimedia.org/r/922376 (https://phabricator.wikimedia.org/T337283) (owner: Marostegui)
[06:17:59] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:922376|db-production: Disable es4 writes (T337283)]]
[06:18:04] T337283: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T337283
[06:19:32] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:922376|db-production: Disable es4 writes (T337283)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[06:20:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48467 and previous config saved to /var/cache/conftool/dbconfig/20230523-062033-root.json
[06:26:21] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:922376|db-production: Disable es4 writes (T337283)]] (duration: 08m 21s)
[06:26:28] T337283: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T337283
[06:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48468 and previous config saved to /var/cache/conftool/dbconfig/20230523-063538-root.json
[06:38:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T337283
[06:38:20] T337283: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T337283
[06:38:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T337283
[06:38:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1020 with weight 0 T337283', diff saved to https://phabricator.wikimedia.org/P48469 and previous config saved to /var/cache/conftool/dbconfig/20230523-063836-root.json
[06:44:39] (PS1) Marostegui: es1020: Promote to es4 master [puppet] - https://gerrit.wikimedia.org/r/922453 (https://phabricator.wikimedia.org/T337283)
[06:46:21] (CR) Marostegui: [C: +2] es1020: Promote to es4 master [puppet] - https://gerrit.wikimedia.org/r/922453 (https://phabricator.wikimedia.org/T337283) (owner: Marostegui)
[06:46:53] !log Starting es4 eqiad failover from es1021 to es1020 - T337283
[06:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:58] T337283: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T337283
[06:47:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1020 to es4 primary T337283', diff saved to https://phabricator.wikimedia.org/P48470 and previous config saved to /var/cache/conftool/dbconfig/20230523-064729-root.json
[06:48:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1021 T337283', diff saved to https://phabricator.wikimedia.org/P48471 and previous config saved to /var/cache/conftool/dbconfig/20230523-064820-root.json
[06:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change es1020 weight', diff saved to https://phabricator.wikimedia.org/P48472 and previous config saved to /var/cache/conftool/dbconfig/20230523-064850-root.json
[06:49:46] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:49:47] (PS1) Marostegui: wmnet: Change dns for es4-master [dns] - https://gerrit.wikimedia.org/r/922455 (https://phabricator.wikimedia.org/T337283)
[06:50:02] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:50:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48473 and previous config saved to /var/cache/conftool/dbconfig/20230523-065042-root.json
[06:50:44] (PS1) Marostegui: Revert "db-production: Disable es4 writes" [mediawiki-config] - https://gerrit.wikimedia.org/r/922387
[06:51:24] (CR) Marostegui: [C: +2] Revert "db-production: Disable es4 writes" [mediawiki-config] - https://gerrit.wikimedia.org/r/922387 (owner: Marostegui)
[06:51:30] (CR) Marostegui: [C: +2] wmnet: Change dns for es4-master [dns] - https://gerrit.wikimedia.org/r/922455 (https://phabricator.wikimedia.org/T337283) (owner: Marostegui)
[06:52:20] (Merged) jenkins-bot: Revert "db-production: Disable es4 writes" [mediawiki-config] - https://gerrit.wikimedia.org/r/922387 (owner: Marostegui)
[06:53:02] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:922387|Revert "db-production: Disable es4 writes"]]
[06:54:01] (PS1) Marostegui: es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922456
[06:54:28] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:922387|Revert "db-production: Disable es4 writes"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[06:54:29] (CR) Marostegui: [C: +2] es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922456 (owner: Marostegui)
[06:54:32] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 76 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:59:48] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:00] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:922387|Revert "db-production: Disable es4 writes"]] (duration: 06m 58s)
[07:00:05] Amir1, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:26] * kart_ is here
[07:01:54] (PS1) Marostegui: Revert "es1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922388
[07:02:07] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:02:27] (PS3) KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868)
[07:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:03:34] (CR) TrainBranchBot: "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:04:18] (Merged) jenkins-bot: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:04:46] !log kartik@deploy1002 Started scap: Backport for [[gerrit:921049|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]]
[07:04:50] T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868
[07:05:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48474 and previous config saved to /var/cache/conftool/dbconfig/20230523-070547-root.json
[07:06:22] !log kartik@deploy1002 kartik: Backport for [[gerrit:921049|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[07:06:50] (CR) Marostegui: [C: +2] Revert "es1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922388 (owner: Marostegui)
[07:07:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48475 and previous config saved to /var/cache/conftool/dbconfig/20230523-070713-root.json
[07:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:11:53] (PS1) Marostegui: wmnet: Update es5-master cname [dns] - https://gerrit.wikimedia.org/r/922457 (https://phabricator.wikimedia.org/T337285)
[07:12:31] (PS1) Marostegui: mariadb: Promote es1023 to es5 master [puppet] - https://gerrit.wikimedia.org/r/922458 (https://phabricator.wikimedia.org/T337285)
[07:13:52] (PS1) Marostegui: db-production.php: Disable writes in es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/922459 (https://phabricator.wikimedia.org/T337285)
[07:14:28] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:921049|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] (duration: 09m 42s)
[07:14:33] T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868
[07:14:38] kart_: you done?
[07:15:46] marostegui: Yes.
[07:16:02] scap just finished and I verified the patch.
[07:16:30] awesome thanks
[07:16:35] (CR) Marostegui: [C: +2] db-production.php: Disable writes in es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/922459 (https://phabricator.wikimedia.org/T337285) (owner: Marostegui)
[07:17:22] (Merged) jenkins-bot: db-production.php: Disable writes in es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/922459 (https://phabricator.wikimedia.org/T337285) (owner: Marostegui)
[07:17:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T337285
[07:17:52] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:922459|db-production.php: Disable writes in es5 (T337285)]]
[07:17:53] T337285: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T337285
[07:18:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T337285
[07:19:07] (CR) Marostegui: [C: +2] mariadb: Promote es1023 to es5 master [puppet] - https://gerrit.wikimedia.org/r/922458 (https://phabricator.wikimedia.org/T337285) (owner: Marostegui)
[07:19:19] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:922459|db-production.php: Disable writes in es5 (T337285)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48476 and previous config saved to /var/cache/conftool/dbconfig/20230523-072218-root.json
[07:22:26] (CR) Filippo Giunchedi: [C: +1] x509-bundle: skip popping first if we have an empty list [puppet] - https://gerrit.wikimedia.org/r/922147 (https://phabricator.wikimedia.org/T283001) (owner: Jbond)
[07:25:09] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:922459|db-production.php: Disable writes in es5 (T337285)]] (duration: 07m 16s)
[07:25:14] T337285: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T337285
[07:26:48] (CR) Filippo Giunchedi: [C: +1] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: Btullis)
[07:27:02] (CR) Filippo Giunchedi: [C: +2] thanos: use idle_timeout instead of route timeout for long-running requests [puppet] - https://gerrit.wikimedia.org/r/922144 (https://phabricator.wikimedia.org/T337251) (owner: Filippo Giunchedi)
[07:28:36] (PS1) Filippo Giunchedi: thanos: fixup upstream_response_timeout to be a float [puppet] - https://gerrit.wikimedia.org/r/922461
[07:29:18] (CR) Filippo Giunchedi: [C: +2] thanos: fixup upstream_response_timeout to be a float [puppet] - https://gerrit.wikimedia.org/r/922461 (owner: Filippo Giunchedi)
[07:31:55] (CR) Stevemunene: [C: +2] Create the jupyter notebook config folder [puppet] - https://gerrit.wikimedia.org/r/921885 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[07:32:24] (CR) Stevemunene: [V: +1 C: +2] Create stat user home directory [puppet] - https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[07:32:49] (CR) Stevemunene: [V: +1 C: +2] Set mariadb-client to pull the right version [puppet] - https://gerrit.wikimedia.org/r/922158 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[07:33:15] (CR) Stevemunene: [C: +2] Grant stat1009 access to cloud dumps [puppet] - https://gerrit.wikimedia.org/r/922091 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[07:35:49] (CR) Filippo Giunchedi: "Kindly review, thank you!" [puppet] - https://gerrit.wikimedia.org/r/921023 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[07:35:58] (PS1) Marostegui: es1024: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922462
[07:36:17] !log Starting es5 eqiad failover from es1024 to es1023 T337285
[07:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:22] T337285: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T337285
[07:36:35] (CR) Marostegui: [C: +2] es1024: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922462 (owner: Marostegui)
[07:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1023 to es5 primary T337285', diff saved to https://phabricator.wikimedia.org/P48477 and previous config saved to /var/cache/conftool/dbconfig/20230523-073710-root.json
[07:37:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48478 and previous config saved to /var/cache/conftool/dbconfig/20230523-073722-root.json
[07:37:39] (CR) Marostegui: [C: +2] wmnet: Update es5-master cname [dns] - https://gerrit.wikimedia.org/r/922457 (https://phabricator.wikimedia.org/T337285) (owner: Marostegui)
[07:37:59] (PS1) Marostegui: Revert "db-production.php: Disable writes in es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/922389
[07:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1024 T337285', diff saved to https://phabricator.wikimedia.org/P48479 and previous config saved to /var/cache/conftool/dbconfig/20230523-073841-root.json
[07:39:13] (CR) Marostegui: [C: +2] Revert "db-production.php: Disable writes in es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/922389 (owner: Marostegui)
[07:39:54] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:922389|Revert "db-production.php: Disable writes in es5"]]
[07:41:24] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:922389|Revert "db-production.php: Disable writes in es5"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:41:53] (PS1) Marostegui: Revert "es1024: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922390
[07:42:58] (CR) Hashar: [C: +2] wm-zuul-status: offer to reload on CI completion [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/920708 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[07:43:30] (Merged) jenkins-bot: wm-zuul-status: offer to reload on CI completion [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/920708 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[07:44:02] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e815301]: wm-zuul-status: offer to reload on CI completion | T214068
[07:44:07] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068
[07:44:09] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e815301]: wm-zuul-status: offer to reload on CI completion | T214068 (duration: 00m 07s)
[07:47:14] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:922389|Revert "db-production.php: Disable writes in es5"]] (duration: 07m 19s)
[07:47:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:48:32] (CR) Zabe: "Hey, could you explain why you would like to revert the other change? :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/921565 (owner: Naif212)
[07:51:39] !log hashar@deploy1002 Started deploy [gerrit/gerrit@d151775]: wm-zuul-status: offer to reload on CI completion | T214068
[07:51:43] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068
[07:51:46] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@d151775]: wm-zuul-status: offer to reload on CI completion | T214068 (duration: 00m 07s)
[07:52:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48480 and previous config saved to /var/cache/conftool/dbconfig/20230523-075227-root.json
[07:52:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:00:08] (PS1) KartikMistry: Special:Contribute: Correct language code for Albanian [mediawiki-config] - https://gerrit.wikimedia.org/r/922464 (https://phabricator.wikimedia.org/T327868)
[08:01:50] marostegui: Can I deploy quick fix (typo) if we still have time to do that for backport?
[08:02:22] kart_: yeah, all done from my side
[08:02:25] (CR) Marostegui: [C: +2] Revert "es1024: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922390 (owner: Marostegui)
[08:03:16] Thanks
[08:04:13] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/922464 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[08:05:00] (Merged) jenkins-bot: Special:Contribute: Correct language code for Albanian [mediawiki-config] - https://gerrit.wikimedia.org/r/922464 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[08:05:29] !log kartik@deploy1002 Started scap: Backport for [[gerrit:922464|Special:Contribute: Correct language code for Albanian (T327868)]]
[08:05:34] T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868
[08:07:01] !log kartik@deploy1002 kartik: Backport for [[gerrit:922464|Special:Contribute: Correct language code for Albanian (T327868)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:07:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48481 and previous config saved to /var/cache/conftool/dbconfig/20230523-080732-root.json
[08:11:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48482 and previous config saved to /var/cache/conftool/dbconfig/20230523-081148-root.json
[08:12:39] (PS1) Marostegui: instances.yaml: Remove db1119 from dbctl [puppet] - https://gerrit.wikimedia.org/r/922470 (https://phabricator.wikimedia.org/T337206)
[08:13:05] (CR) Marostegui: [C: +2] instances.yaml: Remove db1119 from dbctl [puppet] - https://gerrit.wikimedia.org/r/922470 (https://phabricator.wikimedia.org/T337206) (owner: Marostegui)
[08:13:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1119 from dbctl T337206', diff saved to https://phabricator.wikimedia.org/P48483 and previous config saved to /var/cache/conftool/dbconfig/20230523-081342-marostegui.json
[08:13:47] T337206: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206
[08:14:07] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:922464|Special:Contribute: Correct language code for Albanian (T327868)]] (duration: 08m 37s)
[08:14:11] T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868
[08:14:27] (CR) Btullis: [C: +1] "Thank you and apologies for the delay in reviewing. Looks good." [puppet] - https://gerrit.wikimedia.org/r/921023 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[08:15:26] PROBLEM - Check systemd state on mw1454 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:49] (CR) Filippo Giunchedi: [C: +2] "No worries, thank you Ben!" [puppet] - https://gerrit.wikimedia.org/r/921023 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[08:20:30] (CR) Btullis: [C: +1] Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779897 (owner: Ottomata)
[08:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48484 and previous config saved to /var/cache/conftool/dbconfig/20230523-082237-root.json
[08:26:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48485 and previous config saved to /var/cache/conftool/dbconfig/20230523-082653-root.json
[08:27:10] (PS1) Marostegui: mariadb: Decommission db1122 [puppet] - https://gerrit.wikimedia.org/r/922471 (https://phabricator.wikimedia.org/T336833)
[08:27:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1122.eqiad.wmnet
[08:29:40] SRE, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi)
[08:32:33] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[08:35:05] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1122.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[08:35:35] (CR) Marostegui: [C: +2] mariadb: Decommission db1122 [puppet] - https://gerrit.wikimedia.org/r/922471 (https://phabricator.wikimedia.org/T336833) (owner: Marostegui)
[08:36:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1122.eqiad.wmnet 
decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [08:36:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:36:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1122.eqiad.wmnet [08:37:22] 10SRE, 10Infrastructure-Foundations, 10Keyholder: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10jbond) [08:37:29] 10ops-eqiad, 10decommission-hardware: decommission db1122.eqiad.wmnet - https://phabricator.wikimedia.org/T336833 (10Marostegui) [08:37:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48486 and previous config saved to /var/cache/conftool/dbconfig/20230523-083741-root.json [08:39:13] (03CR) 10Jaime Nuche: "This is the same as my original patch minus the changes introduced by Eoghan: https://gerrit.wikimedia.org/r/c/operations/puppet/+/920669" [puppet] - 10https://gerrit.wikimedia.org/r/921429 (owner: 10EoghanGaffney) [08:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:41:28] (03PS1) 10Hashar: wm-zuul-status: show reload immediately [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922472 (https://phabricator.wikimedia.org/T214068) [08:41:46] (03CR) 10Hashar: [C: 03+2] wm-zuul-status: show reload immediately [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922472 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [08:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48487 and previous config 
saved to /var/cache/conftool/dbconfig/20230523-084157-root.json [08:42:18] (03Merged) 10jenkins-bot: wm-zuul-status: show reload immediately [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922472 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [08:44:08] !log hashar@deploy1002 Started deploy [gerrit/gerrit@69bc27c]: wm-zuul-status: show reload immediately | T214068 [08:44:13] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068 [08:44:15] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@69bc27c]: wm-zuul-status: show reload immediately | T214068 (duration: 00m 07s) [08:46:56] those gerrit deploys are for a JavaScript UI plugin and should not affect the course of normal operations :] [08:51:54] (03CR) 10David Caro: "Some comments for the irl chat" [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [08:52:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48488 and previous config saved to /var/cache/conftool/dbconfig/20230523-085246-root.json [08:57:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48489 and previous config saved to /var/cache/conftool/dbconfig/20230523-085702-root.json [09:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:09:00] (03PS1) 10Jelto: miscweb: disableDefaultHosts in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922473 (https://phabricator.wikimedia.org/T337041) [09:09:46] (03PS2) 10Jelto: miscweb: disableDefaultHosts in ingress [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/922473 (https://phabricator.wikimedia.org/T337041) [09:10:19] (03PS1) 10JMeybohm: Install helm-state-metrics by default on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/922474 (https://phabricator.wikimedia.org/T334647) [09:12:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48490 and previous config saved to /var/cache/conftool/dbconfig/20230523-091207-root.json [09:17:16] (03CR) 10Hashar: "Thank you for the deployment and sorry I missed some commands/paths that required updates!" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [09:18:32] (03CR) 10Btullis: [C: 03+2] Use the spark3 shuffle jars to yarn on a test host [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [09:24:06] (03CR) 10Hashar: gerrit: remove duplicate $gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [09:25:07] (03PS7) 10EoghanGaffney: Move doc-gitlab rsync endpoint to doc1002 (primary) [puppet] - 10https://gerrit.wikimedia.org/r/921429 [09:25:26] (03CR) 10EoghanGaffney: Move doc-gitlab rsync endpoint to doc1002 (primary) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921429 (owner: 10EoghanGaffney) [09:25:58] (03CR) 10Hashar: "I think the use case for `profile::gerrit::migration` is to apply it to a host that is being provisioned and does not have Gerrit yet henc" [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [09:26:49] (03CR) 10Jaime Nuche: [C: 03+1] Move doc-gitlab rsync endpoint to doc1002 (primary) [puppet] - 10https://gerrit.wikimedia.org/r/921429 (owner: 10EoghanGaffney) [09:27:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: Repooling 
after maintenance', diff saved to https://phabricator.wikimedia.org/P48491 and previous config saved to /var/cache/conftool/dbconfig/20230523-092711-root.json [09:27:33] (03CR) 10EoghanGaffney: [C: 03+2] Move doc-gitlab rsync endpoint to doc1002 (primary) [puppet] - 10https://gerrit.wikimedia.org/r/921429 (owner: 10EoghanGaffney) [09:27:52] (03CR) 10Hashar: Use same php version for doc and integration websites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [09:27:55] (03Abandoned) 10Jaime Nuche: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [09:28:14] (03PS1) 10Marostegui: db_inventory: Change gtid_domain_id to 0 [puppet] - 10https://gerrit.wikimedia.org/r/922476 (https://phabricator.wikimedia.org/T336228) [09:33:15] (03PS1) 10Filippo Giunchedi: grafana: remove varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/922477 (https://phabricator.wikimedia.org/T288196) [09:33:18] (03CR) 10JMeybohm: [C: 03+1] miscweb: disableDefaultHosts in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922473 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [09:33:48] (03CR) 10Marostegui: [C: 03+2] db_inventory: Change gtid_domain_id to 0 [puppet] - 10https://gerrit.wikimedia.org/r/922476 (https://phabricator.wikimedia.org/T336228) (owner: 10Marostegui) [09:34:12] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-coord1001.eqiad.wmnet [09:34:41] (03PS2) 10Effie Mouzeli: admin_ng: Add iPoid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/921704 (https://phabricator.wikimedia.org/T336163) [09:41:16] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1001.eqiad.wmnet [09:42:11] (03CR) 10Filippo Giunchedi: [C: 
03+1] "LGTM! Thank you" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922474 (https://phabricator.wikimedia.org/T334647) (owner: 10JMeybohm) [09:42:14] !log reboot an-test-worker1003.eqiad.wmnet December 2022 Buster reboots T325132 [09:42:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48492 and previous config saved to /var/cache/conftool/dbconfig/20230523-094216-root.json [09:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:41] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1003.eqiad.wmnet [09:49:28] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1003.eqiad.wmnet [09:50:09] !log reboot an-test-master1002.eqiad.wmnet December 2022 Buster reboots T325132 [09:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:18] (03PS1) 10EoghanGaffney: Remove temporary firewall rule for doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922479 [09:51:59] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet [09:53:06] (03CR) 10Jelto: [C: 03+2] miscweb: disableDefaultHosts in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922473 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [09:53:55] (03Merged) 10jenkins-bot: miscweb: disableDefaultHosts in ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922473 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [09:54:26] (03CR) 10Filippo Giunchedi: "Apologies for the late reply! 
Thank you for the feedback" [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [09:54:28] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (10jcrespo) I will run an io load test and see if we get some errors or strange hw logs. [09:54:33] (03PS12) 10Filippo Giunchedi: sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [09:55:24] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:55:32] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:56:28] (03CR) 10CI reject: [V: 04-1] sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [09:56:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:56:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48493 and previous config saved to /var/cache/conftool/dbconfig/20230523-095720-root.json [09:57:31] (03PS13) 10Filippo Giunchedi: sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [09:57:49] (03PS1) 10DCausse: ttm: use new config option to separate readable and writable services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) [09:59:13] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-test-master1002.eqiad.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T1000) [10:02:00] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1001.eqiad.wmnet [10:02:51] (03CR) 10Zabe: [C: 03+2] "I'm gonna merge this now as is. It works well enough for me (in my perception this command does not need to cover every edge case). Fixes " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [10:03:36] (03Merged) 10jenkins-bot: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [10:05:07] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [10:06:56] !log, reboot rdb2009 for kernel upgrades. ORES in codfw will have a 5m downtime. Other things that might be impacted (but won't): changeprop/cpjobqueue/api-gateway/docker-registry/filebackend.php [10:07:06] !log reboot rdb2009 for kernel upgrades. ORES in codfw will have a 5m downtime. 
Other things that might be impacted (but won't): changeprop/cpjobqueue/api-gateway/docker-registry/filebackend.php [10:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:57] (03Abandoned) 10Lucas Werkmeister: Perform rolling restarts on kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [10:08:07] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:08:18] !incidents [10:08:18] 3671 (UNACKED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [10:08:22] !ack 3671 [10:08:23] 3671 (ACKED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [10:08:29] akosiaris: is that expected? [10:08:37] ah, rdb reboot [10:08:48] hah, thanks for the quick action akosiaris [10:08:51] and jayme [10:08:53] I wasn't expecting it fwiw [10:09:03] it didn't page during the last reboot of those hosts [10:09:07] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:20] probably depends on luck? [10:09:27] lol [10:10:03] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1001.eqiad.wmnet [10:10:30] not unreasonable! 
if the unavailability window is small enough [10:10:39] jayme: akosiaris missed the ritual dance before rebooting the server [10:10:54] he didn't please the on-call gods [10:11:01] hence the page [10:11:09] (03PS1) 10Btullis: Fix quoting in the spark3_yarn_shuffle_jar_install.sh script [puppet] - 10https://gerrit.wikimedia.org/r/922483 (https://phabricator.wikimedia.org/T332765) [10:12:39] lol [10:13:07] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:47] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [10:14:07] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:16:30] (03CR) 10Btullis: [C: 03+2] Fix quoting in the spark3_yarn_shuffle_jar_install.sh script [puppet] - 10https://gerrit.wikimedia.org/r/922483 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [10:21:17] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [10:21:32] !log reboot rdb1011 for kernel upgrades. ORES in codfw will have a 5m downtime. 
Other things that might be impacted (but won't): changeprop/cpjobqueue/api-gateway/docker-registry/filebackend.php [10:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:43] let's see, this time around I submitted a silence first [10:22:10] (03PS1) 10Btullis: Re-enable an-test-worker1001 in the analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/922484 (https://phabricator.wikimedia.org/T332765) [10:23:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - ores_443: Servers ores1004.eqiad.wmnet, ores1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:23:43] hello ores [10:23:58] I know you don't like what Alex did [10:24:37] :-) [10:25:04] elukey: when do we also say "goodbye ORES, for good!" ? [10:25:30] from what I figured, ores-legacy is ready to replace the original one? [10:26:35] also, since you are around: I will also use the sre.kafka.roll-restart-brokers cookbook to do kafka-main* hosts, [10:26:44] scream if you want me to abort :-) [10:26:51] akosiaris: sort of, Ilias and me are working on it, but we are close. Sadly, due to some constraints (I can give you more details in pvt) we'll probably need to add support for the redis cache and keep it for a bit longer, but in theory by September we should kill the ores* worker nodes (fingers crossed) [10:27:11] akosiaris: nono it should be fine! 
[10:27:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41262/console" [puppet] - 10https://gerrit.wikimedia.org/r/922484 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [10:28:07] the metrics in grafana are good for both kafkas, the cookbook should work [10:28:18] (03CR) 10Btullis: [V: 03+1 C: 03+2] Re-enable an-test-worker1001 in the analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/922484 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [10:28:45] awesome, thanks! [10:29:32] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [10:31:02] (03PS1) 10Ayounsi: Add Python 3.11 support [software/homer] - 10https://gerrit.wikimedia.org/r/922485 [10:32:20] (03PS1) 10EoghanGaffney: Switch doc host from doc1002 to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) [10:32:45] (03CR) 10CI reject: [V: 04-1] Add Python 3.11 support [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi) [10:33:23] (03CR) 10EoghanGaffney: "This should not be merged until we're ready to do the switchover, see the linked task for conversation." 
[puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [10:33:35] (03PS1) 10Ayounsi: Add Python 3.11 support [cookbooks] - 10https://gerrit.wikimedia.org/r/922488 [10:35:33] (03PS1) 10Ayounsi: Add Python 3.11 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/922489 [10:36:18] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41263/console" [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [10:39:32] (03CR) 10CI reject: [V: 04-1] Add Python 3.11 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/922489 (owner: 10Ayounsi) [10:40:09] !log akosiaris@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [10:40:25] (03PS1) 10Zabe: Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) [10:41:06] (03CR) 10Zabe: [C: 04-2] "we need to wait for rev_comment_id being fully populated in s8 (that will be the case soon)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [10:41:08] (03CR) 10CI reject: [V: 04-1] Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [10:42:50] (03PS2) 10Zabe: Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) [10:43:28] (03PS3) 10Zabe: Start reading from rev_comment_id in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922492 (https://phabricator.wikimedia.org/T299954) [10:47:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:47:27] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:48:52] (03CR) 10Hnowlan: [C: 03+1] helmfile.d: add Lift Wing's revert risk model server to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) (owner: 10Elukey) [10:49:20] (03PS1) 10EoghanGaffney: Move doc.discovery.wmnet to new bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/922493 (https://phabricator.wikimedia.org/T319477) [10:50:19] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:52:53] (03CR) 10Jaime Nuche: [C: 03+1] Remove temporary firewall rule for doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922479 (owner: 10EoghanGaffney) [10:54:19] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:54:45] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.204 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:55:41] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:57:27] 10SRE, 10IRCecho: Restarting systemd-journald breaks ircecho 
service - https://phabricator.wikimedia.org/T216607 (10jbond) 05Open→03Resolved a:03jbond as far as i can tell this is no longer an issue ` lang=shell $ systemctl status ircd.service | grep Active... [10:57:51] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:58:09] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs20 [10:58:09] .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:00:53] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:02:23] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:02:55] (03PS1) 10Btullis: Deploy the spark3 yarn shuffler to the hadoop test workers [puppet] - 10https://gerrit.wikimedia.org/r/922494 (https://phabricator.wikimedia.org/T332765) 
[11:03:49] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.417 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:04:01] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:05:25] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.204 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:05:39] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:05:58] 10SRE, 10SRE-Unowned: tcpircbot-logmsgbot was not able to deliver messages - https://phabricator.wikimedia.org/T284123 (10jbond) [11:06:40] (03CR) 10Cathal Mooney: [C: 03+1] Introduce mgmt_junos variable [homer/public] - 10https://gerrit.wikimedia.org/r/922161 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [11:07:07] (03PS4) 10Cathal Mooney: Disable IPv6 RA generation on spine layer switches [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) [11:08:07] (03CR) 10Cathal Mooney: [C: 03+2] Disable IPv6 RA generation on spine layer switches [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) (owner: 10Cathal Mooney) [11:08:37] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:08:37] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:08:42] (03Merged) 10jenkins-bot: Disable IPv6 RA generation on spine layer switches [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) 
(owner: 10Cathal Mooney) [11:10:01] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:10:15] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:11:41] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.204 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:16:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs20 [11:16:23] .wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:17:02] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41264/console" [puppet] - 10https://gerrit.wikimedia.org/r/922494 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [11:19:03] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:19:17] 10SRE, 10Platform Engineering (Icebox): New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10jbond) >>! In T178839#5310232, @Eevans wrote: >>>! In T178839#5309937, @WDoranWMF wrote: >> @Eevans Do want to move this along or has it stalled? > > I'm not sure how to parse this. I would love to t... 
[11:19:37] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:19:37] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:19:53] (03PS2) 10Btullis: Deploy the spark3 yarn shuffler to the hadoop test workers [puppet] - 10https://gerrit.wikimedia.org/r/922494 (https://phabricator.wikimedia.org/T332765) [11:21:06] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41265/console" [puppet] - 10https://gerrit.wikimedia.org/r/922494 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [11:22:33] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:22:35] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.419 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:22:37] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:22:53] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimed [11:22:53] iki/PyBal [11:23:23] (03CR) 10Btullis: [V: 03+1 C: 03+2] Deploy the spark3 yarn shuffler to the hadoop test workers [puppet] - 10https://gerrit.wikimedia.org/r/922494 
(https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [11:23:31] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.345 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:24:17] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:25:41] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:27:03] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.220 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:27:17] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:28:43] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.530 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:28:51] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:30:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [11:30:15] kitech.wikimedia.org/wiki/PyBal [11:30:17] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook 
[11:30:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [11:30:33] kitech.wikimedia.org/wiki/PyBal [11:37:55] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:38:00] inflatador: dcausse: do you happen to know what's wrong with wdqs? 
[11:38:09] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:41:17] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:41:57] jayme: dcausse was saying on #-search that he was away from a computer [11:42:20] Sorry gehel was saying that, dcausse is at lunch [11:42:41] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 2.778 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:43:41] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:44:14] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) [11:44:20] 10SRE, 10Infrastructure-Foundations, 10netops: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) 05Open→03Resolved Merged patch based on option 5, but using hostname rather than any other var to determine device class.... 
[11:44:53] (03PS1) 10Ottomata: Allow access to conf cluster zookeeper from wikikube and dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/922497 (https://phabricator.wikimedia.org/T331283) [11:45:03] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.205 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:45:49] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:47:22] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41266/console" [puppet] - 10https://gerrit.wikimedia.org/r/922497 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [11:47:55] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Allow access to conf cluster zookeeper from wikikube and dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/922497 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [11:48:35] (03CR) 10Cathal Mooney: [C: 03+1] "Hey thanks for the info Fillipo appreciate the answers. Ping me if I can help with anything otherwise LGTM." [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [11:50:27] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:50:51] (03CR) 10Ottomata: "Okay, I'm inclined to merge this. Or, do you think we should wait?" [puppet] - 10https://gerrit.wikimedia.org/r/779897 (owner: 10Ottomata) [11:51:11] o/ looking into wdqs issue [11:51:19] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:51:41] dcausse: cool, thanks. 
It seems that 2009 2010 and 2011 have issues 202* seem fine [11:52:06] lmk if I can help with something [11:52:39] dcausse: thanks ! I'll be around in 15-20' [11:52:43] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.235 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:53:31] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 9.164 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:54:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:54:03] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:54:17] (03CR) 10AikoChou: [C: 03+1] "Q - is this config adding both revertrisk-language-agnostic and revertrisk-multilingual model to api-gateway? or only the multilingual mod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) (owner: 10Elukey) [11:54:53] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:54:55] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 3.748 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:54:55] wdqs issue: at a glance it's someone with a bad query hammering the service in codfw [11:55:16] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [11:55:43] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [11:56:08] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [11:56:26] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [11:56:27] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: 
HTTP/1.1 200 OK - 691 bytes in 9.477 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:57:31] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:58:13] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [11:58:13] kitech.wikimedia.org/wiki/PyBal [11:59:01] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 5.435 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:00:58] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41267/console" [puppet] - 10https://gerrit.wikimedia.org/r/921348 (owner: 10Filippo Giunchedi) [12:01:19] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:01:19] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:01:54] (03CR) 10MVernon: [C: 03+1] "looks reasonable to me, thank you" [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [12:04:19] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:04:19] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: 
HTTP/1.1 200 OK - 689 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:04:38] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [12:05:25] 10SRE-Unowned, 10noc.wikimedia.org: Investigate using php-fpm for noc - https://phabricator.wikimedia.org/T337302 (10jbond) [12:05:38] 10SRE-Unowned, 10noc.wikimedia.org: Investigate using php-fpm for noc - https://phabricator.wikimedia.org/T337302 (10jbond) p:05Triage→03Low [12:05:43] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [12:05:43] kitech.wikimedia.org/wiki/PyBal [12:07:50] (03PS1) 10Filippo Giunchedi: icinga: remove swift alerts, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/922499 (https://phabricator.wikimedia.org/T288196) [12:10:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:10:43] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:11:04] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove swift alerts, moved to alertmanager 
[puppet] - 10https://gerrit.wikimedia.org/r/922499 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [12:11:09] (03PS2) 10Filippo Giunchedi: icinga: remove swift alerts, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/922499 (https://phabricator.wikimedia.org/T288196) [12:11:56] (03CR) 10Filippo Giunchedi: [V: 03+2] icinga: remove swift alerts, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/922499 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [12:12:17] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:13:14] 10SRE-swift-storage, 10Observability-Alerting, 10Patch-For-Review: Port swift prometheus-based alerts from icinga to alertmanager - https://phabricator.wikimedia.org/T312765 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done [12:13:43] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.195 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:13:43] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.214 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:15:15] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:16:06] (03CR) 10Filippo Giunchedi: [C: 03+2] "Merging, there's 6+ months of data now" [puppet] - 10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [12:17:42] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: add 'ensure' for ::server [puppet] - 10https://gerrit.wikimedia.org/r/921348 (owner: 10Filippo Giunchedi) [12:17:47] (03PS2) 10Filippo Giunchedi: prometheus: add 'ensure' for ::server [puppet] - 
10https://gerrit.wikimedia.org/r/921348 [12:18:35] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:19:45] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:19:49] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.195 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:21:27] 10SRE, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10jbond) [12:21:49] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:21:52] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:23:03] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:23:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10jbond) i Is there documentation on what should and should not be put in misc? Also should we add some type of alert for this. it seems like... 
[12:26:09] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:26:56] 10SRE, 10IRCecho, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875 (10jbond) 05Open→03Resolved a:03jbond Setting this to resolved, please reopen if there are still actions [12:28:39] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:29:06] 10SRE: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100 (10jbond) Is anyone able to comment whether this task is still valid and if any more context is needed. Further, what is the best team to route this to if it is still valid. 
thanks [12:29:07] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:29:26] (03PS1) 10Jelto: service: add host to miscweb http probe [puppet] - 10https://gerrit.wikimedia.org/r/922500 (https://phabricator.wikimedia.org/T300171) [12:31:37] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.219 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:31:58] (03PS2) 10EoghanGaffney: Remove temporary firewall rule for doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922479 [12:32:00] (03PS2) 10EoghanGaffney: Switch doc host from doc1002 to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) [12:32:02] (03PS1) 10EoghanGaffney: Create entry for new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/922501 (https://phabricator.wikimedia.org/T334435) [12:32:25] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:33:37] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:35:19] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 2.137 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:36:20] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10fgiunchedi) I don't think there's any documentation per-se. However looking at this task with fresh eyes I'd say (short of rethinking the who... 
[12:36:51] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:37:01] (03CR) 10Elukey: helmfile.d: add Lift Wing's revert risk model server to api-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) (owner: 10Elukey) [12:38:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs20 [12:38:09] .wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:38:11] (03Abandoned) 10Elukey: role::syslog::centralserver: tune benthos config [puppet] - 10https://gerrit.wikimedia.org/r/919268 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [12:39:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, 
wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [12:39:57] kitech.wikimedia.org/wiki/PyBal [12:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:40:07] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:40:32] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10jbond) [12:40:36] (03PS23) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [12:41:00] (03CR) 10JMeybohm: [C: 03+1] service: add host to miscweb http probe [puppet] - 10https://gerrit.wikimedia.org/r/922500 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:41:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49992 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:42:35] (03CR) 10Jelto: [C: 03+2] service: add host to miscweb http probe [puppet] - 10https://gerrit.wikimedia.org/r/922500 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:42:39] 10SRE, 10SRE Observability, 10Patch-For-Review, 10User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10elukey) 05Open→03Resolved a:03elukey From 
https://github.com/benthosdev/benthos/issues/1806 it seems... [12:43:09] (03CR) 10JMeybohm: Allow access to conf cluster zookeeper from wikikube and dse-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922497 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [12:43:15] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:43:29] 10SRE, 10SRE-Unowned, 10User-AKlapper: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10jbond) > You can't add a TXT entry for www when www exists as a CNAME. indeed as per RFC1034 s3.6.2, > If a CNAME RR is present at a node, no other data should b... [12:44:09] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:44:43] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.999 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:46:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe1004.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:46:27] (03PS1) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) [12:46:31] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers thanos-fe1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:47:07] (03PS2) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) 
[12:47:09] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.985 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:48:18] (03PS39) 10JMeybohm: Make kubernetes_clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [12:48:20] (03PS3) 10JMeybohm: Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) [12:48:22] (03PS13) 10JMeybohm: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [12:48:24] (03PS2) 10JMeybohm: profile::imagecatalog migrate from user token to client cert [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) [12:48:26] (03PS7) 10JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) [12:48:28] (03PS7) 10JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) [12:49:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs20 [12:49:21] .wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:49:21] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:49:31] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.162 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:50:17] (03PS2) 10EoghanGaffney: Create entry for new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/922501 (https://phabricator.wikimedia.org/T334435) [12:50:51] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 4.701 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:51:31] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Allow access to conf cluster zookeeper from wikikube and dse-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922497 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [12:52:39] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:53:24] (03CR) 10Herron: [C: 03+1] prometheus: soft-disable 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921347 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [12:54:22] 10SRE, 10SRE-Unowned, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10jbond) [12:54:45] (03CR) 10JMeybohm: Make kubernetes_clusters the central place for k8s config (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:55:37] (03CR) 10JMeybohm: Make kubernetes_clusters the central place for k8s config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:55:39] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.179 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:56:09] (03PS1) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:56:46] (03PS3) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) [12:57:40] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41269/console" [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [12:57:48] 10SRE, 10SRE-Unowned, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10jbond) @joanna_borun are you able to raise this in the sre managers meeting to see how best to route the task. [12:58:15] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T1300) [13:00:06] No Gerrit patches in the queue for this window AFAICS. 
[13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T1300) [13:00:15] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:01:13] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:01:38] (03PS1) 10Ssingh: depool eqiad (emergency patch, do not merge until required) [dns] - 10https://gerrit.wikimedia.org/r/922508 (https://phabricator.wikimedia.org/T322937) [13:01:46] * TheresNoTime is in a meeting, but sees theres nothing in the window [13:02:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Release-Engineering-Team (Radar), 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023 (10jbond) [13:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:03:08] 10SRE, 10SRE-Unowned: should we make privatewiki list available to puppet without maintaining two lists? 
- https://phabricator.wikimedia.org/T152100 (10jbond) [13:03:33] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:03:33] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:03:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41273/console" [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:03:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41272/console" [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:04:57] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.283 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:04:59] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.248 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:05:41] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:46] 10SRE-swift-storage, 10Commons: Renaming file on Commons doesn't work: inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T337231 (10Aklapper) [13:06:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked 
down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:06:25] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:06:38] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10jbond) 05Open→03Declined > In other words we might as well decline the task SGTM thanks [13:08:09] (03PS2) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [13:09:09] (03PS1) 10Hoo man: Restore targets declarations temporarily [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922394 (https://phabricator.wikimedia.org/T336956) [13:09:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41275/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [13:09:40] (03PS1) 10Hoo man: Restore targets declarations temporarily [extensions/Wikibase] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922395 (https://phabricator.wikimedia.org/T336956) [13:09:47] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:11:09] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:11:13] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: 
HTTP/1.1 200 OK - 689 bytes in 1.228 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:11:23] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [13:12:20] (03CR) 10AikoChou: [C: 03+1] helmfile.d: add Lift Wing's revert risk model server to api-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) (owner: 10Elukey) [13:13:28] Just a heads up, I'm going to "steal" the empty SWAT for two Wikibase backports [13:13:41] Going to also put that on wikitech in a second too [13:13:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922394 (https://phabricator.wikimedia.org/T336956) (owner: 10Hoo man) [13:13:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922395 (https://phabricator.wikimedia.org/T336956) (owner: 10Hoo man) [13:14:13] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 5.187 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:14:15] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:14:29] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:15:55] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.191 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:16:17] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick review" [puppet] - 10https://gerrit.wikimedia.org/r/921347 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [13:16:26] (03PS2) 10Filippo Giunchedi: prometheus: soft-disable 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921347 (https://phabricator.wikimedia.org/T288196) [13:17:07] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:17:23] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:17:53] ^ too many depools? 
[13:18:55] (03CR) 10JMeybohm: [C: 04-1] flink-operator custom basic egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [13:19:16] {"wdqs2004.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:19] {"wdqs2007.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:22] {"wdqs2009.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:25] {"wdqs2010.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:28] {"wdqs2011.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:31] {"wdqs2012.codfw.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:34] {"wdqs2021.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:35] bblack: yes the service is having problem with a bad client, can't identify it [13:19:37] {"wdqs2022.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [13:19:40] ^ looks like 2 inactive, one depooled, and I'm guessing a few of the others are failing healthchecks [13:19:47] ok [13:20:31] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:23:20] 
(03Abandoned) 10Hashar: extdist: switch git URLs from gerrit-replica to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/920761 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [13:23:37] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:26:51] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:26:53] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:29:51] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.188 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:30:35] 10SRE, 10IRCecho: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203 (10jbond) 05Open→03Resolved a:03jbond closing as ircecho s ran as the irc user ` $ systemctl cat ircecho.service | grep User User=irc ` [13:30:55] (03Merged) 10jenkins-bot: Restore targets declarations temporarily [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922394 (https://phabricator.wikimedia.org/T336956) (owner: 10Hoo man) [13:31:11] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.261 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:31:21] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.210 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:31:58] (03Merged) 10jenkins-bot: Restore targets declarations temporarily [extensions/Wikibase] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922395 (https://phabricator.wikimedia.org/T336956) (owner: 10Hoo man) [13:32:25] !log hoo@deploy1002 Started scap: Backport 
for [[gerrit:922394|Restore targets declarations temporarily (T336956)]], [[gerrit:922395|Restore targets declarations temporarily (T336956)]] [13:32:30] T336956: Mobile version of Wikidata broken on some pages - https://phabricator.wikimedia.org/T336956 [13:33:54] !log hoo@deploy1002 hoo: Backport for [[gerrit:922394|Restore targets declarations temporarily (T336956)]], [[gerrit:922395|Restore targets declarations temporarily (T336956)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:34:20] 10SRE, 10SRE-Unowned, 10IRCecho: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053 (10jbond) [13:34:44] 10SRE, 10SRE-Unowned, 10IRCecho: Move ircecho config file to be YAML - https://phabricator.wikimedia.org/T95054 (10jbond) [13:34:46] (03CR) 10Vgutierrez: [C: 03+1] grafana: remove varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/922477 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [13:36:13] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:36:36] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [13:36:56] 10SRE: Create a simple puppet role for setting up a singlenode kubernetes install - https://phabricator.wikimedia.org/T138799 (10jbond) 05Open→03Resolved a:03jbond going to close this task i think things have move on significantly since the task was created and we now have k8s clusteres and puppet code [13:37:57] (03CR) 10Hokwelum: [C: 03+1] "checks okay :-)" [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [13:38:25] 10SRE, 10LDAP: Update/add/remove LDAP entries based on changes to data.yaml - 
https://phabricator.wikimedia.org/T142819 (10jbond) [13:39:10] Grr… trying to figure out why it doesn't work and then realizing the Wikimedia Debug extension is no longer active [13:39:15] (03CR) 10ArielGlenn: [C: 03+2] introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [13:40:00] 10SRE, 10WMF-General-or-Unknown: Production error message (when servers are down) points users to donate link which is likely to produce the same error message - https://phabricator.wikimedia.org/T154627 (10jbond) 05Open→03Resolved a:03jbond im going to close this task as i belive the message has since c... [13:40:03] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:40:15] I wonder if it has a timeout after which it disables itself [13:41:03] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.802 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:41:27] (03PS2) 10ArielGlenn: add a custom xml dumps config file for testing new nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) [13:41:45] (03CR) 10Elukey: [C: 03+2] helmfile.d: add Lift Wing's revert risk model server to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) (owner: 10Elukey) [13:42:57] (03PS1) 10AikoChou: ml-services: fix revertrisk-wikidata model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/922510 (https://phabricator.wikimedia.org/T333125) [13:43:17] !log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host releases1003.eqiad.wmnet [13:43:18] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [13:43:43] PROBLEM - Check systemd state on cumin1001 is 
CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:10] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [13:44:18] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [13:44:57] (03CR) 10CDanis: Set NetworkProbeLimit cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:45:07] 10SRE-Unowned, 10IRCecho: Add flood protection to the ircecho bot (icinga-wm) - https://phabricator.wikimedia.org/T163698 (10jbond) [13:45:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:45:14] !log hoo@deploy1002 Finished scap: Backport for [[gerrit:922394|Restore targets declarations temporarily (T336956)]], [[gerrit:922395|Restore targets declarations temporarily (T336956)]] (duration: 12m 49s) [13:45:18] T336956: Mobile version of Wikidata broken on some pages - https://phabricator.wikimedia.org/T336956 [13:45:33] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM releases1003.eqiad.wmnet - eoghan@cumin1001" [13:46:32] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafkamon1002.eqiad.wmnet [13:46:39] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM releases1003.eqiad.wmnet - eoghan@cumin1001" [13:46:39] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:39] !log eoghan@cumin1001 START - Cookbook 
sre.dns.wipe-cache releases1003.eqiad.wmnet on all recursors [13:46:42] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) releases1003.eqiad.wmnet on all recursors [13:47:04] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM releases1003.eqiad.wmnet - eoghan@cumin1001" [13:47:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10jbond) [13:48:19] (03PS1) 10Volans: sre.SREBatchRunnerBase: simplify overriding action [cookbooks] - 10https://gerrit.wikimedia.org/r/922511 [13:50:00] (03PS1) 10Ilias Sarantopoulos: ORES: add model versions configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) [13:50:07] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 693 bytes in 5.778 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:50:15] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM releases1003.eqiad.wmnet - eoghan@cumin1001" [13:50:16] !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host releases1003.eqiad.wmnet [13:50:39] !log herron@cumin1001 START - Cookbook sre.dns.netbox [13:52:00] (03PS2) 10Volans: sre.SREBatchRunnerBase: simplify overriding action [cookbooks] - 10https://gerrit.wikimedia.org/r/922511 [13:52:17] (03Abandoned) 10Elukey: utils: fix k8s-ingress-ml-serve discovery config [dns] - 10https://gerrit.wikimedia.org/r/920709 (owner: 10Elukey) [13:52:19] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:52:49] (03CR) 10Elukey: [C: 03+1] Install 
helm-state-metrics by default on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/922474 (https://phabricator.wikimedia.org/T334647) (owner: 10JMeybohm) [13:53:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:53:15] (03CR) 10Elukey: [C: 03+2] ml-services: fix revertrisk-wikidata model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/922510 (https://phabricator.wikimedia.org/T333125) (owner: 10AikoChou) [13:54:51] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafkamon1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - herron@cumin1001" [13:54:51] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.234 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:54:53] (03PS2) 10Jelto: microsites: remove 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761060 (owner: 10Dzahn) [13:54:55] (03PS2) 10Jelto: trafficserver: switch 15.wikipedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/761062 (owner: 10Dzahn) [13:55:55] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafkamon1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - herron@cumin1001" [13:55:55] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafkamon1002.eqiad.wmnet [13:55:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, 
wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [13:55:59] kitech.wikimedia.org/wiki/PyBal [13:56:50] uh? [13:56:51] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafkamon2002.codfw.wmnet [13:57:05] (03CR) 10Jelto: "I rebased the change and updated the replacement to https://miscweb.discovery.wmnet:30443 (similar to static-bugzilla)" [puppet] - 10https://gerrit.wikimedia.org/r/761062 (owner: 10Dzahn) [13:57:42] anybody working on WDQS? [13:57:52] !log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host releases2003.codfw.wmnet [13:57:53] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [13:58:19] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/761063 (owner: 10Dzahn) [13:58:34] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder) [13:58:47] vgutierrez: I am but no luck so far identifying who's causing this [13:58:47] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:59:44] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922501 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [13:59:44] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht 
[13:59:44] kitech.wikimedia.org/wiki/PyBal [14:00:07] dcausse: that would be query.wikidata.org? [14:00:38] vgutierrez: yes query.wikidata.org but hit from codfw, eqiad is serving traffic correctly [14:00:45] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:00:55] dcausse: so codfw, ulsfo or eqsin [14:00:56] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 4.119 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:00:59] !log herron@cumin1001 START - Cookbook sre.dns.netbox [14:01:14] vgutierrez: wdqs is only codfw&eqiad [14:01:29] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM releases2003.codfw.wmnet - eoghan@cumin1001" [14:01:53] dcausse: right, so the PoPs on codfw, ulsfo and eqsin target codfw and eqiad/esams/drmrs target eqiad in the service is active/active [14:02:06] (03PS1) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:02:09] ok [14:02:47] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM releases2003.codfw.wmnet - eoghan@cumin1001" [14:02:47] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:47] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache releases2003.codfw.wmnet on all recursors [14:02:48] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but 
pooled ht [14:02:48] kitech.wikimedia.org/wiki/PyBal [14:02:50] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) releases2003.codfw.wmnet on all recursors [14:03:16] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM releases2003.codfw.wmnet - eoghan@cumin1001" [14:03:19] (03PS1) 10Hashar: contint: Jenkins slave > agent [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) [14:04:14] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [14:04:22] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM releases2003.codfw.wmnet - eoghan@cumin1001" [14:04:22] !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host releases2003.codfw.wmnet [14:04:41] topranks: +irb-1031.ssw1-e1-eqiad entry is appearing in my netbox run (for unrelated host decom) ok to commit that? [14:05:16] herron: this a dns entry? 
[14:05:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [14:05:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:05:31] (03PS2) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:05:36] Yes should be ok, I just added and I'm running the DNS cookbook at the moment [14:05:42] topranks: yes dns entry [14:05:46] ok thx [14:05:46] actually cookbook just finished, but yes should be ok [14:05:48] !log herron@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:05:49] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts kafkamon2002.codfw.wmnet [14:05:51] thanks [14:05:54] (03CR) 10CI reject: [V: 04-1] P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:56] (03PS3) 10Jelto: trafficserver: switch 15.wikipedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/761062 (https://phabricator.wikimedia.org/T337041) (owner: 10Dzahn) [14:09:10] (03CR) 10Volans: [C: 04-1] "Small nit in the code, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [14:09:42] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers 
wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:10:06] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:32] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:14:17] (03PS3) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:14:34] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:56] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:15:40] (03PS4) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) [14:15:46] (03CR) 10Ottomata: flink-operator custom basic egress (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [14:16:28] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:16:30] (03PS1) 10Cathal Mooney: Puppet additions for ssw1-e1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/922520 (https://phabricator.wikimedia.org/T322937) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:19:12] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:20:14] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: remove varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/922477 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [14:20:20] (03PS2) 10Filippo Giunchedi: grafana: remove varnish-aggregate-client-status-codes [puppet] - 
10https://gerrit.wikimedia.org/r/922477 (https://phabricator.wikimedia.org/T288196) [14:20:47] (03PS5) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) [14:22:02] (03CR) 10EoghanGaffney: [C: 03+2] Create entry for new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/922501 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [14:22:11] (03PS24) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:22:58] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2012.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2012.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimed [14:22:58] iki/PyBal [14:23:24] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2012.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled ht [14:23:24] kitech.wikimedia.org/wiki/PyBal [14:26:04] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:28:15] (03CR) 10Dzahn: "ok then. I will do it." [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [14:29:10] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:30:22] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:30:27] (03CR) 10Ayounsi: [C: 03+1] Puppet additions for ssw1-e1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/922520 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [14:30:29] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:31:13] (03CR) 10Dzahn: "well, it's not supposed to be dependent on the gerrit class though" [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [14:31:14] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:32:06] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:32:40] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2011.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled ht [14:32:40] kitech.wikimedia.org/wiki/PyBal [14:32:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:32:47] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:32:57] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:32:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:33:38] (03CR) 10Jaime Nuche: [C: 03+1] "On the CI primary I found one file in /var/lib/jenkins in the jenkins group but not owned by jenkins:" [puppet] - 
10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:33:53] (03PS4) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:34:38] (03CR) 10Cathal Mooney: [C: 03+2] Puppet additions for ssw1-e1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/922520 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [14:36:17] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host releases1003.eqiad.wmnet with OS bullseye [14:36:52] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:38:48] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 414 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:39:25] (03CR) 10Ssingh: "PCC says "Error: Could not find resource 'Service["pdns-recursor.service", "dnsdist.service"]' in parameter 'require'", which is expected " [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:39:54] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:42:42] (03CR) 10Dzahn: [C: 03+1] "after doing zuul just recently this seems fine to me. I'll handle it." 
[puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:43:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:45:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 (10jbond) [14:45:45] (03PS6) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) [14:46:14] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (10Jclark-ctr) 05Open→03Resolved Replaced failed drives [14:46:20] (03PS7) 10Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) [14:46:42] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage [14:46:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:47:15] (03CR) 10Cathal Mooney: [C: 03+2] Expose additional link information to Homer templates in wmf-netbox.py (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [14:47:25] (03PS7) 10Ottomata: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) [14:48:09] (03CR) 10Dzahn: [C: 03+2] httpbb: move tests for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761063 (owner: 
10Dzahn) [14:48:44] (03PS2) 10Dzahn: httpbb: move tests for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761063 (https://phabricator.wikimedia.org/T300171) [14:48:48] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10jbond) [14:49:52] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage [14:50:17] (03CR) 10Dzahn: [V: 03+2] httpbb: move tests for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761063 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [14:51:30] !log removed imagemagick 8:6.9.10.23+dfsg-2.1+deb10u1+wmf1 from apt.wikimedia.org/buster-wikimedia now that the Thumbor spec tests have been upgraded to match latest patches [14:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] (03CR) 10Dzahn: "did you have a specific time in mind to do this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/761062 (https://phabricator.wikimedia.org/T337041) (owner: 10Dzahn) [14:52:57] (03CR) 10JMeybohm: [C: 03+1] flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [14:53:23] (03CR) 10Ottomata: [C: 03+2] flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [14:55:22] (03PS1) 10Jbond: install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) [14:55:41] (03Merged) 10jenkins-bot: flink-operator custom basic egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/922505 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [14:56:35] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [14:57:04] (03PS40) 10JMeybohm: Make kubernetes::clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [14:57:06] (03PS4) 10JMeybohm: Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) [14:57:08] (03PS14) 10JMeybohm: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [14:57:10] (03PS3) 10JMeybohm: profile::imagecatalog migrate from user token to client cert [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) [14:57:12] (03PS8) 10JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) [14:57:14] (03PS8) 10JMeybohm: 
prometheus::k8s switch staging-codfw to client cert auth [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) [14:57:23] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:57:29] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:57:35] (03CR) 10CI reject: [V: 04-1] Make kubernetes::clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:58:16] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:58:23] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:58:39] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [14:59:03] (03PS1) 10Cathal Mooney: Release v0.6.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/922560 (https://phabricator.wikimedia.org/T328313) [15:00:09] !log akosiaris@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-eqiad cluster: Reboot kafka nodes [15:00:33] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [15:00:36] (03CR) 10JMeybohm: Make kubernetes::clusters the central place for k8s config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:02:02] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:02:18] (03PS4) 10Hashar: jenkins: switch to fixed uid/gid 924 [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) [15:02:42] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases1003.eqiad.wmnet with OS bullseye [15:03:07] 
10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [15:03:14] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host releases2003.codfw.wmnet with OS bullseye [15:03:24] (03CR) 10Hashar: "The tricks are:" [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:03:46] (03PS3) 10Hokwelum: add a custom xml dumps config file for testing new nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:09:21] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [15:10:23] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [15:10:33] (03PS1) 10Giuseppe Lavagetto: Patch helm defaults in helmfile during CI tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/922563 [15:11:12] (03CR) 10JMeybohm: [C: 03+2] Install helm-state-metrics by default on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/922474 (https://phabricator.wikimedia.org/T334647) (owner: 10JMeybohm) [15:13:15] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 29): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41282/console" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:14:09] (03Merged) 10jenkins-bot: Install helm-state-metrics by default on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/922474 (https://phabricator.wikimedia.org/T334647) (owner: 10JMeybohm) [15:14:20] !log otto@deploy1002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:14:23] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:14:35] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [15:16:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [15:17:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Hghani) Hi my public key is: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIE9YrjjXUnDX0d8mk62yYBR6Pcflz/1pw/tkoMTeSrM0 hghani-ctr@wikimedia.org [15:19:40] (03PS1) 10Jdrewniak: Enable Vector "Zebra" AB test on Hebrew wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922564 (https://phabricator.wikimedia.org/T335972) [15:19:52] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:20:12] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:20:25] (03CR) 10CI reject: [V: 04-1] Enable Vector "Zebra" AB test on Hebrew wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922564 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [15:20:38] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage [15:20:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [15:21:17] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:21:33] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:21:35] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:21:46] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [15:21:50] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:21:52] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:22:01] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:22:02] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:22:20] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:22:26] (03PS2) 10Jdrewniak: Enable Vector "Zebra" AB test on Hebrew wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922564 (https://phabricator.wikimedia.org/T335972) [15:22:52] (03CR) 10Hokwelum: [C: 03+1] "looks good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:24:04] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage [15:25:36] (03CR) 10ArielGlenn: [C: 03+2] add a custom xml dumps config file for testing new nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:26:45] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10MatthewVernon) @dcausse just to be clear, do you still need the associated ms-swift account (which I think is `search_backup`?)? 
[15:28:48] (03PS19) 10ArielGlenn: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [15:30:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [15:32:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [15:34:07] (03PS1) 10Hashar: No more require puppetlabs_spec_helper/rake_tasks directly [puppet] - 10https://gerrit.wikimedia.org/r/922565 [15:35:39] (03CR) 10Hashar: "That is a follow up to https://gerrit.wikimedia.org/r/c/operations/puppet/+/889990 which addresses the use case of running rake from a mod" [puppet] - 10https://gerrit.wikimedia.org/r/922565 (owner: 10Hashar) [15:35:56] (03CR) 10Hashar: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/922565 is the follow up for all of the modules/*/Rakefile." [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [15:38:11] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases2003.codfw.wmnet with OS bullseye [15:38:17] (03Abandoned) 10Cathal Mooney: Release v0.6.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/922560 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [15:40:16] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse) @MatthewVernon yes we'd like to keep it, we don't use it on a regular basis but it might happen that we need to dump the elasticsearch content to s... 
[15:40:40] (03PS1) 10Ottomata: mw-page-content-change-enrich - disable ZK based HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/922568 (https://phabricator.wikimedia.org/T331283) [15:41:33] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10MatthewVernon) OK, thanks for confirming; I think that means there's no swift-related action needed on this ticket. [15:42:46] (03PS2) 10Ottomata: mw-page-content-change-enrich - disable ZK based HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/922568 (https://phabricator.wikimedia.org/T331283) [15:43:49] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - disable ZK based HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/922568 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [15:44:27] (03Merged) 10jenkins-bot: mw-page-content-change-enrich - disable ZK based HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/922568 (https://phabricator.wikimedia.org/T331283) (owner: 10Ottomata) [15:45:13] !log stop pybal on lvs1018: T322937 [15:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:18] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [15:45:35] (03PS1) 10David Martin: Declare Metrics Platform stream for wikifunctionswiki on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922569 (https://phabricator.wikimedia.org/T336722) [15:45:42] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:46:01] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:46:52] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 [15:47:32] (03Abandoned) 10Hashar: 
gerrit: stop managing /srv/gerrit/plugins/lfs [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:48:35] (03CR) 10Hashar: "You can even drop profile::gerrit::lfs_dir entirely and hardcode it in the plugin config file modules/gerrit/templates/lfs.config.erb." [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [15:48:50] 10SRE-tools, 10Infrastructure-Foundations, 10Datacenter-Switchover: Hide `systemctl is-enabled` output in switchover cookbooks - https://phabricator.wikimedia.org/T285520 (10jbond) [15:49:16] 10SRE-tools, 10Infrastructure-Foundations, 10Datacenter-Switchover: --live-test mode of switchdc cookbook should auto downtime "High average GET latency" alerts - https://phabricator.wikimedia.org/T285521 (10jbond) [15:49:44] (03PS1) 10Ottomata: Rename page content change enrich error stream to match convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922570 (https://phabricator.wikimedia.org/T336656) [15:50:10] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:50:18] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:50:22] ^ expected [15:50:45] (03CR) 10Ottomata: [C: 03+2] Rename page content change enrich error stream to match convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922570 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [15:51:31] (03Merged) 10jenkins-bot: Rename page content change enrich error stream to match convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922570 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [15:53:40] (03CR) 10EoghanGaffney: [C: 03+2] Add nginx logs for docker-registry 
host to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/919350 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [15:53:55] oh just noticed this lvs lock, sorry just tried to deploy a config change [15:54:00] ottomata: sorry :) [15:54:08] nice it stopped me. should I let it sit (wait up to 10 minutes)? or ctrl-c it [15:54:16] no sorries! my fault [15:54:30] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [15:54:38] ottomata: is your deploy urgent? [15:55:27] nope [15:55:36] ok then, I will ping you when this is done, thanks [15:55:41] k thanks [15:56:37] !log moving lvs1018 connection to rack E1 from lsw1-e1-eqiad to ssw1-e1-eqiad T322937 [15:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:42] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [15:58:38] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse) @MatthewVernon just double checked with the team and it seems that the account we requested for these recovery procedures is `search_platform` (c.f... [15:58:47] (03CR) 10Clément Goubert: [C: 03+1] Patch helm defaults in helmfile during CI tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/922563 (owner: 10Giuseppe Lavagetto) [15:59:43] jbond rvl: Anything going out during the puppet deploy window? I’d like to get a quick security update out for PrivateSettings.php if I can… [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:00:29] sbassett: deploys are blocked for LVS maintenance work if that helps [16:02:47] sbassett: nope, Puppet window is empty, it's all yours [16:06:47] sukhe: ok, do we know when that will wrap up? [16:07:59] sbassett: shouldn't be longer than 20 mins or so, toprank.s and jclark-ct.r are working on it [16:08:29] Ok, sounds good. I’ll check in again then. [16:09:43] sbassett: happy to ping you when done! [16:10:54] (03CR) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [16:13:04] sukhe: That would be great, thanks! [16:19:22] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:19:28] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:19] (03PS1) 10Ayounsi: LVS: deny cadvisor access from the world [homer/public] - 10https://gerrit.wikimedia.org/r/922571 [16:20:32] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [16:22:34] (03CR) 10Ssingh: [C: 03+1] "Thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/922571 (owner: 10Ayounsi) [16:22:54] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 (duration: 36m 02s) [16:22:59] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [16:23:40] 10SRE: Misplaced file in python3-service-checker - https://phabricator.wikimedia.org/T284220 (10jbond) 05Open→03Resolved a:03jbond i checked the most recent version (0.2.1) and this no longer seems to be an issue, please reopen if i missed something [16:23:48] (03CR) 10Ayounsi: [C: 03+2] LVS: deny cadvisor access from the world [homer/public] - 10https://gerrit.wikimedia.org/r/922571 (owner: 10Ayounsi) [16:24:23] ottomata: sbassett: all done on the LVS maintenance, deploy lock is lifted [16:24:23] (03Merged) 10jenkins-bot: LVS: deny cadvisor access from the world [homer/public] - 10https://gerrit.wikimedia.org/r/922571 (owner: 10Ayounsi) [16:24:27] thank you! 
[16:25:40] (03PS20) 10Hokwelum: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:28:12] (03PS1) 10Kimberly Sarabia: Turn on the A/B test for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922572 (https://phabricator.wikimedia.org/T336969) [16:30:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be204[0-3].codfw.wmnet - https://phabricator.wikimedia.org/T337011 (10Jhancock.wm) [16:30:25] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be204[0-3].codfw.wmnet - https://phabricator.wikimedia.org/T337011 (10Jhancock.wm) 05Open→03Resolved [16:30:56] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Jhancock.wm) [16:31:38] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Jhancock.wm) [16:31:40] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: EventStreamConfig - Rename page content change enrich error stream to match convention - T336656 (duration: 06m 58s) [16:31:46] T336656: mediawiki-page-content-change-enrichment checkpoints should be stored in Swift - https://phabricator.wikimedia.org/T336656 [16:32:45] Thanks [16:32:46] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:32:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: 
Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:34:00] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:34:41] (03CR) 10Volans: [C: 04-1] "With the current implementation won't work as expected. See inline for the details." [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [16:34:43] 10SRE, 10Data-Platform-SRE, 10Patch-For-Review: udplog package and puppet disagree on what /etc/udp2log should be - https://phabricator.wikimedia.org/T276622 (10jbond) [16:35:23] 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: debmonitor: Traceback in the apt hook when purging a package in rc state - https://phabricator.wikimedia.org/T273269 (10jbond) [16:35:42] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:35:52] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:36:25] (03CR) 10Clément Goubert: "This change is ready for review." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [16:36:56] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.199 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:37:40] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:37:51] (03PS1) 10Ottomata: mw-page-content-change-enrich - use proper config for flink state.backend.type [deployment-charts] - 10https://gerrit.wikimedia.org/r/922576 (https://phabricator.wikimedia.org/T336656) [16:38:09] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10User-MoritzMuehlenhoff: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951 (10jbond) [16:38:48] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.342 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:38:55] inflatador, ryankemper: issues again with WDQS? 
[16:39:06] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.194 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:39:49] (03CR) 10Ottomata: mediawiki: Change naming scheme for resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [16:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:40:21] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - use proper config for flink state.backend.type [deployment-charts] - 10https://gerrit.wikimedia.org/r/922576 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [16:40:26] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.213 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:40:30] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:40:35] gehel: looking [16:41:55] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Homer Release v0.6.2 with updated wmf-plugin - cmooney@cumin1001 [16:42:06] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:42:18] oh fun [16:42:44] !log Deployed updated security mitigation for T336027 [16:42:46] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:48] We'll need to identify the offender(s) and ban them [16:42:51] (Unf, I now have one more) [16:43:13] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:43:17] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:43:30] Rolling restart will only temporarily mitigate the problem, so let's not bother with that. Likewise for depooling codfw (will just redirect the bad queries to eqiad and topple that too in all likelihood) [16:43:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Homer Release v0.6.2 with updated wmf-plugin - cmooney@cumin1001 [16:43:32] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:43:38] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2012.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2012.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:44:56] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.573 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:48:02] (03CR) 10Jbond: "lgtm thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/922511 (owner: 10Volans) [16:48:18] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers 
wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:48:20] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:49:22] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:49:38] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:49:58] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:50:39] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:50:56] !log Deployed updated security mitigation for T336027, part 2 [16:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:22] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:52:38] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:53:22] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:55:27] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond 
Ndibe) [16:57:46] (03PS1) 10Majavah: ssh: do not try to ca sign host keys if ca is not available [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) [16:58:07] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:58:17] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:58:19] (03CR) 10CI reject: [V: 04-1] ssh: do not try to ca sign host keys if ca is not available [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [16:59:02] (03PS2) 10Majavah: ssh: do not try to ca sign host keys if ca is not available [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) [16:59:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:59:20] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T1700) [17:00:24] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:00:30] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers 
(exit_code=0) for Kafka main-eqiad cluster: Reboot kafka nodes [17:00:48] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:00:58] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41286/console" [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [17:01:08] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.960 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:02:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:02:32] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [17:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:03:14] !log Deployed updated security mitigation for T336027 and T333140 [17:03:17] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [17:03:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:46] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:04:54] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.203 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:05:20] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:05:20] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:05:25] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:06:44] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.188 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:08:29] (03CR) 10Eevans: cassandra: add support for version 4.1.1 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [17:08:38] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:08:38] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:11:36] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:13:00] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 
bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:13:26] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:14:52] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs_80: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:16:00] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:17:26] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.315 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:17:58] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.194 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:18:00] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:19:34] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:21:06] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:22:32] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.667 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:23:30] (03PS2) 10Kimberly Sarabia: Turn on the A/B test for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922572 (https://phabricator.wikimedia.org/T336969) [17:24:04] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.203 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:25:44] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:25:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:27:28] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:28:54] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:30:24] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 6.387 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:30:28] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:31:50] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:31:54] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.195 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:32:07] (03CR) 10Andrea Denisse: [C: 03+1] arclamp: switch redis server to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [17:32:20] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) [17:32:47] (03CR) 10Andrea Denisse: [C: 03+1] arclamp: switch redis server to arclamp1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [17:33:42] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:35:08] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.789 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:35:36] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:36:38] (03PS9) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - 10https://gerrit.wikimedia.org/r/905243 [17:36:52] (03PS1) 10DCausse: query_service: fix logback config [puppet] - 10https://gerrit.wikimedia.org/r/922584 [17:37:31] (03CR) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services (037 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [17:38:10] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:38:14] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:39:29] (03Abandoned) 10Ssingh: depool eqiad (emergency patch, do not merge until required) [dns] - 10https://gerrit.wikimedia.org/r/922508 (https://phabricator.wikimedia.org/T322937) (owner: 10Ssingh) [17:40:10] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.188 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:41:06] (03CR) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [17:41:08] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:42:06] (03PS1) 10Btullis: Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) [17:42:29] (03CR) 10CI reject: [V: 04-1] Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [17:42:32] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 
bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:42:50] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:43:44] (03PS5) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [17:44:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:23] (03CR) 10Bking: [C: 03+1] query_service: fix logback config [puppet] - 10https://gerrit.wikimedia.org/r/922584 (owner: 10DCausse) [17:44:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:44:56] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:45:37] (03PS5) 10Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) [17:45:55] (03CR) 10Ssingh: "Tried removing the ".service" part but PCC is still unhappy about this https://puppet-compiler.wmflabs.org/output/922514/41287/ but I gues" [puppet] - 
10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [17:46:13] (03CR) 10Slyngshede: sre.ganeti.makevm call reimage after VM creation (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [17:46:55] (03CR) 10Bking: [C: 03+2] query_service: fix logback config [puppet] - 10https://gerrit.wikimedia.org/r/922584 (owner: 10DCausse) [17:47:26] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:47:26] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:48:37] (03CR) 10Ssingh: "Looks good but because we decided in the Traffic meeting that we should do both the systemd and Puppet-level bindings that we should do I6" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [17:56:40] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:56:42] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers 
wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:00:04] ^demon and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T1800). [18:01:04] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:01:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:01:20] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:02:28] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.199 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:04:22] (03PS14) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [18:04:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:04] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:07:30] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:07:32] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:07:48] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:08:58] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:09:12] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:10:22] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:10:36] PROBLEM - PyBal backends health check on lvs2010 
is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2011.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:10:38] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2011.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:11:08] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 3.037 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:12:46] !log [WDQS] T337327 New rule in place to ban potential source of WDQS codfw outage. Rolling restart will be done in a couple minutes to [attempt to] restore service availability [18:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:24] PROBLEM - Confd vcl based reload on cp4039 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:28] PROBLEM - Confd vcl based reload on cp1089 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:28] PROBLEM - Confd vcl based reload on cp3052 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:32] PROBLEM - Confd vcl based reload on cp4041 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:32] PROBLEM - Confd vcl based reload on cp3060 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:32] PROBLEM - Confd vcl based reload on cp2027 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:32] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:14:32] PROBLEM - Confd vcl based reload on cp5021 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:36] PROBLEM - Confd vcl based reload on cp4038 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:38] PROBLEM - Confd vcl based reload on cp2031 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:40] PROBLEM - Confd vcl based reload on cp2037 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:40] PROBLEM - Confd vcl based reload on cp2039 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:48] PROBLEM - Confd vcl based reload on cp4044 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:50] PROBLEM - Confd vcl based reload on cp4043 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:52] PROBLEM - Confd vcl based reload on cp1087 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:52] PROBLEM - Confd vcl based reload on cp1081 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:54] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:56] PROBLEM - Confd vcl based reload on cp3062 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:58] PROBLEM - Confd vcl based reload on cp5020 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:14:58] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:58] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:58] PROBLEM - Confd vcl based reload on cp5024 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:14:58] PROBLEM - Confd vcl based reload on cp5023 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:00] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:15:04] PROBLEM - Confd vcl based reload on cp1077 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:08] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:12] PROBLEM - Confd vcl based reload on cp3054 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:14] PROBLEM - Confd vcl based reload on cp5017 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:14] PROBLEM - Confd vcl based reload on cp5022 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:14] PROBLEM - Confd vcl based reload on cp5018 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:15:14] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:15:16] PROBLEM - Confd vcl based reload on cp4037 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:16] PROBLEM - Confd vcl based reload on cp1085 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:22] PROBLEM - Confd vcl based reload on cp4042 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:26] PROBLEM - Confd vcl based reload on cp2041 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:28] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:32] PROBLEM - Confd vcl based reload on cp5019 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:34] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:34] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:15:38] PROBLEM - Confd vcl based reload on cp3056 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:46] PROBLEM - Confd vcl based reload on cp4040 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:48] PROBLEM - Confd vcl based reload on cp1079 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:50] PROBLEM - Confd vcl based reload on cp3050 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:52] PROBLEM - Confd vcl based reload on cp1083 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:52] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:54] PROBLEM - Confd vcl based reload on cp3058 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:56] PROBLEM - Confd vcl based reload on cp2033 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:56] PROBLEM - Confd vcl based reload on cp3064 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:56] PROBLEM - Confd vcl based reload on cp2029 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:58] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:15:58] PROBLEM - Confd vcl based reload on cp1075 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:15:59] oh [18:16:03] what's happening here [18:16:15] This is almost certainly us (recent requestctl change). Rolling back now [18:16:18] this is probably related to the requestctl change ryankemper and inflatador and I just made, we're rolling back [18:16:21] ah probably [18:16:25] ok thanks! [18:16:48] RECOVERY - Confd vcl based reload on cp3054 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:16:48] RECOVERY - Confd vcl based reload on cp5022 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:16:48] RECOVERY - Confd vcl based reload on cp5018 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:16:48] RECOVERY - Confd vcl based reload on cp5017 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:16:48] requestctl might have inserted some malformed vcl in which case I'm definitely curious about what and why but let's rollback first and ask questions after [18:16:50] !log [WDQS] Rolled back requestctl rule [18:16:50] RECOVERY - Confd vcl based reload on cp4037 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:16:50] RECOVERY - Confd vcl based reload on cp1085 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:56] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 3.177 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:16:56] RECOVERY - Confd vcl based reload on cp4042 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:17:02] RECOVERY - Confd vcl based reload on cp2041 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:08] RECOVERY - Confd vcl based reload on cp5019 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:10] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:10] RECOVERY - Confd vcl based reload on cp6013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:12] RECOVERY - Confd vcl based reload on cp3056 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:22] RECOVERY - Confd vcl based reload on cp4040 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:22] RECOVERY - Confd vcl based reload on cp1079 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:26] RECOVERY - Confd vcl based reload on cp3050 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:26] RECOVERY - Confd vcl based reload on cp1083 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:26] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:30] RECOVERY - Confd vcl based reload on cp3058 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:30] RECOVERY - Confd vcl based reload on cp2033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:30] RECOVERY - Confd vcl based reload on cp3064 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:17:32] RECOVERY - Confd vcl based reload on cp2029 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:32] RECOVERY - Confd vcl based reload on cp1075 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:34] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:34] RECOVERY - Confd vcl based reload on cp4039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:36] RECOVERY - Confd vcl based reload on cp1089 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:40] RECOVERY - Confd vcl based reload on cp3052 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:42] RECOVERY - Confd vcl based reload on cp2027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:42] RECOVERY - Confd vcl based reload on cp4041 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:42] RECOVERY - Confd vcl based reload on cp3060 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:42] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:44] RECOVERY - Confd vcl based reload on cp5021 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:48] RECOVERY - Confd vcl based reload on cp4038 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:48] RECOVERY - Confd vcl based reload on cp2031 is OK: reload-vcl successfully ran 0h, 1 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:17:50] RECOVERY - Confd vcl based reload on cp2037 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:17:50] RECOVERY - Confd vcl based reload on cp2039 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:00] RECOVERY - Confd vcl based reload on cp4044 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:00] RECOVERY - Confd vcl based reload on cp4043 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:00] RECOVERY - Confd vcl based reload on cp1087 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:02] RECOVERY - Confd vcl based reload on cp1081 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:06] RECOVERY - Confd vcl based reload on cp3062 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:08] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:08] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:08] RECOVERY - Confd vcl based reload on cp5020 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:08] RECOVERY - Confd vcl based reload on cp5024 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:08] RECOVERY - Confd vcl based reload on cp5023 is OK: reload-vcl successfully ran 0h, 1 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:18:14] RECOVERY - Confd vcl based reload on cp1077 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:16] I'm around for backup train duties. [18:18:18] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:18:18] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:18:22] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:18:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:18:38] ooh don't flood yourself off icinga-wm [18:19:27] return (synth(403, "This request comes from an IP range that is banned due to (possibly) sending queries that led to outage in wdqs codfw. Please message us at noc@wikimedia.org, perhaps using a subject like "WDQS Ban".")); [18:19:30] yep that'll do it [18:19:36] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:19:52] aah. surprised though that this is the first time! 
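(Editor's sketch, not part of the channel log.) The synth(403, "...") message quoted above contains bare double quotes around WDQS Ban, which end the VCL string early and make the generated VCL fail to compile. The Python below is a minimal illustration of one way a generator could guard against that. It is not requestctl's actual code: the function name vcl_string_literal is hypothetical, and it assumes Varnish's long-string form {"..."}, which (as far as the VCL documentation describes) may contain double quotes but not the two-character sequence "}.

```python
def vcl_string_literal(text: str) -> str:
    """Render text as a VCL string literal, or raise ValueError.

    Hypothetical helper for illustration only; it is not taken from
    requestctl. Assumes VCL short strings ("...") cannot hold a double
    quote or newline, and long strings ({"..."}) end at the first "}.
    """
    if "\x00" in text:
        raise ValueError("NUL characters cannot appear in a VCL string")
    # Safe for the short form: no quote and no newline inside.
    if '"' not in text and "\n" not in text:
        return '"' + text + '"'
    # The long form tolerates quotes, but not its own terminator.
    if '"}' in text:
        raise ValueError('text contains the long-string terminator "}')
    return '{"' + text + '"}'


if __name__ == "__main__":
    # Shortened stand-in for the ban message; the quoted subject is the
    # part that broke the short-string form.
    message = 'Please message us, perhaps using a subject like "WDQS Ban".'
    print("return (synth(403, " + vcl_string_literal(message) + "));")
```

Rejecting the input outright instead of switching to the long form would also match the "escape those quotes or reject them" options raised in the channel; which one fits requestctl is left open here.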
[18:19:53] requestctl should either escape those quotes or reject them, I'll file a task after [18:20:06] as in, I would have expected someone to have used quotes before [18:20:11] yeah, same! [18:20:21] thanks ryankemper for exposing that, it's good we found it now [18:21:18] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:21:34] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:24:28] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:25:43] (03PS1) 10Andrew Bogott: add fake backy2 postgres passwords [labs/private] - 10https://gerrit.wikimedia.org/r/922589 (https://phabricator.wikimedia.org/T332734) [18:25:54] (03PS1) 10Andrew Bogott: backy2: install postgres on backy2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/922590 (https://phabricator.wikimedia.org/T332734) [18:25:56] (03PS1) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [18:26:22] !log [WDQS] T337327 Deployed new, hopefully-working rule after addressing previous syntax error (unescaped `"`). 
See `/srv/private` commit `6e2f5ab19427902994bb9d03d28277252f021474` [18:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:37] (03CR) 10CI reject: [V: 04-1] backy2: install postgres on backy2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/922590 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [18:28:48] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [18:28:56] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (10jcrespo) a:03Jhancock.wm Hi, @Jhancock.wm I left the host with high io load for a few hours, but didn't see any issue:https://grafana.wikimedia.org/goto/BnkmEIQ4k?orgId=1 {F37030470}... [18:29:02] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:29:10] !log bking@cumin1001 rolling restart of codfw wdqs public hosts T337327 [18:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:36] We just applied a new requestctl rule for WDQS. 
The service has calmed down, but we're still observing closely in case the rule does not help [18:37:13] (03PS2) 10Andrew Bogott: backy2: install postgres on backy2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/922590 (https://phabricator.wikimedia.org/T332734) [18:37:15] (03PS2) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [18:39:26] (03PS3) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [18:40:15] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [18:42:01] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [18:44:19] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] add fake backy2 postgres passwords [labs/private] - 10https://gerrit.wikimedia.org/r/922589 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [18:49:33] (03PS4) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [18:51:03] (03PS1) 10Ottomata: mw-page-content-change-enrich - bump image to v1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922593 (https://phabricator.wikimedia.org/T330507) [18:51:50] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [18:56:57] (03PS5) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [18:57:20] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - bump image to v1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922593 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [18:58:00] (03Merged) 10jenkins-bot: mw-page-content-change-enrich - bump image to v1.17.0 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/922593 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [18:58:33] (03PS1) 10Ottomata: mw-page-content-change-enrich - use bucket names created by T330693 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922595 (https://phabricator.wikimedia.org/T336656) [18:59:17] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [19:02:16] (03PS6) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [19:04:13] (03PS1) 10Andrew Bogott: Add more fake backy2 passwords [labs/private] - 10https://gerrit.wikimedia.org/r/922597 [19:04:41] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [19:05:23] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add more fake backy2 passwords [labs/private] - 10https://gerrit.wikimedia.org/r/922597 (owner: 10Andrew Bogott) [19:08:04] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - use bucket names created by T330693 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922595 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [19:09:47] (03PS7) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [19:09:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney) [19:09:56] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:10:04] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:10:10] 10SRE, 10Infrastructure-Foundations, 10netops: Expose sub-rated circuit speeds to Homer templates - 
https://phabricator.wikimedia.org/T328313 (10cmooney) 05Open→03Resolved This is now modeled in Netbox in the 'upstream_speed' field of the z-end of a circuit termination. The one service we have where it... [19:11:32] (03PS8) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [19:12:08] (03CR) 10Gmodena: mw-page-content-change-enrich - use bucket names created by T330693 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922595 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [19:12:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) Completed today 1 E1 lvs1018 lsw1-e1-eqiad xe-0/0/47 ssw1-e1-eqiad xe-0/0/33 [19:14:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) Reset idrac. 
still unable to login to an-worker1150 Fixed psu1 on 49 [19:16:50] (03PS9) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 [19:16:52] (03PS1) 10Ottomata: mw-page-content-change-enrich - set proper s3.access-key for swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/922600 (https://phabricator.wikimedia.org/T336656) [19:16:54] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922601 (https://phabricator.wikimedia.org/T330216) [19:16:56] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922601 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [19:17:17] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (owner: 10Andrew Bogott) [19:17:43] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - set proper s3.access-key for swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/922600 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [19:17:55] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922601 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [19:18:15] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:27] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:19:25] 10SRE, 10Infrastructure-Foundations, 10netops: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Resolved→03Open [19:19:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 
(10cmooney) [19:20:15] (03PS3) 10Andrew Bogott: backy2: install postgres on backy2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/922590 (https://phabricator.wikimedia.org/T332734) [19:20:38] (03CR) 10CI reject: [V: 04-1] backy2: install postgres on backy2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/922590 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [19:20:56] (03Abandoned) 10Andrew Bogott: backy2: install postgres on backy2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/922590 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [19:21:05] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:21:08] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:21:48] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - use bucket names created by T330693 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922595 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [19:24:08] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:24:08] (03PS10) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) [19:24:11] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:24:15] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1022 [19:24:31] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [19:25:19] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.10 
refs T330216 [19:25:23] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [19:25:31] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1022 [19:25:52] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1023 [19:27:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1023 [19:27:15] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1024 [19:27:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host dbproxy1024 [19:27:36] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1024 [19:27:38] !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host dbproxy1024 [19:28:44] (03PS11) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) [19:29:12] (03PS1) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) [19:30:03] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1024 [19:30:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1024 [19:30:57] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1025 [19:30:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host dbproxy1025 [19:31:06] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 
(https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [19:31:29] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:31:31] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:33:50] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1025 [19:34:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1025 [19:35:00] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1026 [19:35:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1026 [19:35:32] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy1027 [19:35:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy1027 [19:36:54] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [19:37:23] (03PS12) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) [19:37:45] (03CR) 10CI reject: [V: 04-1] backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [19:38:36] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) [19:39:13] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy102{2..7} - jclark@cumin1001" [19:41:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy102{2..7} - jclark@cumin1001" [19:41:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:41:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:42:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:42:03] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:36] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:45:53] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:46:11] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1023.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:18] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) We had a chat about this. The first iteration will be a manual cookbook that takes a host as parameter. The cookbook will connect to the device and see if there is alre... 
[19:50:13] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:50:17] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:50:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1023.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:34] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:50:37] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:53:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:53:53] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:56:32] (03PS13) 10Andrew Bogott: backy2: switch to using postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) [19:56:34] (03CR) 10Jdrewniak: [C: 03+1] Turn on the A/B test for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922572 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [19:56:55] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:56:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:57:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:58:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:58:12] !log gmodena@deploy1002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:58:15] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:58:26] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) @Volans I am having issues with provisioning script with all servers right now it is not limited to this servers on this ticket if you have time this... [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T2000). [20:00:06] kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] Hello [20:00:28] * TheresNoTime can deploy [20:01:06] kimberly_sarabia: I assume we'll be wanting to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/922155/ to 1.41.0-wmf.9 ? [20:01:34] (03PS1) 10Samtar: Remove centraluserid dependency in ABRequirement.php [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922397 (https://phabricator.wikimedia.org/T336969) [20:01:56] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:05] (03PS3) 10Samtar: Turn on the A/B test for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922572 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:02:09] TheresNoTime: and WMF.10 surely? [20:02:28] oh, yes! 
[20:02:50] (03PS1) 10Samtar: Remove centraluserid dependency in ABRequirement.php [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922398 (https://phabricator.wikimedia.org/T336969) [20:03:45] kimberly_sarabia: am I correct in assuming you want 922397 on both .10 and .9 ? :) [20:04:00] TheresNoTime: one moment, let me double check [20:04:06] thank you :) [20:05:27] both is good too! [20:06:03] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) >>! In T330693#8841909, @Eevans wrote: > Per a discussion with @gmo... [20:06:35] TheresNoTime: both .10 and .9 is fine [20:06:44] great :) from looking at `922572: Turn on the A/B test for testwiki`, does it depend on `922397: Remove centraluserid dependency in ABRequirement.php` ? [20:07:26] TheresNoTime: Yes it does :) [20:08:01] Okay, we'll have to wait for 922397 and 922398 then [20:09:42] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:04] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:10:07] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:10:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) Thanks @Jclark-ctr I think we're good to do the other two lvs moves whenever you are ready. Please ping me on irc and we can arran... 
[20:18:30] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:18:32] 922397 had what looks like a transient CI failure, so I'm going to manually V+1 it [20:18:33] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:19:35] s/V+1/V+2 [20:19:48] (03CR) 10CI reject: [V: 04-1] Remove centraluserid dependency in ABRequirement.php [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922397 (https://phabricator.wikimedia.org/T336969) (owner: 10Samtar) [20:20:16] (03CR) 10Samtar: [V: 03+2] "Transient CI failure" [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922397 (https://phabricator.wikimedia.org/T336969) (owner: 10Samtar) [20:20:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922397 (https://phabricator.wikimedia.org/T336969) (owner: 10Samtar) [20:20:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922398 (https://phabricator.wikimedia.org/T336969) (owner: 10Samtar) [20:21:47] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:21:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:22:11] TheresNoTime: yeah looks like a selenium timeout, since it didn’t affect the master branch I’m thinking it’s a one-off failure too. [20:22:26] ^^ [20:23:16] ack :) I've +2'd, so just waiting for the merge - https://integration.wikimedia.org/zuul/#q=922398%20922397 [20:23:32] (doing both .9 and .10 together fwiw) [20:24:47] oh TIL that spaces don't work in that URL..
(: - https://integration.wikimedia.org/zuul/?#q=922398,922397 [20:29:25] (03PS1) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 [20:31:17] (03PS2) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 [20:33:55] (03PS3) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 [20:36:07] 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Volans) p:05Triage→03High [20:36:33] (03PS1) 10Eevans: cassandra-dev2001: upgrade to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/922609 (https://phabricator.wikimedia.org/T337344) [20:36:39] (03Merged) 10jenkins-bot: Remove centraluserid dependency in ABRequirement.php [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922397 (https://phabricator.wikimedia.org/T336969) (owner: 10Samtar) [20:36:47] (03Merged) 10jenkins-bot: Remove centraluserid dependency in ABRequirement.php [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922398 (https://phabricator.wikimedia.org/T336969) (owner: 10Samtar) [20:36:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Volans) @Jclark-ctr that's weird, I've opened T337345 as I don't see any DHCP traffic at all. 
[20:37:15] !log samtar@deploy1002 Started scap: Backport for [[gerrit:922397|Remove centraluserid dependency in ABRequirement.php (T336969)]], [[gerrit:922398|Remove centraluserid dependency in ABRequirement.php (T336969)]] [20:37:21] T336969: Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - https://phabricator.wikimedia.org/T336969 [20:38:50] !log samtar@deploy1002 samtar: Backport for [[gerrit:922397|Remove centraluserid dependency in ABRequirement.php (T336969)]], [[gerrit:922398|Remove centraluserid dependency in ABRequirement.php (T336969)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:39:13] okay kimberly_sarabia, jan_drewniak — those two are live on mwdebug, can you test? [20:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:40:08] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [20:40:19] TheresNoTime: thanks will do [20:42:41] TheresNoTime: ok, looks like nothing broke :P good to sync [20:42:56] woo, syncing [20:48:36] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:922397|Remove centraluserid dependency in ABRequirement.php (T336969)]], [[gerrit:922398|Remove centraluserid dependency in ABRequirement.php (T336969)]] (duration: 11m 20s) [20:48:41] T336969: Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - https://phabricator.wikimedia.org/T336969 [20:48:46] live, moving on to 922572 [20:48:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922572 
(https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:49:21] ok thanks [20:49:41] (03Merged) 10jenkins-bot: Turn on the A/B test for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922572 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:50:09] !log samtar@deploy1002 Started scap: Backport for [[gerrit:922572|Turn on the A/B test for testwiki (T336969)]] [20:51:37] !log samtar@deploy1002 ksarabia and samtar: Backport for [[gerrit:922572|Turn on the A/B test for testwiki (T336969)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:51:45] kimberly_sarabia, jan_drewniak — config patch live on mwdebug, can you test this? [20:52:28] yup [20:55:56] TheresNoTime: ok good to sync [20:56:09] ack :) [20:59:08] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:59:10] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:00:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:00:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:01:19] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:01:21] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:01:56] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:922572|Turn on the A/B test for testwiki (T336969)]] (duration: 11m 47s) [21:02:00] kimberly_sarabia: all patches now live :) [21:02:01] T336969: Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - 
https://phabricator.wikimedia.org/T336969 [21:02:22] TheresNoTime: thanks so much! [21:02:29] you're welcome :) [21:02:46] !log close UTC late backport window [21:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:06:45] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:06:46] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:06:48] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:08:12] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:09:25] (03CR) 10Dzahn: [C: 03+2] trafficserver: switch 15.wikipedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/761062 (https://phabricator.wikimedia.org/T337041) (owner: 10Dzahn) [21:09:45] (03PS4) 10Dzahn: trafficserver: switch 15.wikipedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/761062 (https://phabricator.wikimedia.org/T337041) [21:15:56] (03CR) 10Dzahn: "After merging and running puppet on cp4*, I could see that my requests stopped appearing in /var/log/apache2/access.log on miscweb1003.. 
b" [puppet] - 10https://gerrit.wikimedia.org/r/761062 (https://phabricator.wikimedia.org/T337041) (owner: 10Dzahn) [21:16:57] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb-k8s.yaml --hosts=15.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/761062 (https://phabricator.wikimedia.org/T337041) (owner: 10Dzahn) [21:22:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Volans) With @ayounsi we've checked a bunch of things and so far we didn't find anything wrong. The traffic seems to exit from `mr1` but dosn't make it to the... [21:28:04] (03CR) 10Dzahn: [C: 03+2] planet: add wikimediastatus.net to English feeds [puppet] - 10https://gerrit.wikimedia.org/r/921105 (https://phabricator.wikimedia.org/T336701) (owner: 10Dzahn) [21:33:38] (03PS2) 10Dzahn: gerrit: remove lfs_dir parameter, use hardcoded new default [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) [21:34:09] 10SRE, 10Release-Engineering-Team, 10serviceops, 10Continuous-Integration-Config, 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10thcipriani) [21:36:00] (03CR) 10Dzahn: "migration classes are usually made to be applied on a new machine _before_ a production role, to allow us to copy data, give shell access " [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [21:41:11] (03CR) 10Dzahn: "I don't think it's good that we use "$profile::ci::php::php_version" in the doc classes when there is no relation between profile::ci and " [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [21:44:29] 10SRE, 10Patch-For-Review: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322 (10Pppery) [21:45:13] 
(03CR) 10Dzahn: "could this have caused https://phabricator.wikimedia.org/T337345 ?" [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [22:02:00] (03CR) 10Hashar: Use same php version for doc and integration websites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:04:02] (03CR) 10Hashar: "An extra note PHP is only used for:" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:04:04] (03CR) 10Dzahn: Use same php version for doc and integration websites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:05:17] (03CR) 10Dzahn: "can we just use the PHP version to common.yaml then and look it up from each place that wants to use it? that will make it obvious" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:15:04] (03CR) 10Hashar: Use same php version for doc and integration websites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:17:28] (03PS5) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [22:19:47] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [22:34:18] (03PS2) 10Gergő Tisza: [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) [22:40:48] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad+codfw: 1 VM each site request for 
releases.wikimedia.org - https://phabricator.wikimedia.org/T337349 (10eoghan) [22:41:23] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad+codfw: 1 VM each site request for releases.wikimedia.org - https://phabricator.wikimedia.org/T337349 (10eoghan) [22:43:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [22:43:43] looking [22:45:24] (03PS6) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [22:45:24] esams, both text and upload [22:46:42] it's staying up, I'm depooling [22:47:22] * topranks here [22:47:52] topranks: know of anything? [22:48:09] no, can't spot anything, reports are mostly from RU but other than that not seeing a pattern [22:48:14] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [22:48:15] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10KFrancis) The NDA has been sent for signatures. I'll confirm when it's complete. Thanks!
[22:48:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [22:48:20] ack [22:48:55] oh it does look recovered, holding off on the depool for now [22:49:06] hrm [22:49:32] I see a few different Vodafone ASNs affected [22:50:03] starting a doc, I'll take IC even though it looks like we might not be taking much action :) [22:55:30] rzl: yes, I'm trying to disable peering to Vodafone NIKHEF at AMS-IX [22:55:41] as that seems to be common in every IP I've pulled from the NELs so far [22:55:52] ahh okay, let me know if you need anything [22:57:13] note these NELs aren't all age=0 so it's hard to establish an exact start time from the graphs, but it's still possible by Doing Math TM [23:00:44] nah it's made no difference, traffic avoids the direct link to them at AMS-IX, but gets to the same place (the long way) and stops [23:01:44] I'll switch the peering back up as it's not had an effect [23:02:48] ah okay [23:02:52] is it the kind of thing we can fix with a prepend, or is the problem too far away from us? [23:03:42] I guess either way it's ok for now, so maybe not worth it [23:04:36] bblack: prepends can sometimes work, is the vague answer I'm obliged to give [23:04:43] :) [23:04:50] I shut the peering completely, which is more effective than pre-pending out to that peer, but no effect [23:05:09] if we could hone in on a particular network, transit provider etc where the issue was maybe we could do something [23:05:25] but it looks to distant from us for us to affect [23:05:42] *too [23:05:50] yeah ok [23:06:21] if it fires more and it's the same basic case, maybe just ACK it as something we can't do much about.
[23:06:52] (which is kind of an interesting case, paging for something we can't do much about somewhere way out from us across the network) [23:07:59] topranks: would depooling esams have been effective here, do you think? [23:08:05] big hammer, I know [23:08:24] it could be, again hard to say [23:08:32] nod [23:08:45] the question there is if VF (and/or other networks affected) are having problems just reaching *parts* of the internet [23:08:55] in which case they may make it to say eqiad fine, but have trouble reaching esams [23:09:01] yeah, it /could/ be, but then there's some big tradeoffs. Maybe it fixes broken access for 2% of users, but adds 300ms of latency for another 30%, or whatever [23:09:07] or *more likely* the affected networks have a general problem [23:09:43] if we narrow down such a problem to a list of network ranges, we could also geodns-route those specific networks to another edge [23:09:49] (temporarily) [23:10:10] 300ms feels like a lot in a post-drmrs world but I still agree with you :P [23:10:26] by pushing an ops/dns commit to the "geo-maps" file, the "nets" section at the bottom. [23:10:30] nod [23:10:48] yeah my numbers were fake/hyperbolic :) [23:11:06] yeah obviously a full depool isn't something we'd want to sit with for very long, and I like the geoip approach as a good intermediate stage [23:11:18] given the problem seems distant from our edge, though, we'd likely have to send them somewhere further away than drmrs [23:11:23] oh that's true [23:11:46] actually you know what? we were getting NELs older than age=0 so that corroborates that [23:11:51] maybe for a case like RU, eqsin might be "the other direction" to avoid some bad link in the middle [23:11:53] it means they couldn't reach the geoip-next site either [23:12:17] (is it "next"? the second-best site we use for NEL reporting) [23:12:35] yeah, it's called "text-next" in the config [23:12:44] that one yeah, thanks [23:13:17] I gotta run out, text me if I can be useful!
[23:13:19] which means (a) more data to suggest that something was wrong, far from us, and (b) more data to suggest that depooling esams wouldn't have fixed it [23:13:24] thanks for the help [23:14:53] (03CR) 10Dzahn: [C: 03+1] Move doc.discovery.wmnet to new bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/922493 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [23:15:16] (03CR) 10Dzahn: [C: 03+1] Switch doc host from doc1002 to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [23:21:53] checking a few more IPs there I don't see there being much difference in reachability from locations in the US versus esams [23:22:35] number of NEL tcp.timeouts has also dropped considerably [23:23:43] yeah I think we can comfortably call it passed at this point [23:23:50] yep [23:23:53] I know it's super late for you, really appreciate the help [23:24:28] np, happened to be up :) [23:27:12] planned some maintenance on CI servers. it would mean potentially jenkins down for a little bit. but if you are in an incident and might need it..
I will postpone [23:27:57] thanks,it looks like we're out of the woods though, you can go ahead [23:28:32] ok, thanks [23:30:05] !log contint*, releases* - maintenance - changing UID of jenkins user - jenkins will be stopped for a little bit, releases-jenkins is first though - T324659 [23:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:11] T324659: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 [23:30:24] (03CR) 10Dzahn: [C: 03+2] jenkins: switch to fixed uid/gid 924 [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [23:34:48] (03CR) 10Dzahn: [C: 03+2] "additional commands needed:" [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [23:41:48] !log releases1002 (releases.wikimedia.org) stopping jenkins for maintenance [23:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:06] (ProbeDown) firing: (3) Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:44:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on releases1002.eqiad.wmnet with reason: maintenance [23:44:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on releases1002.eqiad.wmnet with reason: maintenance [23:52:10] !log releases1002 - jenkins service running again, this is the active host behind releases-jenkins.wikimedia.org - maintenance for releases* done [23:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:44] (03CR) 10Dzahn: [C: 03+2] "releases server done first:" [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) 
(owner: 10Hashar)