[00:01:13] (DiskSpace) resolved: Disk space an-airflow1001:9100:/ 5.986% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:28:53] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:36:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.998% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:39:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913934
[00:39:18] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913934 (owner: TrainBranchBot)
[00:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[00:57:55] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913934 (owner: TrainBranchBot)
[01:07:33] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T335722 (phaultfinder)
[01:50:00] (PowerSupply) firing: Power Supply - Status - issue on mw1466:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw1466 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T0200)
[02:01:34] (PS1) RLazarus: Multiply sli_queries by 100 [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032
[02:07:44] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.7 [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/913935 (https://phabricator.wikimedia.org/T330213)
[02:07:52] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.7 [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/913935 (https://phabricator.wikimedia.org/T330213) (owner: TrainBranchBot)
[02:09:34] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:34] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:52] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.7 [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/913935 (https://phabricator.wikimedia.org/T330213) (owner: TrainBranchBot)
[02:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T0300)
[03:00:59] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:13] (DiskSpace) resolved: Disk space an-airflow1001:9100:/ 5.942% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:01:21] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/914036 (https://phabricator.wikimedia.org/T330213)
[03:01:23] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/914036 (https://phabricator.wikimedia.org/T330213) (owner: TrainBranchBot)
[03:02:08] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/914036 (https://phabricator.wikimedia.org/T330213) (owner: TrainBranchBot)
[03:02:38] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.7 refs T330213
[03:02:41] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[03:12:31] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:37] (CR) RLazarus: "Dashboard/slo-Trafficserver view: https://grafana.wikimedia.org/dashboard/snapshot/Z448EVFVjw38nVHbTgO5syAEqaSIML6a" [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[03:33:35] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:38:35] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:51:59] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.7 refs T330213 (duration: 49m 21s)
[03:52:02] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[03:54:19] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.5 (duration: 02m 17s)
[04:37:53] looks like there is no labor day for the train bot :)
[04:57:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[05:03:24] SRE, Beta-Cluster-Infrastructure, Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (hashar) Open→Resolved a:hashar For other use case, looks like that will be done as part {T335354} I...
[05:36:51] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1001, ...), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:50:00] (PowerSupply) firing: Power Supply - Status - issue on mw1466:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw1466 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[05:58:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 17961
[05:58:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 17961
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T0600)
[06:00:06] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T0600).
[06:01:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 10089
[06:02:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 10089
[06:03:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 132132
[06:04:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 132132
[06:05:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 48237
[06:06:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 48237
[06:06:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 293
[06:07:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 293
[06:07:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136106
[06:08:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136106
[06:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[06:49:29] (PS1) Muehlenhoff: Remove access for cmjohnson [puppet] - https://gerrit.wikimedia.org/r/914178
[06:52:43] (CR) Muehlenhoff: [C: +2] Remove access for cmjohnson [puppet] - https://gerrit.wikimedia.org/r/914178 (owner: Muehlenhoff)
[06:53:15] SRE, Traffic, serviceops, Abstract Wikipedia team (Phase λ – Launch), HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Joe) we also need to add wikifunctions to our internal certs
[06:56:25] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Cmjohnson out of all services on: 1274 hosts
[06:57:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Cmjohnson out of all services on: 1274 hosts
[06:57:36] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Cmjohnson out of all services on: 794 hosts
[06:57:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Cmjohnson out of all services on: 794 hosts
[07:00:05] Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T0700).
[07:00:05] MdsShakil and tgr: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] Hello 🙋
[07:00:49] o/
[07:00:53] o/
[07:01:13] starting with MdsShakil's patch
[07:01:33] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913225 (https://phabricator.wikimedia.org/T335705) (owner: MdsShakil)
[07:03:07] (Merged) jenkins-bot: Enable WikiLove extension on bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/913225 (https://phabricator.wikimedia.org/T335705) (owner: MdsShakil)
[07:03:44] !log taavi@deploy1002 Started scap: Backport for [[gerrit:913225|Enable WikiLove extension on bnwikibooks (T335705)]]
[07:03:47] T335705: Enable WikiLove extension on bnwikibooks - https://phabricator.wikimedia.org/T335705
[07:05:32] !log taavi@deploy1002 taavi and mdsshakil: Backport for [[gerrit:913225|Enable WikiLove extension on bnwikibooks (T335705)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:05:52] MdsShakil: please test
[07:06:06] LGTM
[07:06:34] taavi:
[07:06:42] thanks, syncing
[07:11:43] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:913225|Enable WikiLove extension on bnwikibooks (T335705)]] (duration: 07m 59s)
[07:11:46] T335705: Enable WikiLove extension on bnwikibooks - https://phabricator.wikimedia.org/T335705
[07:11:51] that's live
[07:12:04] (PS1) Muehlenhoff: Remove cmjohnson from Icinga config [puppet] - https://gerrit.wikimedia.org/r/914259
[07:12:05] tgr_: do you want to self-deploy or would you prefer if I did that for you?
[07:12:24] taavi: I can deploy.
[07:12:32] taavi: ty
[07:12:34] sure, works for me
[07:13:05] SRE, Traffic, serviceops, Abstract Wikipedia team (Phase λ – Launch), HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Joe)
[07:14:59] (CR) Giuseppe Lavagetto: Re-vamp integration testing (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/909204 (owner: Giuseppe Lavagetto)
[07:16:09] (CR) Muehlenhoff: [C: +2] Remove cmjohnson from Icinga config [puppet] - https://gerrit.wikimedia.org/r/914259 (owner: Muehlenhoff)
[07:16:12] (CR) Jelto: [C: +2] gitlab runner: allow node:* images [puppet] - https://gerrit.wikimedia.org/r/911407 (https://phabricator.wikimedia.org/T335320) (owner: Mhurd)
[07:16:15] (CR) Gergő Tisza: [C: +2] [noop] Disable section image recommendations in production [mediawiki-config] - https://gerrit.wikimedia.org/r/913965 (https://phabricator.wikimedia.org/T329276) (owner: Gergő Tisza)
[07:18:06] (PS1) Muehlenhoff: Remove router access for cmjohnson [homer/public] - https://gerrit.wikimedia.org/r/914260
[07:19:55] (PS2) Gergő Tisza: [noop] Disable section image recommendations in production [mediawiki-config] - https://gerrit.wikimedia.org/r/913965 (https://phabricator.wikimedia.org/T329276)
[07:20:28] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913965 (https://phabricator.wikimedia.org/T329276) (owner: Gergő Tisza)
[07:27:59] is zuul down?
[07:29:36] the patch is CR+2 V+2 and not doing anything
[07:29:55] oh well
[07:30:39] !log tgr@deploy1002 Started scap: Backport for [[gerrit:913965|[noop] Disable section image recommendations in production (T329276)]]
[07:30:42] T329276: Section-level images: create experiment variant and related tooling for opting in/out - https://phabricator.wikimedia.org/T329276
[07:32:08] !log tgr@deploy1002 tgr: Backport for [[gerrit:913965|[noop] Disable section image recommendations in production (T329276)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:33:14] (PS6) Giuseppe Lavagetto: Re-visit scaffolding [deployment-charts] - https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818)
[07:33:19] tgr_: it looks like you might have been hit by T333241
[07:33:20] T333241: scap backport applied +2 to wrong changeset when run shortly after a rebase and hung waiting for merge - https://phabricator.wikimedia.org/T333241
[07:34:25] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[07:34:27] hm, right
[07:37:50] I glanced at that bug but the title was misleading
[07:38:08] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:913965|[noop] Disable section image recommendations in production (T329276)]] (duration: 07m 29s)
[07:38:11] but yeah T333241#8731590 is exactly what I did
[07:38:12] T329276: Section-level images: create experiment variant and related tooling for opting in/out - https://phabricator.wikimedia.org/T329276
[07:39:01] (PS5) Gergő Tisza: OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750)
[07:39:20] (CR) Muehlenhoff: [C: +2] Fix docker-reporter config for legacy images [puppet] - https://gerrit.wikimedia.org/r/913153 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff)
[07:39:29] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: Gergő Tisza)
[07:40:10] (Merged) jenkins-bot: OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: Gergő Tisza)
[07:40:40] !log tgr@deploy1002 Started scap: Backport for [[gerrit:910815|OAuth: Do not require approval for read-only grants on public wikis (T67750)]]
[07:40:42] T67750: Low-risk OAuth consumers should be automatically approved - https://phabricator.wikimedia.org/T67750
[07:41:27] SRE, Traffic: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (Vgutierrez) In progress→Resolved can be closed, cheers!
[07:42:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:42:05] !log tgr@deploy1002 tgr: Backport for [[gerrit:910815|OAuth: Do not require approval for read-only grants on public wikis (T67750)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:43:34] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:23] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[07:47:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:48:13] (PS1) Filippo Giunchedi: aptrepo: restrict Grafana updates to 9.3.x [puppet] - https://gerrit.wikimedia.org/r/914261 (https://phabricator.wikimedia.org/T335557)
[07:48:17] (PS2) Muehlenhoff: Revert "sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage" [cookbooks] - https://gerrit.wikimedia.org/r/912311 (https://phabricator.wikimedia.org/T330495)
[07:48:19] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:910815|OAuth: Do not require approval for read-only grants on public wikis (T67750)]] (duration: 07m 39s)
[07:48:22] T67750: Low-risk OAuth consumers should be automatically approved - https://phabricator.wikimedia.org/T67750
[07:49:17] !log UTC morning deploys done
[07:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:53] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (fgiunchedi)
[07:58:02] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/914261 (https://phabricator.wikimedia.org/T335557) (owner: Filippo Giunchedi)
[07:58:57] (CR) Filippo Giunchedi: [C: +2] aptrepo: restrict Grafana updates to 9.3.x [puppet] - https://gerrit.wikimedia.org/r/914261 (https://phabricator.wikimedia.org/T335557) (owner: Filippo Giunchedi)
[07:59:26] (PS4) Hashar: gerrit: relocate LFS data [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143)
[08:02:16] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[08:03:07] (CR) Hashar: [C: -1] "I will revisit. I explicitly did not want to hardcode the path and instead use the default relatively to $GERRIT_SITE. I will have to chec" [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[08:07:27] !log upgrade grafana to 9.3.13
[08:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:48] (CR) Aqu: "The changes have been implemented." [puppet] - https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: Aqu)
[08:08:50] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[08:13:22] (CR) Ayounsi: [C: +1] Remove router access for cmjohnson [homer/public] - https://gerrit.wikimedia.org/r/914260 (owner: Muehlenhoff)
[08:21:49] (PS1) Elukey: fastapi-app: add configmap template [deployment-charts] - https://gerrit.wikimedia.org/r/914262
[08:26:22] (CR) Ilias Sarantopoulos: [C: +1] fastapi-app: add configmap template [deployment-charts] - https://gerrit.wikimedia.org/r/914262 (owner: Elukey)
[08:27:31] !log stage Junos 21 on asw-c-codfw - T334049
[08:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:34] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049
[08:27:48] Puppet, Data-Platform-SRE, Wikidata, Wikidata-Query-Service, and 2 others: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (Gehel)
[08:28:08] !log updated netboot image for Bullseye 11.7 T335575
[08:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:11] T335575: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575
[08:28:52] SRE, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Gehel)
[08:29:10] SRE, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Gehel)
[08:37:55] (CR) Elukey: [C: +2] fastapi-app: add configmap template [deployment-charts] - https://gerrit.wikimedia.org/r/914262 (owner: Elukey)
[08:38:16] (PS2) Ladsgroup: db1132: Disable notification [puppet] - https://gerrit.wikimedia.org/r/913662 (https://phabricator.wikimedia.org/T335632)
[08:38:22] (CR) Ladsgroup: [V: +2 C: +2] db1132: Disable notification [puppet] - https://gerrit.wikimedia.org/r/913662 (https://phabricator.wikimedia.org/T335632) (owner: Ladsgroup)
[08:40:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[08:42:03] SRE, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[08:44:17] !log testing haproxy 2.6.12-1~bpo10+1+wmf1 in cp1077 and cp1085 - T334448
[08:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:20] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448
[08:50:28] (Abandoned) Alexandros Kosiaris: mesh: Fix a mess with trimming ending whitespace [deployment-charts] - https://gerrit.wikimedia.org/r/912283 (owner: Alexandros Kosiaris)
[08:50:31] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:50:50] (CR) Volans: [C: +2] distros: add bookworm-wikimedia to known distros [puppet] - https://gerrit.wikimedia.org/r/912931 (owner: Volans)
[08:51:37] (CR) Clément Goubert: [C: +2] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[08:51:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[08:53:05] Lucas_WMDE: fyi I'm about to try and move termbox-test to mw-api-int again [08:53:14] ok [08:53:28] fingers crossed \o/ [08:57:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [08:59:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:12] (03Merged) 10jenkins-bot: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [09:02:48] Good morning! We'll be starting the gitlab migration from codfw back to eqiad in the next few minutes, so there will be some downtime ahead. We expect the process to take approximately two hours. Any issues, please let us know! https://phabricator.wikimedia.org/T335504 [09:04:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:17] Lucas_WMDE: Ah, it's not using mesh, I suppose test.wikidata connects to port 3031 and not 4004 [09:06:51] (03CR) 10Btullis: [C: 03+1] "Looks good. You'll need a corresponding secret in the private repo for this." 
[puppet] - 10https://gerrit.wikimedia.org/r/911296 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [09:08:47] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Language-Team (Language-2023-April-June), and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) [09:09:43] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914267 (https://phabricator.wikimedia.org/T333137) (owner: 10WMDE-Fisch) [09:10:40] (03CR) 10Btullis: [C: 03+1] "Nice. Thanks for all of the cleanups too." [puppet] - 10https://gerrit.wikimedia.org/r/912301 (owner: 10Muehlenhoff) [09:10:46] !log eoghan@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org [09:11:11] jouncebot: nowandnext [09:11:11] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [09:11:11] In 0 hour(s) and 48 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1000) [09:11:15] good [09:11:29] (03PS3) 10Ladsgroup: Remove 1024px and 1920px from pre-gen thumbsizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) [09:11:37] (03CR) 10Btullis: [C: 03+1] Configure product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [09:11:39] (03CR) 10Ladsgroup: [C: 03+2] Remove 1024px and 1920px from pre-gen thumbsizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) (owner: 10Ladsgroup) [09:12:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) (owner: 10Ladsgroup) [09:12:28] (03Merged) 
10jenkins-bot: Remove 1024px and 1920px from pre-gen thumbsizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) (owner: 10Ladsgroup) [09:12:55] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]] [09:12:58] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 [09:13:01] !log ladsgroup@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. (duration: 00m 05s) [09:13:20] fatal: unable to access 'https://gitlab.wikimedia.org/repos/releng/release.git/': The requested URL returned error: 502 [09:13:58] Amir1: See eoghan's message 10 minutes ago [09:14:00] gitlab dc migration just started :/ [09:14:07] aah, okay [09:14:11] I missed it, sorry [09:14:15] (03PS9) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [09:14:17] also phab tasks are failing to load, not sure if related [09:14:37] ping me once this is over [09:14:42] taavi: Should be unrelated, I appear to be able to load tasks ok. [09:14:43] please 🥺 [09:15:18] Amir1: Of course, will do. We expect this to take approximately 2 hours. [09:15:27] taavi: I'm having the same issue [09:15:45] "Unhandled Exception ("RuntimeException") [09:15:47] Invalid argument supplied for foreach() [09:15:56] eoghan: I am seeing `Invalid argument supplied for foreach()` `called at [/src/customfields/GitLabPatchesCustomField.php:113]` in logstash. 
for example https://phabricator.wikimedia.org/T329425 fails [09:16:12] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:16:17] Yeah, I'm seeing that when I load a task that I haven't opened before. [09:17:45] yes, it calls [09:17:45] fatal: unable to access 'https://gitlab.wikimedia.org/toolforge-repos/toolpilot.git/': The requested URL returned error: 502 [09:17:55] and similar [09:18:27] volans: For the phabricator issue, or for a scap deployment? [09:18:53] phab [09:19:19] where is the source code of our phab extensions these days? is that gitlab too? [09:19:34] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:20:16] FYI I just got the above looking at /var/log/phd/daemons.log [09:21:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [09:21:38] !log eoghan@cumin1001 END (ERROR) - Cookbook sre.gitlab.failover (exit_code=97) Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org [09:21:50] Any info about the Phabricator down? [09:21:57] it's down [09:22:08] eoghan is pausing the switch of gitlab and we will investiage the unknown dependency to phabricator first [09:23:08] I could probably work out a hotfix for the code to properly degrade when gitlab is down.. but I'm not finding the code. 
both rPHEX and phabricator/extensions.git on gerrit don't have the gitlab integration in the first place [09:23:26] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:34] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:24:44] gitlab is back again, phabricator should recover soon [09:24:58] Gitlab is back, and tasks in phabricator are loading. However, the error pages may be cached so the errors may persist for a short while. [09:25:13] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:16] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:56] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:28:20] (03PS7) 10Giuseppe Lavagetto: Re-visit scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) [09:28:54] (03CR) 10Giuseppe Lavagetto: Re-visit scaffolding (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [09:30:40] (03PS1) 10Majavah: hieradata: remove 
files for long-gone hosts [puppet] - 10https://gerrit.wikimedia.org/r/914268 [09:30:42] (03PS1) 10Majavah: O:wmcs::nfs: delete old primary role files [puppet] - 10https://gerrit.wikimedia.org/r/914269 [09:30:44] (03PS1) 10Majavah: P::ldap::client::labs: drop support for production [puppet] - 10https://gerrit.wikimedia.org/r/914270 [09:30:46] (03PS1) 10Majavah: O:wmcs::nfs: delete old test role [puppet] - 10https://gerrit.wikimedia.org/r/914271 [09:30:48] (03PS1) 10Majavah: labstore: remove unused files [puppet] - 10https://gerrit.wikimedia.org/r/914272 [09:34:03] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10BTullis) [09:38:44] (03CR) 10David Caro: [C: 03+1] "LGTM, just to make sure, this will load /root/.config/openstack/clouds.yaml over /etc/openstack/clouds.yaml right?" [puppet] - 10https://gerrit.wikimedia.org/r/913985 (owner: 10Andrew Bogott) [09:40:30] taavi: it seems some phab extentions are hosted on gitlab already: https://gitlab.wikimedia.org/repos/phabricator/extensions [09:40:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Re-visit scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [09:41:14] taavi: also phabricator is back again, can you remove the message from the title again? [09:41:53] We've decided to postpone the gitlab switchover due to the issue we discovered with phabricator. 
Gitlab should now be back up and running correctly, let us know in #wikimedia-gitlab if you have any problems [09:42:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:46:04] probably ping Amir1 because the gitlab migration is not currently ongoing anymore? (but cc eoghan) [09:46:14] jelto: yes, sorry [09:47:03] Lucas_WMDE: Yep, thanks! [09:47:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:47:04] (03PS1) 10Clément Goubert: InitialiseSettings.php: Change termbox url for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) [09:47:23] Amir1: Just in case you haven't seen, the gitlab maintenance is postponed so you can proceed with your deployment (: [09:48:15] (03Merged) 10jenkins-bot: Re-visit scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/885281 (https://phabricator.wikimedia.org/T292818) (owner: 10Giuseppe Lavagetto) [09:50:00] (PowerSupply) firing: Power Supply - Status - issue on mw1466:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw1466 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:50:45] thanks [09:51:33] (03PS1) 10Clément Goubert: termbox: Migrate from staging-test to staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/914275 (https://phabricator.wikimedia.org/T334064) [09:51:34] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]] [09:51:37] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 [09:53:22] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [09:55:17] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10BTullis) [09:56:31] (03CR) 10Muehlenhoff: [C: 03+2] apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [09:59:25] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade - T334049 [09:59:29] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [09:59:47] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade -... 
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1000) [10:00:15] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:912837|Remove 1024px and 1920px from pre-gen thumbsizes (T211661)]] (duration: 08m 40s) [10:00:21] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 [10:05:53] (03PS1) 10Btullis: Bump the mediawiki_history snapshot to include data for April [puppet] - 10https://gerrit.wikimedia.org/r/914276 [10:07:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40975/console" [puppet] - 10https://gerrit.wikimedia.org/r/914276 (owner: 10Btullis) [10:08:09] (03PS2) 10Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) [10:10:38] (03PS1) 10WMDE-Fisch: Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914286 (https://phabricator.wikimedia.org/T335648) [10:10:59] (03PS1) 10WMDE-Fisch: Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914287 (https://phabricator.wikimedia.org/T335648) [10:12:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:14:21] (03CR) 10Majavah: [C: 04-1] "The issues with file formats in https://gerrit.wikimedia.org/r/c/operations/puppet/+/913121/ apply here too" [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:14:34] (03CR) 10Btullis: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40976/console" [puppet] - 10https://gerrit.wikimedia.org/r/910460 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:16:04] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Michaelcochez) We do our development on github. Does it make more sense to restart with a new repository on gitlab to mirror that, or better to migrate? [10:16:40] (03PS1) 10Ladsgroup: Set externallinks migration to read new in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914280 (https://phabricator.wikimedia.org/T335343) [10:20:05] RECOVERY - Check whether ferm is active by checking the default input chain on sretest1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:22:59] jouncebot: nowandnext [10:22:59] For the next 0 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1000) [10:22:59] In 2 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1300) [10:22:59] In 2 hour(s) and 37 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1300) [10:28:01] (03CR) 10Volans: "The change is quite extensive and it's pretty hard to tell if it's doing the right thing in all cases. 
I trust that between you and Arzhel" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [10:28:56] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [10:30:21] (03CR) 10Ladsgroup: [C: 03+1] Set externallinks migration to read new in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914280 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [10:31:47] (03PS1) 10Muehlenhoff: Remove more stretch images [puppet] - 10https://gerrit.wikimedia.org/r/914282 (https://phabricator.wikimedia.org/T335282) [10:32:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914280 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [10:33:16] (03Merged) 10jenkins-bot: Set externallinks migration to read new in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914280 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [10:33:27] (03CR) 10Volans: Apply black to all python files (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/907880 (owner: 10Ayounsi) [10:33:42] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:914280|Set externallinks migration to read new in testwiki (T335343)]] [10:33:46] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [10:35:18] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:914280|Set externallinks migration to read new in testwiki (T335343)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:41:14] (03PS1) 10Hnowlan: Add all fonts from mediawiki::packages::fonts [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/914285 (https://phabricator.wikimedia.org/T335681) [10:42:50] hnowlan: my apologies for getting you into this mess [10:44:48] 10SRE, 10Infrastructure-Foundations, 10netops: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10cmooney) 05Open→03Resolved [10:47:10] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:914280|Set externallinks migration to read new in testwiki (T335343)]] (duration: 13m 27s) [10:47:13] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [10:51:20] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10hashar) >>! In T332953#8819607, @Michaelcochez wrote: > We do our development on github. Does it make more sense to restart with a new repository on... [10:52:40] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in codfw: codfw row C switches upgrade - T334049 [10:52:43] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [10:53:00] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade -... 
[11:06:00] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10cmooney) p:05Triage→03Medium [11:06:21] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10cmooney) [11:10:30] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10aborrero) I plan to do {T335759} then we can specify the FQDN to use for the bird config. Otherwise I think we would need to hardcod... [11:10:39] (03PS1) 10Elukey: modules: duplicate the istio ingress template for 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914306 (https://phabricator.wikimedia.org/T335756) [11:10:41] (03PS1) 10Elukey: modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) [11:12:43] (03PS1) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 [11:13:11] (03CR) 10CI reject: [V: 04-1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:13:19] (03PS2) 10Elukey: modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) [11:13:21] (03PS2) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 [11:13:44] (03CR) 10CI reject: [V: 04-1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:23:28] (03PS3) 10Elukey: 
modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) [11:23:30] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40979/console" [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: 10Aqu) [11:24:05] (03PS3) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 [11:25:16] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10cmooney) @aborrero hey. Yeah I can understand why having to hardcode the IPs in the puppet tree is not a great option. Unfortunate... [11:26:19] (03CR) 10CI reject: [V: 04-1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:28:45] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses [dns] - 10https://gerrit.wikimedia.org/r/914310 (https://phabricator.wikimedia.org/T335759) [11:30:22] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10aborrero) yeah I'm thinking about doing something like `resolve_ipv4(whateverserver.codfw.hw.wikimedia.cloud)`, so basically let pup... 
[11:30:52] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [11:31:23] (03PS4) 10Elukey: modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) [11:31:30] (03PS2) 10Arturo Borrero Gonzalez: wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses [dns] - 10https://gerrit.wikimedia.org/r/914310 (https://phabricator.wikimedia.org/T335759) [11:32:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:33:24] (03PS4) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 [11:38:28] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Agree on driving it from Netbox, but that's trickier while the IPs are defined in puppet and then imported to netbox (as is the cas" [dns] - 10https://gerrit.wikimedia.org/r/914310 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [11:40:11] Amir1: not at all, glad to catch stuff like this. 
Relieved it's a relatively straightforward fix [11:41:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:46:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter2004.codfw.wmnet [11:47:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses [dns] - 10https://gerrit.wikimedia.org/r/914310 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [11:47:17] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: let power supply issues open tasks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/913110 (https://phabricator.wikimedia.org/T225140) (owner: 10Filippo Giunchedi) [11:49:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest1003 [11:49:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 41 hosts with reason: Row c switch maint T334049 [11:49:17] (03CR) 10Klausman: [C: 03+1] modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [11:49:19] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [11:49:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 41 hosts with reason: Row c switch maint T334049 [11:50:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2004.codfw.wmnet [11:50:14] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add label to prometheus3002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [11:50:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1003 [11:50:42] !log 
jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host puppetmaster1006 [11:50:42] (03CR) 10Klausman: [C: 03+1] modules: duplicate the istio ingress template for 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914306 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [11:50:48] (03CR) 10Muehlenhoff: "The PCC error for cloudbackup1002-dev is unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:51:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [11:51:27] !log stop slave on db1130 (eqiad master of s5) (T334049) [11:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:58] (03CR) 10Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [11:52:01] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10cmooney) @aborrero yep that should work. Potentially a race condition there if we drive the DNS from Netbox, which will only get... [11:52:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetmaster1006 [11:52:14] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host backup1010 [11:52:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1010 [11:52:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:52:32] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host backup1011 [11:52:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1011 [11:53:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter2003.codfw.wmnet [11:55:00] (PowerSupply) resolved: Power Supply - Status - issue on mw1466:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw1466 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [11:56:19] (03CR) 10Filippo Giunchedi: prometheus::k8s switch staging-codfw to client cert auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:57:01] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [11:57:07] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914286 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [11:57:15] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914287 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [11:57:36] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2003.codfw.wmnet [11:57:56] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Failover DNS from prometheus3001 to prometheus3002 in esams [dns] - 10https://gerrit.wikimedia.org/r/913192 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [11:58:04] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs [dns] - 10https://gerrit.wikimedia.org/r/913198 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [11:58:12] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Failover DNS from prometheus4001 to prometheus4002 in ulsfo [dns] - 10https://gerrit.wikimedia.org/r/913194 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [11:58:32] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Failover DNS from prometheus5001 to prometheus5002 in eqsin [dns] - 10https://gerrit.wikimedia.org/r/913196 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [12:00:21] (03CR) 10Hashar: [C: 04-1] Make Scap directories on deployment servers compatible with CVE-2022-24756 fix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912853 (https://phabricator.wikimedia.org/T335354) (owner: 10Muehlenhoff) [12:01:09] (03CR) 10Kamila Součková: [C: 03+1] "LGTM." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/914285 (https://phabricator.wikimedia.org/T335681) (owner: 10Hnowlan) [12:03:35] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host backup1011 [12:03:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1011 [12:03:41] (03CR) 10Hashar: [C: 04-1] Make Scap directories on deployment servers compatible with CVE-2022-24756 fix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912853 (https://phabricator.wikimedia.org/T335354) (owner: 10Muehlenhoff) [12:05:41] !log stop slave again on db1130 (eqiad master of s5) (T334049) [12:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:45] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [12:06:20] for people oncall, if you get a page, please let me know, it should not page but ... [12:10:13] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=ldap-replica2005.wikimedia.org [12:11:19] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10MoritzMuehlenhoff) [12:12:20] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable Kartographer Nearby on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914267 (https://phabricator.wikimedia.org/T333137) (owner: 10WMDE-Fisch) [12:15:25] (03CR) 10Btullis: [C: 03+2] analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: 10Aqu) [12:16:55] (03PS1) 10Elukey: ml-services: add private secretes to the ores-legacy helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/914313 (https://phabricator.wikimedia.org/T330414) [12:17:42] (03PS2) 10Elukey: ml-services: add private secretes to the ores-legacy helmfile config 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/914313 (https://phabricator.wikimedia.org/T330414) [12:17:47] !log stop slave on eqiad masters of s1, x1, s8 (T334049) [12:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:51] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [12:19:09] (03PS1) 10Ssingh: depool codfw for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/914314 (https://phabricator.wikimedia.org/T334049) [12:20:32] (03CR) 10Ssingh: [C: 03+2] depool codfw for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/914314 (https://phabricator.wikimedia.org/T334049) (owner: 10Ssingh) [12:20:36] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Eevans) [12:20:50] !log run authdns-update to depool codfwL T334049 [12:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:36] (03CR) 10Ssingh: [C: 03+2] hiera: temporarily remove dns2001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/913952 (https://phabricator.wikimedia.org/T334049) (owner: 10Ssingh) [12:24:41] !log installing LInux 5.10.178 on bullseye hosts [12:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:31] (03PS1) 10Eevans: sessionstore: disable client connections to sessionstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/914315 (https://phabricator.wikimedia.org/T334049) [12:25:57] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ssingh) [12:26:17] (03CR) 10Elukey: [C: 03+2] ml-services: add private secretes to the ores-legacy helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/914313 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [12:27:03] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 
11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Eevans) [12:27:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:27:50] ^ expected [12:28:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:28:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [12:29:34] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:30:47] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:30:55] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:31:08] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) [12:31:32] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route check 2 services: maintenance [12:31:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 2 services: maintenance [12:31:39] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:31:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:31:59] PROBLEM - Bird Internet Routing Daemon on durum2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:32:11] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:33:23] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [12:36:04] (03PS1) 10Elukey: ml-services: add network policies for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414) [12:37:31] (03PS2) 10Elukey: ml-services: add network policies for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414) [12:38:24] (03PS1) 10Clément Goubert: Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914294 [12:38:37] (03CR) 10Clément Goubert: [C: 03+2] Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914294 (owner: 10Clément Goubert) [12:38:51] (03CR) 10Ayounsi: wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/914310 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [12:41:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/914310 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [12:42:07] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Eevans) [12:42:20] (03Abandoned) 10Eevans: sessionstore: disable client connections to sessionstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/914315 
(https://phabricator.wikimedia.org/T334049) (owner: 10Eevans) [12:42:27] (03CR) 10Effie Mouzeli: [C: 03+1] recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: 10Clément Goubert) [12:42:31] (03CR) 10Klausman: [C: 03+1] ml-services: add network policies for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [12:42:36] (03PS1) 10Filippo Giunchedi: alertmanager: do not notify sretest instances [puppet] - 10https://gerrit.wikimedia.org/r/914320 (https://phabricator.wikimedia.org/T333204) [12:43:49] (03CR) 10Svantje Lilienthal: [C: 03+1] Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914286 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [12:44:40] (03Merged) 10jenkins-bot: Revert "termbox: Switch to mw-api-int-async on k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914294 (owner: 10Clément Goubert) [12:45:49] (03CR) 10Ayounsi: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/914320 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [12:45:58] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Eevans) [12:50:31] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:52:06] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: do not notify sretest instances [puppet] - 10https://gerrit.wikimedia.org/r/914320 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [12:53:09] (03PS1) 
10Jcrespo: Fix bug by which %% wasn't adequately escaped on sql queries [software/mediabackups] - 10https://gerrit.wikimedia.org/r/914321 (https://phabricator.wikimedia.org/T327157) [12:54:11] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 186 hosts with reason: codfw row C upgrade [12:54:12] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on 186 hosts with reason: codfw row C upgrade [12:56:02] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) [12:57:07] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:16] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [12:59:15] (03Abandoned) 10Alexandros Kosiaris: DNM: Showcase row-level mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1300) [13:00:06] WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1300) [13:00:40] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [13:00:41] \o/ [13:01:07] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 185 hosts with reason: codfw row C upgrade [13:01:39] (03PS1) 10Alexandros Kosiaris: machinetranslation: Deploy 2023-05-02-080334-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914322 (https://phabricator.wikimedia.org/T331505) [13:02:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Deploy 2023-05-02-080334-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914322 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [13:03:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 185 hosts with reason: codfw row C upgrade [13:03:21] * urbanecm waves [13:03:27] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=21224f03-d3c2-4431-accb-64fcadd01a0f) set by ayounsi@cumin1001 for 2:00:00 on 185 host(s) and th... [13:03:33] WMDE-Fisch: hi, do you want to / can you self service with the deployment? or should i deploy? [13:04:01] Please go forward. Family today. 
[13:04:02] (03PS1) 10Hokwelum: make snapshot101[45] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/914323 [13:04:09] So a bit async [13:04:10] :-D [13:04:12] sure [13:04:28] (03CR) 10Urbanecm: [C: 03+2] Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914286 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [13:04:40] (03CR) 10Urbanecm: [C: 03+2] Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914287 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [13:05:02] WMDE-Fisch: does the config change depend on the backports? or can i sync it before them? [13:05:12] No does not depend on them. [13:05:19] ack [13:05:38] !log rebooting asw-c-codfw for software upgrade - T334049 [13:05:39] (03PS2) 10Urbanecm: Enable Kartographer Nearby on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914267 (https://phabricator.wikimedia.org/T333137) (owner: 10WMDE-Fisch) [13:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:41] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [13:05:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914267 (https://phabricator.wikimedia.org/T333137) (owner: 10WMDE-Fisch) [13:06:47] (03Merged) 10jenkins-bot: Enable Kartographer Nearby on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914267 (https://phabricator.wikimedia.org/T333137) (owner: 10WMDE-Fisch) [13:07:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914267|Enable Kartographer Nearby on mobile (T333137)]] [13:07:19] T333137: Test and enable mobile support for the nearby feature - https://phabricator.wikimedia.org/T333137 [13:07:30] (Emergency syslog message) 
firing: Alert for device asw-c-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:08:23] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:08:30] (virtual-chassis crash) firing: Alert for device asw-c-codfw.mgmt.codfw.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:08:37] (03Merged) 10jenkins-bot: machinetranslation: Deploy 2023-05-02-080334-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914322 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [13:08:57] (ProbeDown) firing: (2) Service doc1002.eqiad.wmnet:443 has failed probes (http_doc1002_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:07] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:07] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:42] !incidents [13:09:43] 3569 (UNACKED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [13:09:51] !ack 3569 [13:09:52] 3569 (ACKED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [13:09:58] (03CR) 10Muehlenhoff: [C: 03+2] Make Refinery deploys compatible with CVE-2022-24765 fix [puppet] - 10https://gerrit.wikimedia.org/r/912301 (owner: 10Muehlenhoff) [13:10:25] PROBLEM - Docker registry health on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service 
Unavailable - pattern not found - 235 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Docker [13:10:31] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2004.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 362 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Docker [13:10:41] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2003.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 362 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Docker [13:10:51] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 2 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:10:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:11:09] PROBLEM - Docker registry health on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [13:11:17] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:11:17] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:11:18] i see a lot of `13:07:24 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2023-05-02-103349-publish (ran as mwdeploy@kubernetes1017.eqiad.wmnet) returned [1]: Pulling 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2023-05-02-103349-publish'... Error response from daemon: received unexpected HTTP status: 503 Service Unavailable` in my scap [13:11:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 135, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:11:22] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:11:27] XioNoX: related to switch reboot? [13:11:39] urbanecm: yeah [13:11:55] everything codfw related in the next 10min [13:12:03] urbanecm: yeah, expected right now [13:12:50] ack, noted. thanks. bit unfortunate to do that during a backport window, but understood. should i just ignore those messages in scap? or wait & redo what i was deploying now once codfw is stable again? 
[13:13:30] (JobUnavailable) firing: (9) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:13:48] We can probably just run a scap pull on the appservers once XioNoX is done with the maintenance [13:13:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:13:50] urbanecm: I'd wait it out. [13:13:57] (ProbeDown) firing: (3) Service doc1002.eqiad.wmnet:443 has failed probes (http_doc1002_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:01] it should be ~15-20m [13:14:03] PROBLEM - Check systemd state on registry2004 is CRITICAL: CRITICAL - degraded: The following units failed: build-homepage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:04] Overruled :D [13:14:07] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:31] okay. leaving the already started sync in progress and i'll redo it once someone pings me. [13:14:31] claime: not really? we both suggested waiting for XioNoX to be done? [13:14:53] i think claime suggested continuing with deployment and then running scap pull on all codfw appservers?
[13:14:56] akosiaris: I was joking ;) [13:15:02] But yeah, what urbanecm said [13:15:49] But just ignoring the errors is an unnecessary risk [13:16:20] !log urbanecm@deploy1002 urbanecm and wmde-fisch: Backport for [[gerrit:914267|Enable Kartographer Nearby on mobile (T333137)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:16:23] On other news: The debug extension only gives me a blank menu o.O? [13:16:24] T333137: Test and enable mobile support for the nearby feature - https://phabricator.wikimedia.org/T333137 [13:16:42] hm. same for me. [13:16:42] !log urbanecm@deploy1002 Sync cancelled. [13:16:49] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [13:17:15] PROBLEM - ores on ores2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [13:17:17] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:17:30] (Emergency syslog message) resolved: Device asw-c-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:17:37] PROBLEM - ores on ores2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [13:17:41] I'll try on the ESR FF [13:17:44] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [13:17:49] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:17:51] RECOVERY - Router interfaces 
on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:58] (SystemdUnitFailed) resolved: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:03] WMDE-Fisch: fwiw, i use FF 112.0.2 [13:18:05] RECOVERY - Docker registry health on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [13:18:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:18:21] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:18:28] it works in Chrome 112.0.5615.138 [13:18:32] (ProbeDown) firing: (5) Service miscweb2003:443 has failed probes (http_annual_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:38] Strange I have the same in Chrome. 
[13:18:42] (JobUnavailable) firing: (2) Reduced availability for job jmx_kafka in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:45] I'll look into Chromium [13:18:47] RECOVERY - ores on ores2002 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [13:18:51] (ProbeDown) firing: (39) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:53] Same here on 112.0.2 [13:18:57] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:19:02] (ff) [13:19:05] (HttpdUnreachable) firing: (2) httpd unavailable for deployment mw-api-ext at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [13:19:05] RECOVERY - Docker registry health on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Docker [13:19:11] RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [13:19:11] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Docker [13:19:14] claime: same as 
in "blank debug extension"? [13:19:18] urbanecm: yup [13:19:21] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Docker [13:19:23] ack [13:19:35] it's back [13:19:36] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [13:19:53] weird ~_~ [13:19:59] urbanecm: Found a way to test it. [13:20:11] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:16] (JobUnavailable) firing: (109) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:20:21] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:20:28] (ProbeDown) firing: (26) Service contint2001:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:36] (ProbeDown) resolved: (57) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:39] (ProbeDown) firing: (99) Service 
centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:44] (03Merged) 10jenkins-bot: Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914286 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [13:20:52] (03Merged) 10jenkins-bot: Fix clearing wrong container when closing fullscreen map [extensions/Kartographer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914287 (https://phabricator.wikimedia.org/T335648) (owner: 10WMDE-Fisch) [13:20:58] and confirmed, debug extension came back in FF [13:21:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:21:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:21:23] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:21:25] anyway, waiting for a ping to resume deployment :) [13:21:26] same [13:21:31] urbanecm: switch maintenance finished [13:21:41] thanks, re-starting deployment then [13:21:45] cool [13:21:52] yay 
[13:21:55] ack [13:22:02] gg XioNoX :) [13:22:31] thanks everybody! It was smooth, a bit on the noisy side, but nothing user facing [13:22:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914267|Enable Kartographer Nearby on mobile (T333137)]], [[gerrit:914286|Fix clearing wrong container when closing fullscreen map (T335648)]], [[gerrit:914287|Fix clearing wrong container when closing fullscreen map (T335648)]] [13:22:44] T333137: Test and enable mobile support for the nearby feature - https://phabricator.wikimedia.org/T333137 [13:22:44] T335648: Fullscreen map is blank if opened twice - https://phabricator.wikimedia.org/T335648 [13:22:45] awesome. Thanks! [13:22:48] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:22:53] (ProbeDown) resolved: (99) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:54] (ProbeDown) resolved: (26) Service contint2001:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:00] nicely done XioNoX! 
[13:23:00] (JobUnavailable) firing: (109) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:09] (HttpdUnreachable) resolved: (2) httpd unavailable for deployment mw-api-ext at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [13:23:11] 10SRE, 10Observability-Metrics, 10observability: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220 (10lmata) [13:23:30] (virtual-chassis crash) resolved: Device asw-c-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST serviceaccounts) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:38] (03CR) 10ArielGlenn: [C: 03+2] make snapshot101[45] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/914323 (owner: 10Hokwelum) [13:23:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:24:03] !log urbanecm@deploy1002 wmde-fisch and urbanecm: Backport for [[gerrit:914267|Enable Kartographer Nearby on mobile (T333137)]], [[gerrit:914286|Fix clearing wrong container when closing fullscreen map (T335648)]], [[gerrit:914287|Fix clearing wrong container when closing fullscreen 
map (T335648)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:24:04] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica2005.wikimedia.org [13:24:15] WMDE-Fisch: can you test both changes at a debug server please? [13:24:36] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10klausman) [13:24:57] (03PS1) 10Ssingh: Revert "hiera: temporarily remove dns2001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/914295 [13:25:07] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:25:09] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:25:13] (03PS1) 10Ssingh: Revert "depool codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/914296 [13:25:16] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [13:25:19] RECOVERY - Bird Internet Routing Daemon on durum2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:25:26] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10MoritzMuehlenhoff) [13:25:28] urbanecm: Both work like a charm, go on please.
[13:25:34] syncing, thanks [13:25:38] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [13:25:45] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:25:53] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:28:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST serviceaccounts) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:29:28] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily remove dns2001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/914295 (owner: 10Ssingh) [13:29:59] PROBLEM - Check systemd state on snapshot1014 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:49] jouncebot: nowandnext [13:31:50] For the next 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1300) [13:31:50] For the next 0 hour(s) and 28 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1300) [13:31:50] In 0 hour(s) and 28 minute(s): LVS maintenance 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1400) [13:31:57] (03CR) 10Hnowlan: [C: 03+2] Add all fonts from mediawiki::packages::fonts (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/914285 (https://phabricator.wikimedia.org/T335681) (owner: 10Hnowlan) [13:32:13] (03PS3) 10Elukey: ml-services: add network policies for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414) [13:32:15] (03PS1) 10Elukey: _scaffold: fix typos for egress netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 [13:34:05] RECOVERY - Check systemd state on snapshot1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:11] (03CR) 10Milimetric: [C: 03+1] "Good to go" [puppet] - 10https://gerrit.wikimedia.org/r/914276 (owner: 10Btullis) [13:35:39] (03PS1) 10Klausman: homedirs: Add terminfo file for zterm-kitty to my homedir [puppet] - 10https://gerrit.wikimedia.org/r/914326 [13:36:14] (03PS2) 10Klausman: homedirs: Add terminfo file for xterm-kitty to my homedir [puppet] - 10https://gerrit.wikimedia.org/r/914326 [13:36:28] (03CR) 10Klausman: [C: 03+2] homedirs: Add terminfo file for xterm-kitty to my homedir [puppet] - 10https://gerrit.wikimedia.org/r/914326 (owner: 10Klausman) [13:36:37] (03Merged) 10jenkins-bot: Add all fonts from mediawiki::packages::fonts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/914285 (https://phabricator.wikimedia.org/T335681) (owner: 10Hnowlan) [13:36:55] (03CR) 10Klausman: [V: 03+2 C: 03+2] homedirs: Add terminfo file for xterm-kitty to my homedir [puppet] - 10https://gerrit.wikimedia.org/r/914326 (owner: 10Klausman) [13:37:21] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - 
https://phabricator.wikimedia.org/T313227 (10Clement_Goubert) a:03Clement_Goubert [13:37:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914267|Enable Kartographer Nearby on mobile (T333137)]], [[gerrit:914286|Fix clearing wrong container when closing fullscreen map (T335648)]], [[gerrit:914287|Fix clearing wrong container when closing fullscreen map (T335648)]] (duration: 14m 54s) [13:37:39] T333137: Test and enable mobile support for the nearby feature - https://phabricator.wikimedia.org/T333137 [13:37:39] T335648: Fullscreen map is blank if opened twice - https://phabricator.wikimedia.org/T335648 [13:37:46] WMDE-Fisch: should be all synced [13:37:57] 10SRE-OnFire, 10Incident Tooling, 10SRE Observability (FY2022/2023-Q4): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) [13:38:21] PROBLEM - mediawiki-installation DSH group on snapshot1014 is CRITICAL: Host snapshot1014 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:38:30] (JobUnavailable) resolved: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:15] urbanecm: Jep, works. Thanks again! 
[13:39:23] no worries [13:40:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:13] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:40:32] hmmm [13:42:47] (03PS1) 10ArielGlenn: allow dumpsdata hosts to see the new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/914328 [13:42:55] I'm going to ack the parsoid page [13:43:10] if you see snapshot host whines, please ignore, they are being fixed up now [13:44:19] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.28:443]) https://wikitech.wikimedia.org/wiki/PyBal [13:44:31] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10lmata) [13:45:20] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [13:45:34] akosiaris: all the parsoid hosts are depooled in codfw [13:45:37] Repooling [13:46:06] Same issue for api_appserver and probably other mw clusters [13:46:12] :-( [13:46:15] :-( [13:46:25] (03PS2) 10Elukey: _scaffold: fix typos for egress netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 [13:46:27] (03PS4) 10Elukey: ml-services: add network policies for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914319
(https://phabricator.wikimedia.org/T330414) [13:47:23] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver [13:47:31] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver [13:47:41] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid [13:47:53] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10Andrew) [13:48:05] All good [13:48:09] (03CR) 10Btullis: [V: 03+1 C: 03+2] Bump the mediawiki_history snapshot to include data for April [puppet] - 10https://gerrit.wikimedia.org/r/914276 (owner: 10Btullis) [13:48:43] Hmm what's that pyball alert [13:49:08] Ah it's parsoid [13:49:10] ofc. [13:49:35] parsoid pa.ge should recover soon, availability in dashboard goes up again [13:49:41] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:49:49] There pybal, you're all good now, shhh [13:50:03] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335775 (10phaultfinder) [13:50:05] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [13:50:07] (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:07] (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:13] don't let it see the rolled up 
newspaper you are hiding behind your back claime [13:50:25] godog: X) [13:50:31] (03CR) 10Andrew Bogott: Update a lot of mwopenstackclients uses to get creds from clouds.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913985 (owner: 10Andrew Bogott) [13:50:34] * claime backs away slowly [13:50:36] !incidents [13:50:36] 3570 (RESOLVED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw) [13:50:37] 3569 (RESOLVED) ProbeDown sre (10.2.1.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 codfw) [13:50:49] PROBLEM - mediawiki-installation DSH group on snapshot1015 is CRITICAL: Host snapshot1015 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:50:54] (03CR) 10Ssingh: [C: 03+2] Revert "depool codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/914296 (owner: 10Ssingh) [13:51:04] !log run authdns-update to repool codfw [13:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:21] 10ops-codfw, 10Traffic: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [13:55:12] 10ops-codfw, 10Traffic: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) p:05Triage→03Low [13:57:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:58:10] this is me: ^ [14:00:04] sukhe: That opportune time is upon us again. Time for a LVS maintenance deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1400). 
[14:01:31] 10SRE, 10Infrastructure-Foundations, 10netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (10ayounsi) [14:02:21] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [14:02:33] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ayounsi) [14:02:51] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) [14:04:32] (03PS1) 10Clément Goubert: ssl: Update api.svc, jobrunner.svc, and appservers.svc certs [puppet] - 10https://gerrit.wikimedia.org/r/914339 (https://phabricator.wikimedia.org/T313227) [14:05:05] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: install certificates before trying to use them [puppet] - 10https://gerrit.wikimedia.org/r/914340 (https://phabricator.wikimedia.org/T335052) [14:07:31] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [14:07:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:18] ^ expected [14:10:13] RECOVERY - Check systemd state on registry2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:27] (03PS3) 10Giuseppe Lavagetto: _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:10:40] (03PS1) 10Ssingh: lvs2007: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) [14:10:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:10:47] PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:10:51] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Clement_Goubert) 05Open→03In progress [14:10:59] PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:11:07] (03CR) 10CI reject: [V: 04-1] _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:12:34] (HelmReleaseBadStatus) resolved: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:13:05] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:13:06] (03PS1) 10Giuseppe Lavagetto: fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 [14:13:32] (03CR) 10CI reject: [V: 04-1] fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [14:15:23] (03PS1) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/914343 (https://phabricator.wikimedia.org/T335777) [14:15:53] (03PS4) 10Giuseppe Lavagetto: _scaffold: fix 
networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:15:55] (03PS2) 10Giuseppe Lavagetto: fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 [14:16:31] (03CR) 10CI reject: [V: 04-1] _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:16:33] (03CR) 10CI reject: [V: 04-1] fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [14:17:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Lacks the parsoid certs" [puppet] - 10https://gerrit.wikimedia.org/r/914339 (https://phabricator.wikimedia.org/T313227) (owner: 10Clément Goubert) [14:17:54] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:18:36] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) [14:18:43] (03CR) 10CI reject: [V: 04-1] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:19:28] <_joe_> something is broken in CI [14:19:41] <_joe_> hashar, jnuche [14:19:43] oh yeah, same errors, different CR :) [14:19:51] > This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset. 
[14:19:54] <_joe_> yep [14:20:12] <_joe_> I don't remember which part of the chain of hell is responsible for that error [14:21:01] (03PS1) 10Michael Große: wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914297 (https://phabricator.wikimedia.org/T300460) [14:21:43] (03PS2) 10Clément Goubert: ssl: Update api,jobrunner,appservers,parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/914339 (https://phabricator.wikimedia.org/T313227) [14:22:09] (03CR) 10Clément Goubert: "added parsoid certs" [puppet] - 10https://gerrit.wikimedia.org/r/914339 (https://phabricator.wikimedia.org/T313227) (owner: 10Clément Goubert) [14:22:23] gotta defer to hashar for that one, doesn't look familiar [14:22:39] (03PS2) 10Michael Große: wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914297 (https://phabricator.wikimedia.org/T300460) [14:22:44] (03CR) 10JMeybohm: [V: 03+1] prometheus::k8s switch staging-codfw to client cert auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:22:51] <_joe_> !log restarted zuul on contint2001 [14:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:30] <_joe_> !log also on contint1002, the current ci master [14:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:16] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:24:39] (03CR) 10Filippo Giunchedi: prometheus::k8s switch staging-codfw to client cert auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:24:47] (03CR) 10JMeybohm: [C: 03+1] Remove more stretch images [puppet] - 10https://gerrit.wikimedia.org/r/914282 
(https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [14:24:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "I didn't test it but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:25:13] (03CR) 10Ssingh: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:25:33] _joe_: <3 [14:26:00] (03PS1) 10Hnowlan: thumbor: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/914346 (https://phabricator.wikimedia.org/T335681) [14:26:21] (03PS1) 10Michael Große: Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) [14:27:34] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [14:27:58] !log sync prometheus3001 -> prometheus3002 [14:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:34] (03CR) 10Dzahn: "I don't understand your comment since I neither changed a single thing on the exisitng server nor hardcoded anything." 
[puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [14:29:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] ssl: Update api,jobrunner,appservers,parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/914339 (https://phabricator.wikimedia.org/T313227) (owner: 10Clément Goubert) [14:29:28] (03PS5) 10Giuseppe Lavagetto: _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:29:30] (03PS3) 10Giuseppe Lavagetto: fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 [14:29:40] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: sync [14:29:42] (03CR) 10Dzahn: "What I did was test your own change on the new hardware and made it _less_ hardcoded." [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [14:29:54] (03CR) 10CI reject: [V: 04-1] _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [14:30:10] (03CR) 10CI reject: [V: 04-1] fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [14:30:54] (03CR) 10JMeybohm: [V: 03+1] prometheus::k8s switch staging-codfw to client cert auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:30:59] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add label to prometheus3002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [14:31:28] (03CR) 10Dzahn: "so.. on new host it is already resolved and on old host there is no point in changing it anymore.. 
therefore I don't see what would be neg" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [14:31:30] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add label to prometheus3002 data blocks to prevent data duplication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [14:31:48] (03CR) 10Clément Goubert: [C: 03+2] ssl: Update api,jobrunner,appservers,parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/914339 (https://phabricator.wikimedia.org/T313227) (owner: 10Clément Goubert) [14:33:24] !log Merging new internal certs for api, jobrunner, appservers, parsoid - T313227 [14:33:25] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync [14:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:27] T313227: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 [14:34:50] (03CR) 10Hokwelum: [C: 03+1] "Looks good! 
Thank you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/914328 (owner: 10ArielGlenn) [14:35:23] (03CR) 10ArielGlenn: [C: 03+2] allow dumpsdata hosts to see the new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/914328 (owner: 10ArielGlenn) [14:36:43] PROBLEM - Check that envoy is running on parse2020 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:38:09] (03PS1) 10Addshore: admin: Remove self from some wmde groups & fix email [puppet] - 10https://gerrit.wikimedia.org/r/914348 [14:38:11] PROBLEM - Check that envoy is running on parse1005 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:38:16] (03CR) 10Addshore: [C: 03+1] admin: Remove self from some wmde groups & fix email [puppet] - 10https://gerrit.wikimedia.org/r/914348 (owner: 10Addshore) [14:38:30] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [14:38:53] PROBLEM - Check that envoy is running on parse2004 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:38:59] claime: ^ [14:39:03] PROBLEM - Check systemd state on parse2020 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:09] Crap [14:39:12] this is going to cause an outage [14:39:19] PROBLEM - Check systemd state on parse1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:24] (03PS2) 10Andrea Denisse: prometheus: Failover DNS from prometheus3001 to prometheus3002 in esams [dns] - 
10https://gerrit.wikimedia.org/r/913192 (https://phabricator.wikimedia.org/T309979) [14:39:24] Revert both commits then ? [14:39:31] Failed to load private key from /etc/ssl/private/parsoid.svc.eqiad.wmnet.key [14:39:33] (03PS1) 10Muehlenhoff: Add python-all to make pybal buildable on build2001 [puppet] - 10https://gerrit.wikimedia.org/r/914349 [14:39:34] that's the error [14:39:37] Hmmm [14:39:38] if you have it handy [14:39:41] just commit it immediately [14:39:43] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/914346 (https://phabricator.wikimedia.org/T335681) (owner: 10Hnowlan) [14:39:51] gonna pause puppet [14:39:54] (03CR) 10Herron: [C: 03+1] "Thanks good catch. What I'm seeing are sli values being computed as 0.0-1, but being displayed with unit 0-100%. I think the bug is the " [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [14:40:08] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [14:40:21] PROBLEM - Check that envoy is running on parse1013 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:40:26] !log emergency disabling of puppet on parse hosts [14:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:50] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Failover DNS from prometheus3001 to prometheus3002 in esams [dns] - 10https://gerrit.wikimedia.org/r/913192 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [14:40:53] !log installing intel-microcode security updates on bullseye servers [14:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:03] PROBLEM - Check systemd state on parse2004 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:16] akosiaris: I don't get it, the key is in the private repo [14:41:29] PROBLEM - Check systemd state on parse1013 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:57] PROBLEM - Check that envoy is running on parse1022 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:42:01] PROBLEM - Check systemd state on parse1022 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:15] PROBLEM - Check that envoy is running on parse2017 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:42:22] claime: routines:OPENSSL_internal:KEY_VALUES_MISMATCH [14:42:27] wrong key ? 
[14:42:41] PROBLEM - Check that envoy is running on parse1023 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:42:49] PROBLEM - Check that envoy is running on parse2006 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:42:51] PROBLEM - Check that envoy is running on parse2018 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:43:09] PROBLEM - Check systemd state on parse2017 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:27] akosiaris: yeah [14:43:33] copying and committing [14:43:33] PROBLEM - Check that envoy is running on parse1019 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:43:42] ok [14:43:54] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/914350 [14:43:56] (03CR) 10Muehlenhoff: [C: 03+2] Remove more stretch images [puppet] - 10https://gerrit.wikimedia.org/r/914282 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [14:44:43] PROBLEM - Check systemd state on parse1019 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:57] akosiaris: done, checking the public key [14:45:07] PROBLEM - Check systemd state on parse2018 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:09] PROBLEM - Check systemd state on parse2006 is CRITICAL: CRITICAL - degraded: The following units failed: 
envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:13] PROBLEM - Check systemd state on parse1023 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:22] (03Merged) 10jenkins-bot: thumbor: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/914346 (https://phabricator.wikimedia.org/T335681) (owner: 10Hnowlan) [14:45:28] I guess parois codfw key too [14:45:31] parsoid* [14:46:07] yep [14:47:14] And the discovery key too [14:47:34] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/913916 (owner: 10L10n-bot) [14:48:05] PROBLEM - Check no envoy runtime configuration is left persistent on parse2020 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:48:40] akosiaris: checking on parse1023 [14:49:17] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/914350 (owner: 10Muehlenhoff) [14:49:27] PROBLEM - Check no envoy runtime configuration is left persistent on parse1023 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:49:27] PROBLEM - Check no envoy runtime configuration is left persistent on parse2004 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:49:59] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:01] Failed to load private key from /etc/ssl/private/parsoid.svc.eqiad.wmnet.key, Cause: error:09000068:PEM 
routines:OPENSSL_internal:BAD_PASSWORD_READ [14:50:03] New error [14:50:11] I guess the pubkey is bad too ? [14:50:29] PROBLEM - Check no envoy runtime configuration is left persistent on parse2017 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:50:59] (03PS6) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782) [14:51:08] (03CR) 10Vgutierrez: "hmmm I'm not 100% sure our puppetization is happy without a high-traffic1 LVS in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:52:49] (03PS1) 10Alexandros Kosiaris: Add machinetranslation service RRs [dns] - 10https://gerrit.wikimedia.org/r/914351 [14:52:58] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:53:05] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:53:55] PROBLEM - Check no envoy runtime configuration is left persistent on parse1005 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:54:50] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:55:31] (03Abandoned) 10Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [14:55:56] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:56:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:50] (03CR) 10Ssingh: lvs2007: decommission host for codfw hardware refresh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) 
[14:56:52] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/914341/40985/lvs2010.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:57:10] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Jelto) [14:57:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) Configured bios and Set password for device mgmt ip address is 10.64.40.217/26 [14:57:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) [14:58:38] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:59:06] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:59:30] (03CR) 10Dzahn: "once gerrit1001 is decom'ed the temp hiera key is to be removed, so nothing to worry about here" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [14:59:46] !log jiji@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049 [14:59:56] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [15:00:04] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049... 
[15:00:04] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:00:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:35] akosiaris: I'm confused now [15:01:04] (03CR) 10Muehlenhoff: "I'll take care of merging this, currently in the process of updating the NDA records with the WMF Legal department." [puppet] - 10https://gerrit.wikimedia.org/r/914348 (owner: 10Addshore) [15:01:07] claime? [15:01:10] akosiaris: I used the openssl command for passworded keys https://wikitech.wikimedia.org/wiki/Cergen#Cheatsheet [15:01:21] Committed, and I'm still erroring out [15:01:39] PROBLEM - Check no envoy runtime configuration is left persistent on parse2018 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:01:40] * akosiaris looking [15:02:07] PROBLEM - Check systemd state on parse2014 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:21] PROBLEM - Check that envoy is running on parse2014 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:02:21] PROBLEM - ircecho bot process on irc2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [15:02:27] !log enable puppet on parse1005 [15:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:49] claime: I didn't see any updates being brought in by puppet [15:03:54] no new file that is [15:03:58] wth [15:04:44] re-running just in case, but I'll be surprised if I see something [15:04:51] !log enabling puppet on parse2013 [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] !log enabling 
puppet on parse2014 [15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:11] <_joe_> uh wth happened? [15:05:11] PROBLEM - puppet last run on irc2002 is CRITICAL: CRITICAL: Puppet has been disabled for 605652 seconds, message: jmm testing, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:05:21] _joe_: Blunder with the cert regen [15:05:32] I didn't see the parsoid certs were password protected [15:05:41] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:05:45] (03PS6) 10Elukey: _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 [15:05:47] (03PS4) 10Elukey: fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [15:05:48] <_joe_> ok, is parsoid down right now? [15:05:58] <_joe_> elukey: please I'm fixing stuff still [15:06:15] (03CR) 10CI reject: [V: 04-1] _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [15:06:17] _joe_: a.kosiaris disabled puppet, so I think not [15:06:17] (03CR) 10CI reject: [V: 04-1] fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [15:06:38] <_joe_> elukey: let me handle the remaining issues [15:06:41] PROBLEM - Check no envoy runtime configuration is left persistent on parse2014 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:07:02] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) 05Open→03Resolved a:03ayounsi Upgrade went fine! Thanks everybody. 
[15:07:19] PROBLEM - Check no envoy runtime configuration is left persistent on parse1022 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:07:33] akosiaris: Basically my last commit was the result of the openssl ec command, but apparently that generated the same keys that I originally committed? [15:07:47] claime: it was a noop anyway I think [15:07:55] PROBLEM - Check no envoy runtime configuration is left persistent on parse2006 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:07:57] -r--r----- 1 root envoy 227 Aug 29 2022 /etc/ssl/private/parsoid.svc.eqiad.wmnet.key [15:07:58] ^ [15:08:05] hasn't changed since Aug 29 [15:08:16] that's on parse1005 [15:08:43] <_joe_> do you need me to take a look? [15:08:52] _joe_: we got 2 ppl on it, no need [15:09:18] akosiaris: Yeah it's the same as what I committed [15:09:26] which I don't quite understand [15:09:42] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:10:04] (03PS7) 10Elukey: _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 [15:10:06] (03PS5) 10Elukey: fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [15:10:36] (03CR) 10CI reject: [V: 04-1] _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [15:10:37] PROBLEM - Check no envoy runtime configuration is left persistent on parse1013 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:10:38] (03CR) 10CI reject: [V: 04-1] fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto)
[15:10:38] _joe_ ah sorry didn't see [15:10:58] <_joe_> there's an error I can't seem to find [15:11:10] claime... wait, why would the private key change? [15:11:35] only the cert needed to change, not the private key [15:12:24] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:12:36] akosiaris: I got confused when you told me: [15:12:39] 14:42 claime: routines:OPENSSL_internal:KEY_VALUES_MISMATCH [15:12:56] yeah, that's one of the clearer errors of openssl [15:12:58] So I copied the keys and committed [15:13:06] it's pretty clear that keys are mismatched [15:13:14] I am not sure why though [15:13:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:37] Except they were encrypted, so I used the openssl ec to restore them to the right key [15:13:42] So in the end they didn't change [15:14:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:21] PROBLEM - Check no envoy runtime configuration is left persistent on parse1019 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:55] What I don't understand is I did copy over the parsoid.svc.eqiad.wmnet/parsoid.svc.eqiad.wmnet.crt.pem file to puppet in profile/files/sslarsoid.svc.eqiad.wmnet.crt [15:15:03] sigh, rolling back my deploy to fix ^ [15:15:32] (03CR) 10Vgutierrez: [C: 03+1] lvs2007: decommission host for codfw hardware refresh [puppet] -
10https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [15:15:56] akosiaris: Ok they're wrong, for some reason. [15:16:01] Copying them over again [15:16:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: codfw row C switches upgrade - T334049 [15:16:53] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [15:17:06] 10SRE, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049... [15:17:23] (03PS5) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [15:17:35] claime: so, I checked that the pubkey the cert has and the pubkey corresponding to the private key file indeed differ [15:17:42] akosiaris: Yeah [15:17:51] akosiaris: I think I mispasted the pubkeys [15:17:54] they do ofc ;-) [15:17:55] (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:17:56] I'm checking them all atm [15:18:28] Both parsoid.svc.{eqiad.codfw}.wmnet.crt were wrong [15:18:39] I'm checking the jobrunners ones too [15:19:00] oh, so mispasting of the certs, not the private keys? 
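[editorial note, not part of the channel log] The check described at 15:17:35, comparing the public key embedded in the certificate with the public key derived from the private key file, might be sketched as follows. This is only an illustration under assumptions: the two PEM blocks are taken to have been extracted beforehand (for example with `openssl x509 -in CERT -noout -pubkey` and `openssl pkey -in KEY -pubout`), and the helper names `pem_public_key_der` and `same_public_key` are hypothetical, not tooling used in the log.

```python
import base64
import re

# Matches one "PUBLIC KEY" PEM block and captures its base64 body.
PEM_RE = re.compile(
    r"-----BEGIN PUBLIC KEY-----(.*?)-----END PUBLIC KEY-----", re.DOTALL
)


def pem_public_key_der(pem_text: str) -> bytes:
    """Return the DER bytes of the first PUBLIC KEY block in pem_text."""
    match = PEM_RE.search(pem_text)
    if match is None:
        raise ValueError("no PUBLIC KEY block found")
    # Remove line breaks and spaces so differences in wrapping do not matter.
    body = "".join(match.group(1).split())
    return base64.b64decode(body, validate=True)


def same_public_key(cert_pubkey_pem: str, key_pubkey_pem: str) -> bool:
    """True when both PEM blocks encode byte-identical public keys."""
    return pem_public_key_der(cert_pubkey_pem) == pem_public_key_der(key_pubkey_pem)
```

Comparing the decoded bytes rather than the raw text avoids false mismatches caused by whitespace or line wrapping introduced while copy/pasting.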
[15:19:06] yes [15:19:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:20:00] jobrunner is good [15:20:44] (03PS1) 10Clément Goubert: ssl: Fix parsoid.svc.{codfw,eqiad} pubkeys [puppet] - 10https://gerrit.wikimedia.org/r/914357 (https://phabricator.wikimedia.org/T313227) [15:22:39] akosiaris: Root cause is: I'm dumb and pasted the same pubkey for discovery, svc.codfw and svc.eqiad [15:22:55] ok [15:23:25] we obviously have some work to do to make these things less easy to happen [15:23:49] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:23:57] The copy/pasting is not foolproof yeah [15:24:50] akosiaris: Can you check the fix? 
https://gerrit.wikimedia.org/r/914357 [15:26:44] Next time I do a cert update I'll disable puppet on the targets and test one first [15:26:57] So we don't end up with *gestures around* [15:28:54] (03PS6) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [15:29:22] (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:29:24] (03PS8) 10Giuseppe Lavagetto: _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [15:29:26] (03PS6) 10Giuseppe Lavagetto: fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 [15:32:03] checking pubkey [15:32:06] ack [15:32:33] RECOVERY - PyBal backends health check on lvs2007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:32:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:32:39] RECOVERY - pybal on lvs2007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:32:51] claime: I still get different pubkeys... [15:33:01] akosiaris: between? 
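[editorial note, not part of the channel log] The root cause given at 15:22:39 (the same pubkey pasted for discovery, svc.codfw and svc.eqiad) is the kind of mistake a pre-commit check could flag. A minimal sketch, assuming each service name is expected to carry distinct certificate content; the function name `find_duplicate_contents` and the sample names below are hypothetical and do not reflect the actual puppet repository layout.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_duplicate_contents(files: Dict[str, str]) -> List[List[str]]:
    """Group names whose contents are identical, ignoring surrounding whitespace."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content.strip().encode("utf-8")).hexdigest()
        by_digest[digest].append(name)
    # Only groups with more than one name indicate a suspicious duplicate.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample: three names accidentally share one pasted block.
sample = {
    "discovery": "PEM-BLOCK-A\n",
    "svc.codfw": "PEM-BLOCK-A",
    "svc.eqiad": "PEM-BLOCK-A\n\n",
    "jobrunner": "PEM-BLOCK-B\n",
}
duplicates = find_duplicate_contents(sample)
```

A check like this only catches identical pastes; it does not replace comparing each certificate against its private key.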
[15:33:11] ah wait [15:33:15] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2022 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:33:17] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2022 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:33:21] now I am the idiot that copy pastes wrongly [15:33:30] (SystemdUnitFailed) firing: (9) wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:33] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:33:37] (03PS1) 10Ottomata: flink - upgrade to flink 1.17.0, python 3.9, Debian Bullsye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/914358 (https://phabricator.wikimedia.org/T335408) [15:33:48] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:33:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] ssl: Fix parsoid.svc.{codfw,eqiad} pubkeys [puppet] - 10https://gerrit.wikimedia.org/r/914357 (https://phabricator.wikimedia.org/T313227) (owner: 10Clément Goubert) [15:34:00] claime: ok, full match [15:34:03] proceed [15:34:04] (03CR) 10Clément Goubert: [C: 03+2] ssl: Fix parsoid.svc.{codfw,eqiad} pubkeys [puppet] - 10https://gerrit.wikimedia.org/r/914357 (https://phabricator.wikimedia.org/T313227) (owner: 10Clément Goubert) [15:34:06] nice [15:34:07] PROBLEM - Check systemd state on wdqs2022 is CRITICAL: CRITICAL - degraded: The following units failed: 
load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categor [15:34:07] ice https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:09] (03CR) 10Elukey: [C: 03+1] "Tested ./create_new_service.sh manually and now I can see the networkpolicy.yaml template file!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [15:34:13] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:34:23] PROBLEM - Query Service HTTP Port on wdqs2022 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:34:25] PROBLEM - WDQS SPARQL on wdqs2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:34:49] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [15:34:50] (03CR) 10Elukey: [C: 03+1] fastapi-app: add networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914342 (owner: 10Giuseppe Lavagetto) [15:34:52] (03Abandoned) 10AOkoth: Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [15:34:54] (03PS1) 10Andrea Denisse: prometheus: Add UID/GID mappings support for promethus data sync [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) [15:35:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:03] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [15:35:05] akosiaris: testing on parse1023 and parse2014 [15:35:33] RECOVERY - Check that envoy is running on parse2014 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:35:53] yay [15:35:59] RECOVERY - Check systemd state on parse1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:13] !log Re-running puppet on failed parse servers - T313227 [15:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:16] T313227: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - 
https://phabricator.wikimedia.org/T313227 [15:36:29] RECOVERY - Check systemd state on parse1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:39] RECOVERY - Check that envoy is running on parse1023 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:36:49] RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:36:53] RECOVERY - Check systemd state on parse2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:57] RECOVERY - Check that envoy is running on parse1005 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:37:13] RECOVERY - Check no envoy runtime configuration is left persistent on parse2014 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:38:37] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:38:49] RECOVERY - Check that envoy is running on parse2020 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:38:49] RECOVERY - Check systemd state on parse2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:03] Brace for recovery flood [15:39:09] RECOVERY - Check systemd state on parse2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:11] RECOVERY - Check systemd state on parse2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:23] RECOVERY - Check that 
envoy is running on parse2004 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:39:30] (03PS7) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [15:39:33] RECOVERY - Check systemd state on parse2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:33] RECOVERY - Check that envoy is running on parse2017 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:39:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:57] RECOVERY - Check systemd state on parse2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:07] RECOVERY - Check that envoy is running on parse2006 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:40:09] RECOVERY - Check that envoy is running on parse2018 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:40:19] RECOVERY - Check systemd state on parse1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:19] RECOVERY - Check systemd state on parse1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:47] RECOVERY - Check that envoy is running on parse1022 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:40:47] RECOVERY - Check that envoy is running on parse1019 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:40:49] 
RECOVERY - Check that envoy is running on parse1013 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:40:53] RECOVERY - Check systemd state on parse1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:17] RECOVERY - Check no envoy runtime configuration is left persistent on parse1013 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:41:39] (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:45:23] RECOVERY - Check no envoy runtime configuration is left persistent on parse1022 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:23] RECOVERY - Check no envoy runtime configuration is left persistent on parse1019 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:23] RECOVERY - Check no envoy runtime configuration is left persistent on parse1005 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:23] RECOVERY - Check no envoy runtime configuration is left persistent on parse1023 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:23] RECOVERY - Check no envoy runtime configuration is left persistent on parse2006 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:24] RECOVERY - Check no 
envoy runtime configuration is left persistent on parse2004 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:24] RECOVERY - Check no envoy runtime configuration is left persistent on parse2017 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:25] RECOVERY - Check no envoy runtime configuration is left persistent on parse2020 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:25] RECOVERY - Check no envoy runtime configuration is left persistent on parse2018 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:45:28] (03CR) 10Filippo Giunchedi: "See inline, idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:45:41] akosiaris: I think we're all good [15:46:15] 👍 [15:46:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [15:46:36] I'm gonna step away from the keyboard before I cause any more chaos [15:47:24] <3 <3 <3 claime [15:48:12] (03PS2) 10Andrea Denisse: prometheus: Add UID/GID mappings support for promethus data sync [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) [15:48:38] (03CR) 10CI reject: [V: 04-1] prometheus: Add UID/GID mappings support for promethus data sync [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:50:11] (03CR) 10Andrea Denisse: prometheus: Add UID/GID mappings support for promethus data sync (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:51:38] (03CR) 10Andrea Denisse: prometheus: Add UID/GID mappings support for promethus data sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:52:06] (03PS3) 10Andrea Denisse: prometheus: Add UID/GID mappings support for promethus data sync [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) [15:53:26] (03Merged) 10jenkins-bot: _scaffold: fix networkpolicy names [deployment-charts] - 10https://gerrit.wikimedia.org/r/914325 (owner: 10Elukey) [15:55:52] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add UID/GID mappings support for promethus data sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [15:57:17] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add UID/GID mappings support for promethus data sync [puppet] - 10https://gerrit.wikimedia.org/r/914359 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [16:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1600). [16:00:06] No Gerrit patches in the queue for this window AFAICS. [16:00:19] (03PS8) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [16:02:34] (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:03:08] (03CR) 10Ssingh: [C: 03+1] "And thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/914349 (owner: 10Muehlenhoff) [16:06:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [16:07:37] (03CR) 10Jcrespo: [C: 03+2] Fix bug by which %% wasn't adequately escaped on sql queries [software/mediabackups] - 10https://gerrit.wikimedia.org/r/914321 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [16:08:01] (03PS1) 10Alexandros Kosiaris: machinetranslation: Enable ingress in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/914365 (https://phabricator.wikimedia.org/T331505) [16:08:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:10:40] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [16:11:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [16:12:03] !log ns1: delete routing-options static route 208.80.153.231/32 next-hop 208.80.153.111, set to 208.80.153.77 [16:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:36] (03PS1) 10Dzahn: add discovery records for miscweb in eqiad and miscweb in codfw [dns] - 10https://gerrit.wikimedia.org/r/914369 [16:15:41] (03PS9) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [16:16:00] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@bb96aca]: Add snappy dependency for kafka daemons [16:16:10] !log ns0 backup routes: delete routing-options static route 208.80.154.238/32 next-hop 208.80.153.111, set to 208.80.153.77 [16:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:27] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@bb96aca]: Add snappy dependency for kafka daemons (duration: 00m 26s) [16:24:33] !log 
btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [16:28:10] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add label to prometheus4002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912383 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [16:28:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [16:30:26] (03PS2) 10Andrea Denisse: prometheus: Failover DNS from prometheus4001 to prometheus4002 in ulsfo [dns] - 10https://gerrit.wikimedia.org/r/913194 (https://phabricator.wikimedia.org/T309979) [16:32:20] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Failover DNS from prometheus4001 to prometheus4002 in ulsfo [dns] - 10https://gerrit.wikimedia.org/r/913194 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [16:38:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[16:45:21] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1001, ...), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:47:18] (03PS10) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [16:58:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [16:58:24] (03PS2) 10Dzahn: add discovery records for miscweb in eqiad and miscweb in codfw [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1700) [17:00:28] (03CR) 10Herron: [C: 03+2] kafkamon: transition to firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [17:03:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [17:06:56] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (10JayCano) I'm @JKieserman's new manager and I approve this as well [17:11:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:16:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:04] !log ns0 
set routing-options static route 208.80.154.238/32 next-hop [ 208.80.154.10 208.80.155.108 208.80.154.134 ]: T330670 [17:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:08] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [17:26:38] (03PS1) 10Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/914371 [17:28:51] !log ns1 set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.77 208.80.153.111 208.80.153.10 ]: T330670 [17:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:35] !log cr*-codfw: delete backup routes for ns1: delete routing-options static route 208.80.154.238/32: T330670 [17:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:39] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [17:33:53] !log [correction] cr*-codfw: delete backup routes for ns0: delete routing-options static route 208.80.154.238/32: T330670 [17:34:05] !log [correction] cr*-codfw: delete backup routes for ns0: delete routing-options static route 208.80.154.238/32: T330670 [17:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:40] (03CR) 10Ottomata: [C: 03+2] flink - upgrade to flink 1.17.0, python 3.9, Debian Bullsye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/914358 (https://phabricator.wikimedia.org/T335408) (owner: 10Ottomata) [17:36:16] !log cr*-eqiad: delete backup routes for ns0: delete routing-options static route 208.80.153.231/32: T330670 [17:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:33] (03PS11) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 
(https://phabricator.wikimedia.org/T325232) [17:38:40] (CR) Btullis: Jupyterhub-conda exclude /mnt from accessible paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene) [17:39:00] (CR) CI reject: [V: -1] [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn) [17:39:23] !log milimetric@deploy1002 Started deploy [analytics/refinery@c42021f]: Regular analytics weekly train [analytics/refinery@c42021f] [17:39:25] SRE, Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ssingh) Summarizing for posterity the current state: On `cr*-eqiad`: ` /* ns0 */ route 208.80.154.238/32 { next-hop [ 208.... [17:40:08] SRE, Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]. - https://phabricator.wikimedia.org/T330670 (ssingh) [17:42:45] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2022 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:44:20] SRE, Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]. - https://phabricator.wikimedia.org/T330670 (ssingh) Open→Resolved a:ssingh As per the last comment, we have moved over authdns[12]001 to dns[12]00[123] and marking this as resolved. [17:45:06] SRE, Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123].
- https://phabricator.wikimedia.org/T330670 (ssingh) [17:45:50] !log milimetric@deploy1002 Finished deploy [analytics/refinery@c42021f]: Regular analytics weekly train [analytics/refinery@c42021f] (duration: 06m 26s) [17:45:59] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:47:53] (PS12) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [17:50:11] (PS1) Urbanecm: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) [17:50:22] ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder) [17:50:55] (CR) CI reject: [V: -1] [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [17:50:59] (PS1) Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) [17:51:58] (CR) CI reject: [V: -1] [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [17:52:00] (PS2) Urbanecm: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) [17:52:11] (PS2) Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) [17:59:43] (CR) Ottomata: [V: +2 C: +2] flink - upgrade to flink 1.17.0, python 3.9, Debian Bullsye [docker-images/production-images] - https://gerrit.wikimedia.org/r/914358
(https://phabricator.wikimedia.org/T335408) (owner: Ottomata) [18:00:05] brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T1800). [18:00:59] o/ [18:01:53] !log train 1.41.0-wmf.7 (T330213): no current blockers, rolling to group1 with `scap train` [18:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:57] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:02:14] today's soundtrack: https://www.youtube.com/watch?v=LTrnuI8E6Jo&list=OLAK5uy_nKBASxxUs_eC8bH76NCFTyHDOcyLyWe1s&index=4 [18:03:14] !log train 1.41.0-wmf.7 (T330213): (correction: group0 today) [18:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:43] (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/914376 (https://phabricator.wikimedia.org/T330213) [18:03:49] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/914376 (https://phabricator.wikimedia.org/T330213) (owner: TrainBranchBot) [18:04:44] (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/914376 (https://phabricator.wikimedia.org/T330213) (owner: TrainBranchBot) [18:11:45] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.7 refs T330213 [18:11:49] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:14:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:18:30] (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:19:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:34] !log milimetric@deploy1002 Started deploy [analytics/refinery@c42021f] (thin): Regular analytics weekly train THIN [analytics/refinery@c42021f] [18:19:41] !log milimetric@deploy1002 Finished deploy [analytics/refinery@c42021f] (thin): Regular analytics weekly train THIN [analytics/refinery@c42021f] (duration: 00m 07s) [18:24:04] (CR) Ladsgroup: [C: +1] "https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/2591/console looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: Daniel Kinzler) [18:29:35] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (Ladsgroup) Yeah, emphasizing on what host the operator is about to reimage sounds better to me. Maybe we can...
[18:33:49] (PS17) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [18:41:50] (Abandoned) Herron: icinga_exporter: run service on both active and standby hosts [puppet] - https://gerrit.wikimedia.org/r/905244 (https://phabricator.wikimedia.org/T333838) (owner: Herron) [18:42:55] (PS1) Bking: wdqs: add wdqs2022 to conftool [puppet] - https://gerrit.wikimedia.org/r/914381 (https://phabricator.wikimedia.org/T331300) [18:45:08] (PS1) Papaul: Add new codfw lvs nodes to site.pp [puppet] - https://gerrit.wikimedia.org/r/914382 (https://phabricator.wikimedia.org/T326767) [18:45:14] (CR) Cathal Mooney: "Thanks for the review :) Updated now as per feedback, and also to deal with another issue in testing related to SLAAC IPs." [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney) [18:45:43] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/914381 (https://phabricator.wikimedia.org/T331300) (owner: Bking) [18:45:55] (CR) Bking: [C: +2] wdqs: add wdqs2022 to conftool [puppet] - https://gerrit.wikimedia.org/r/914381 (https://phabricator.wikimedia.org/T331300) (owner: Bking) [18:47:22] (CR) Papaul: [C: +2] Add new codfw lvs nodes to site.pp [puppet] - https://gerrit.wikimedia.org/r/914382 (https://phabricator.wikimedia.org/T326767) (owner: Papaul) [18:48:14] (PS18) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [18:49:46] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Papaul) [18:50:10] !log bking@cumin1001 conftool
action : set/pooled=inactive; selector: name=wdqs2022.codfw.wmnet [18:53:43] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:44] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: (no justification provided) [18:56:03] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: (no justification provided) (duration: 00m 19s) [18:56:50] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: (no justification provided) [18:56:53] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: (no justification provided) (duration: 00m 03s) [18:59:49] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Papaul) I pasted in the ticket with Dell the same error we are seeing and here is what Dell is telling me: ` Denial Notes Troubleshooting/System Failure in... [19:00:53] (PS1) Bking: wdqs: Add wdqs2022 as scap target [puppet] - https://gerrit.wikimedia.org/r/914384 (https://phabricator.wikimedia.org/T331300) [19:01:32] (CR) Ryan Kemper: [C: +1] wdqs: Add wdqs2022 as scap target [puppet] - https://gerrit.wikimedia.org/r/914384 (https://phabricator.wikimedia.org/T331300) (owner: Bking) [19:02:09] (CR) Bking: [C: +2] wdqs: Add wdqs2022 as scap target [puppet] - https://gerrit.wikimedia.org/r/914384 (https://phabricator.wikimedia.org/T331300) (owner: Bking) [19:03:01] (CR) Andrew Bogott: [C: +2] Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/913985 (owner: Andrew Bogott) [19:03:03] (PS1) Jdlrobson: Router handling code should be centralized into mmv.bootstrap [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914301 (https://phabricator.wikimedia.org/T236591) [19:04:15] !log bking@cumin1001 START - Cookbook
sre.hosts.downtime for 14 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:04:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:05:31] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: (no justification provided) [19:12:44] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: (no justification provided) (duration: 07m 13s) [19:12:48] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: (no justification provided) [19:13:04] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: (no justification provided) (duration: 00m 16s) [19:18:16] !log bking@deploy1002 Started deploy [wdqs/wdqs@0e051d8]: (no justification provided) [19:18:21] !log bking@deploy1002 Finished deploy [wdqs/wdqs@0e051d8]: (no justification provided) (duration: 00m 05s) [19:19:10] (PS1) Ottomata: page_content_change- bump image to v0.14.0 [deployment-charts] - https://gerrit.wikimedia.org/r/914389 (https://phabricator.wikimedia.org/T335408) [19:20:11] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (wiki_willy) a:Cmjohnson→Jclark-ctr [19:20:59] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (wiki_willy) @Jclark-ctr - can you take a peak at this one to see if it's pending on anything from our side?
Thanks, Willy [19:21:43] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (wiki_willy) a:Papaul [19:21:43] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:22:46] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs) [19:23:36] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs) [19:24:47] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (wiki_willy) a:Cmjohnson→Jclark-ctr [19:25:52] (CR) Ottomata: [C: +2] page_content_change- bump image to v0.14.0 [deployment-charts] - https://gerrit.wikimedia.org/r/914389 (https://phabricator.wikimedia.org/T335408) (owner: Ottomata) [19:27:55] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:28:25] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2022 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:28:34] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (nskaggs) @Jclark-ctr These should be setup with software RAID just like last time. See @Andrew comment: https://phabricator.wikimedia.org/T294972#80...
[19:28:34] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:28:38] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:29:55] RECOVERY - WDQS SPARQL on wdqs2022 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.255 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:30:45] RECOVERY - Query Service HTTP Port on wdqs2022 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [19:33:35] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:19] (PS2) Alexandros Kosiaris: machinetranslation: Support ingress in chart [deployment-charts] - https://gerrit.wikimedia.org/r/914365 (https://phabricator.wikimedia.org/T331505) [19:40:15] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [19:40:28] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [19:47:35] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Support ingress in chart [deployment-charts] - https://gerrit.wikimedia.org/r/914365 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris) [19:53:06] (Merged) jenkins-bot: machinetranslation: Support ingress in chart [deployment-charts] - https://gerrit.wikimedia.org/r/914365 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris) [19:53:15] PROBLEM - puppet last run on wdqs2012 is CRITICAL: CRITICAL: Puppet has been disabled for 605056 seconds, message: T331300 - bking, last run
7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:58:10] (PS1) Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) [19:58:19] SRE, PyBal, Traffic-Icebox: PyBal ProxyFetch failure when talking to Envoy in SNI-only mode - https://phabricator.wikimedia.org/T253527 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development... [19:58:53] RECOVERY - puppet last run on wdqs2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:59:28] SRE, PyBal, Traffic-Icebox: PyBal healthchecks should specify User-Agent instead of using "Twisted PageGetter" - https://phabricator.wikimedia.org/T246431 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to... [19:59:43] SRE, PyBal, Traffic-Icebox: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 (BCornwall) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230502T2000). [20:00:05] SRE, PyBal, Traffic-Icebox: pybal fails to reconnect cleanly to etcd when etcd is restarted - https://phabricator.wikimedia.org/T240665 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development... [20:00:05] Iniquity: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] i can deploy today [20:00:26] but I don't see Iniquity [20:01:00] SRE, Traffic-Icebox: PyBal ProxyFetch checks using HTTP/1.0 with https and HTTP/1.1 with plain http - https://phabricator.wikimedia.org/T232319 (BCornwall) Open→Resolved a:BCornwall [20:01:24] SRE, Prod-Kubernetes, PyBal, Traffic-Icebox: Pybal support of configuration from the kubernetes API - https://phabricator.wikimedia.org/T192437 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze... [20:02:09] Iniquity: hi! [20:02:11] ready for the deployment? [20:02:18] Hi:)  Yes [20:02:24] let's do it! [20:02:25] what I need to do? [20:02:53] Iniquity: Do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions installed? [20:03:00] yep [20:03:06] SRE, PyBal, Traffic-Icebox: Some etcd connections not established at startup - https://phabricator.wikimedia.org/T188087 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial... [20:03:12] in that case, please wait for a while -- I'll ping you once it can be tested [20:03:17] (PS4) Urbanecm: Switch on creating Babel categories in Russian Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/911799 (https://phabricator.wikimedia.org/T335136) (owner: Iniquity) [20:03:18] SRE, PyBal, Traffic-Icebox, Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze developme...
[20:03:34] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/911799 (https://phabricator.wikimedia.org/T335136) (owner: Iniquity) [20:03:35] ok, np :) [20:03:57] SRE, PyBal, Traffic-Icebox: PyBal Feature: progressive depooling strategy for monitored failures - https://phabricator.wikimedia.org/T172124 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze develop... [20:03:59] SRE, PyBal, Traffic-Icebox: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103 (BCornwall) [20:04:23] (Merged) jenkins-bot: Switch on creating Babel categories in Russian Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/911799 (https://phabricator.wikimedia.org/T335136) (owner: Iniquity) [20:04:27] (PS1) Alexandros Kosiaris: machinetranslation: Fix strategy bug [deployment-charts] - https://gerrit.wikimedia.org/r/914394 [20:04:55] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:911799|Switch on creating Babel categories in Russian Wiktionary (T335136)]] [20:04:58] T335136: Switch on creating Babel categories in Russian Wiktionary - https://phabricator.wikimedia.org/T335136 [20:05:47] SRE, PyBal, Traffic-Icebox: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial changes...
[20:06:01] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T335722 (wiki_willy) a:Papaul [20:06:24] !log urbanecm@deploy1002 urbanecm and iniquity: Backport for [[gerrit:911799|Switch on creating Babel categories in Russian Wiktionary (T335136)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:06:36] Iniquity: your patch can now be tested at all debug servers. can you try test it please? [20:07:14] I honestly don't quite understand how to do it :) [20:08:33] Iniquity: finding a page with babel in it, and purging/null edit should do the trick [20:08:40] you should see the categories added t oit [20:08:43] *to it [20:09:19] SRE, Traffic-Icebox, Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (BCornwall) Open→Resolved a:BCornwall [20:10:00] sec [20:10:05] SRE, PyBal, Traffic-Icebox: Fully-redundant LVS clusters using Pybal per-service MED feature - https://phabricator.wikimedia.org/T165764 (BCornwall) Open→Resolved a:BCornwall [20:10:27] The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. [20:10:53] which page you're trying and which debug server please? [20:11:03] https://ru.wiktionary.org/wiki/%D0%A3%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA:Iniquity [20:11:13] debug1001 [20:11:36] i tried nulleditign via debug1001, and it seems to work [20:11:43] can you try a different page please? [20:11:55] SRE, PyBal, Traffic-Icebox: Run IPVS in a separate network namespace - https://phabricator.wikimedia.org/T114979 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial changes...
[20:12:09] (in any case, I'm comfortable syncing, as it appears to work, but of course happy to give you space to finish testing) [20:12:30] SRE, PyBal, Traffic-Icebox: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze d... [20:13:03] SRE, PyBal, Traffic-Icebox: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of no... [20:14:20] Iniquity: how is it going? ^^ [20:14:43] SRE, PyBal, Traffic-Icebox: PyBal Feature: progressive depooling strategy for monitored failures - https://phabricator.wikimedia.org/T172124 (BCornwall) [20:14:45] SRE, PyBal, Traffic-Icebox: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103 (BCornwall) [20:14:48] it seems to work, but it doesn't let me purge [20:14:54] SRE, PyBal, Traffic-Icebox: Make PyBal respect advertised BGP capabilities - https://phabricator.wikimedia.org/T81305 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial cha... [20:14:57] SRE, PyBal, Traffic-Icebox: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial ch...
[20:15:09] interesting [20:15:09] SRE, PyBal, Traffic-Icebox: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial chan... [20:15:13] let's sync then [20:15:19] syncing [20:15:37] SRE, PyBal, Traffic-Icebox: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372 (BCornwall) Open→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffi... [20:16:19] lets do it, I think everything will be fine:)  but in the future I wonder why I have an error [20:16:28] SRE, Traffic-Icebox, conftool, serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (BCornwall) Stalled→Resolved [20:16:29] Iniquity: can you send me a screenshot of it? [20:16:39] SRE-tools, Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (Papaul) The first 51 servers on the list are R430 since we can not do any for those we are left with 209 servers out of 260. [20:16:41] i tried purging, and it works [20:16:46] might be a temporary error [20:17:13] https://imgur.com/a/Uddw9mZ [20:17:45] https://imgur.com/a/gAPGsw0 [20:17:50] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Fix strategy bug [deployment-charts] - https://gerrit.wikimedia.org/r/914394 (owner: Alexandros Kosiaris) [20:18:10] Iniquity: how does debug extension configuration look like? do you have "Read-only" checked by any chance? [20:18:40] mmm [20:18:41] yes [20:18:47] can you uncheck that part? [20:18:53] yes, it works [20:18:53] it locks the database [20:18:55] sorry [20:18:55] great!
[20:18:56] xD [20:19:00] no worries [20:19:07] SRE, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite) [20:20:01] bot works fine too [20:20:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:911799|Switch on creating Babel categories in Russian Wiktionary (T335136)]] (duration: 15m 47s) [20:20:46] T335136: Switch on creating Babel categories in Russian Wiktionary - https://phabricator.wikimedia.org/T335136 [20:21:18] Iniquity: should be deployed! [20:21:22] anything else i can help with? [20:21:40] No! thank you so much :) [20:21:48] happy to help! [20:22:32] (CR) BryanDavis: [C: +1] "Awesome work Giuseppe!" [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto) [20:23:21] SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (JKieserman) L3 was signed!
[20:24:48] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@daf8c32]: bump mjolnir to v2.3.0 [20:24:56] (Merged) jenkins-bot: machinetranslation: Fix strategy bug [deployment-charts] - https://gerrit.wikimedia.org/r/914394 (owner: Alexandros Kosiaris) [20:25:16] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@daf8c32]: bump mjolnir to v2.3.0 (duration: 00m 28s) [20:28:15] (PS4) Urbanecm: [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) [20:28:20] (CR) Urbanecm: [C: +2] [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [20:28:29] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [20:28:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [20:29:00] (Merged) jenkins-bot: [Growth] Finish Personalized praise variable rename [mediawiki-config] - https://gerrit.wikimedia.org/r/908367 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [20:29:30] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:908367|[Growth] Finish Personalized praise variable rename (T334630)]] [20:29:33] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [20:30:21] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [20:30:25] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [20:33:30] (PS1)
BCornwall: doc: Remove extra preceding space in intro example [software/spicerack] - https://gerrit.wikimedia.org/r/914398 [20:36:25] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:908367|[Growth] Finish Personalized praise variable rename (T334630)]] (duration: 06m 55s) [20:36:29] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [20:43:27] (PS1) Andrea Denisse: prometheus: Synchronize only the /srv/prometheus folder instead of the entire /srv directory [puppet] - https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) [20:45:37] (CR) CI reject: [V: -1] prometheus: Synchronize only the /srv/prometheus folder instead of the entire /srv directory [puppet] - https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [20:47:37] (PS1) JHathaway: ssh: clamp lifetime_remaining_seconds to a value JRuby can accept [puppet] - https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) [20:47:44] (PS2) Andrea Denisse: prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) [20:48:15] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) (owner: JHathaway) [20:48:20] (CR) CI reject: [V: -1] ssh: clamp lifetime_remaining_seconds to a value JRuby can accept [puppet] - https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) (owner: JHathaway) [20:49:57] (PS2) JHathaway: ssh: clamp lifetime_remaining_seconds to a value JRuby can accept [puppet] - https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) [20:50:52] (PS3) JHathaway: ssh: clamp lifetime_remaining_seconds to a value JRuby can
accept [puppet] - 10https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) [20:51:23] (03PS4) 10JHathaway: ssh: clamp lifetime_remaining_seconds to a value JRuby can accept [puppet] - 10https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) [21:01:51] (03PS1) 10JHathaway: puppet: use a string rather than a symbol to call a puppet function [puppet] - 10https://gerrit.wikimedia.org/r/914406 [21:02:17] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/914406 (owner: 10JHathaway) [21:08:54] (03PS1) 10JHathaway: puppet7: re-add host core [puppet] - 10https://gerrit.wikimedia.org/r/914408 [21:09:17] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/914408 (owner: 10JHathaway) [21:19:05] (03PS1) 10Volans: doc: fix search in documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/914409 [21:23:26] (03CR) 10Volans: [C: 03+2] "self-merging to make search work on doc.w.o" [software/spicerack] - 10https://gerrit.wikimedia.org/r/914409 (owner: 10Volans) [21:34:50] (03Merged) 10jenkins-bot: doc: fix search in documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/914409 (owner: 10Volans) [21:35:38] (03PS2) 10Volans: doc: Remove extra preceding space in intro example [software/spicerack] - 10https://gerrit.wikimedia.org/r/914398 (owner: 10BCornwall) [21:35:50] (03CR) 10Volans: [C: 03+2] "Thanks for the fix!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/914398 (owner: 10BCornwall) [21:39:49] (03Merged) 10jenkins-bot: doc: Remove extra preceding space in intro example [software/spicerack] - 10https://gerrit.wikimedia.org/r/914398 (owner: 10BCornwall) [21:55:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [22:18:30] (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:44:16] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914412 (https://phabricator.wikimedia.org/T330759) [22:46:00] (03PS1) 10Urbanecm: EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914302 (https://phabricator.wikimedia.org/T330337) [22:46:06] (03PS1) 10Urbanecm: ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914303 (https://phabricator.wikimedia.org/T330337) [22:46:14] (03PS1) 10Urbanecm: EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914304 (https://phabricator.wikimedia.org/T330337) [22:46:19] (03PS1) 10Urbanecm: ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) [22:47:58] (03PS1) 10Urbanecm: Personalized praise: Let mentors to skip suggestions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914426 (https://phabricator.wikimedia.org/T334300) [23:01:42] (03PS3) 10Urbanecm: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) [23:01:50] (03CR) 10CI reject: [V: 04-1] [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [23:02:33] (03PS4) 10Urbanecm: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) [23:03:28] (03PS3) 10Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) [23:04:02] (03PS2) 10Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) [23:04:58] (03PS3) 10Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) [23:16:17] (03PS1) 10Andrew Bogott: Remove unused labstore code [puppet] - 10https://gerrit.wikimedia.org/r/914415 [23:16:42] (03PS1) 10Andrew Bogott: wmcs-webproxy: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914416 (https://phabricator.wikimedia.org/T330759) [23:16:44] (03PS1) 10Andrew Bogott: wmcs-wikireplica-dns: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914417 (https://phabricator.wikimedia.org/T330759) [23:16:46] (03PS1) 10Andrew Bogott: wmcs-enc-cli: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914418 (https://phabricator.wikimedia.org/T330759) [23:16:50] (03PS1) 10Andrew Bogott: wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759) [23:18:49] (03CR) 10Andrew Bogott: "no-op on clouddumps: 
https://puppet-compiler.wmflabs.org/output/914415/40997/" [puppet] - 10https://gerrit.wikimedia.org/r/914415 (owner: 10Andrew Bogott) [23:19:36] (03CR) 10CI reject: [V: 04-1] wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [23:23:32] (03PS1) 10Andrew Bogott: Added missing profile::toolforge::disable_tool::disable_tool_db_password [labs/private] - 10https://gerrit.wikimedia.org/r/914421 [23:23:55] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added missing profile::toolforge::disable_tool::disable_tool_db_password [labs/private] - 10https://gerrit.wikimedia.org/r/914421 (owner: 10Andrew Bogott) [23:26:19] (03PS1) 10Andrew Bogott: Add fake profile::wmcs::services::toolsdb_replica_cnf::htpassword [labs/private] - 10https://gerrit.wikimedia.org/r/914422 [23:26:45] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add fake profile::wmcs::services::toolsdb_replica_cnf::htpassword [labs/private] - 10https://gerrit.wikimedia.org/r/914422 (owner: 10Andrew Bogott) [23:28:20] (03PS1) 10Andrew Bogott: Added fake profile::wmcs::services::toolsdb_replica_cnf::htpassword_salt [labs/private] - 10https://gerrit.wikimedia.org/r/914423 [23:28:33] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added fake profile::wmcs::services::toolsdb_replica_cnf::htpassword_salt [labs/private] - 10https://gerrit.wikimedia.org/r/914423 (owner: 10Andrew Bogott) [23:30:10] (03CR) 10Andrew Bogott: "no-op on toolforge nfs server https://puppet-compiler.wmflabs.org/output/914415/41003/" [puppet] - 10https://gerrit.wikimedia.org/r/914415 (owner: 10Andrew Bogott) [23:31:49] (03PS2) 10Andrew Bogott: wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759) [23:34:57] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:39:35] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 54387 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops