[00:00:07] (03PS2) 10Dzahn: ci::firewall: allow http monitoring from prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) [00:06:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/890920/39774/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [00:06:09] (03CR) 10Dzahn: [C: 03+2] ci::firewall: allow http monitoring from prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [00:13:39] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [00:20:30] (03PS1) 10Dzahn: ci: set port 1443 for https monitoring [puppet] - 10https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972) [00:21:55] (03PS2) 10Dzahn: ci: set port 1443 for https monitoring [puppet] - 10https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972) [00:23:39] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/890921/39775/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [00:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:34:02] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "the "connection refused" errors in logstash stopped after puppet ran on prometheus hosts as well" [puppet] - 10https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [00:37:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:58:35] 10ops-eqiad: PDU
sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [01:10:46] (03Abandoned) 10Jforrester: Reduce height of the article toolbar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890140 (https://phabricator.wikimedia.org/T316950) (owner: 10Sushrith Bogi) [01:28:40] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [01:29:58] (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:31:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bullseye [01:32:02] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye [01:33:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [01:35:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [01:44:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:52:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:52:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2186.codfw.wmnet with OS bullseye [01:53:05] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye completed: - db2186 (**PASS**) - Dow... [02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:16] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:33:44] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 56 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:36:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:37:34] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) [02:38:01] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) 05Open→03Resolved complete [02:39:34] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 22 probes of 797 (alerts on 35) - 
https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:43:39] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [02:47:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [02:52:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [03:27:13] (03PS1) 10KartikMistry: Content Translation: Set MT threshold to 45% for Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) [04:07:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:12:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:31:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:57:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:02:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:29:58] (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:18:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [06:36:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T0700) [07:14:15] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [07:14:22] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ayounsi) 05Open→03Resolved a:03ayounsi [07:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [07:35:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10ayounsi) `asw-b-codfw> show virtual-chassis vc-port statistics extensive | match "FPC|Port|CRC alignment errors" fpc2: Port: vcp-255/0/48 CRC alignment errors: 5642 ` shows that there are indeed errors ` > show vir... [07:38:41] (03CR) 10Jelto: [C: 03+1] "lgtm, I'd leave it to Arnold to deploy this" [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [07:39:30] (03CR) 10Jelto: [C: 03+1] "lgtm, I like the naming" [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [07:43:31] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) From T311999#8594766 I see that there is progress! Yay! @jbond / @MoritzMuehlenhoff In Juniper's form the only information requested when selecting OIDC is `ID token (Ope... [07:51:21] (03CR) 10Jcrespo: "Blocked on the review & adaptation/deploy of 868392, a patch from *December*."
[puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [07:52:59] (03CR) 10Nikerabbit: [C: 03+1] Content Translation: Set MT threshold to 45% for Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) (owner: 10KartikMistry) [07:54:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/890891 (owner: 10Slyngshede) [08:00:06] Amir1 and Urbanecm: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T0800). [08:00:06] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:51] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P44724 and previous config saved to /var/cache/conftool/dbconfig/20230222-080050-jynus.json [08:02:03] OK. I'm here and will go ahead with deployment.. [08:04:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) (owner: 10KartikMistry) [08:05:40] (03Merged) 10jenkins-bot: Content Translation: Set MT threshold to 45% for Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) (owner: 10KartikMistry) [08:09:40] RoanKattouw: seems commit 7dd89d8f05b4b78626206db74ff7e9506c5d6608 is merged, but not deployed? Shows `There were unexpected commits pulled from origin for /srv/mediawiki-staging.` [08:11:43] I looked into bug (T315621) and changes, seems OK to go ahead.
[08:11:43] T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621 [08:12:14] !log kartik@deploy1002 Started scap: Backport for [[gerrit:890947|Content Translation: Set MT threshold to 45% for Kurdish WP (T324941)]] [08:12:19] T324941: Make the Machine translation stricter by 45% in Kurdish Wikipedia - https://phabricator.wikimedia.org/T324941 [08:14:10] !log kartik@deploy1002 kartik: Backport for [[gerrit:890947|Content Translation: Set MT threshold to 45% for Kurdish WP (T324941)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:17:55] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1004.eqiad.wmnet with OS bullseye [08:18:09] (03PS1) 10Slyngshede: Minor fixes found while setting up production env. [software/bitu] - 10https://gerrit.wikimedia.org/r/891227 [08:22:56] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:890947|Content Translation: Set MT threshold to 45% for Kurdish WP (T324941)]] (duration: 10m 41s) [08:23:00] T324941: Make the Machine translation stricter by 45% in Kurdish Wikipedia - https://phabricator.wikimedia.org/T324941 [08:27:47] (03CR) 10Volans: "reply to previous question" [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [08:31:01] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment missing comma in Ferm rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890807 (owner: 10Slyngshede) [08:36:17] !log [WDQS] Repooled `wdqs20[05,07,10]` [08:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:47] ryankemper: o/ wdqs2010 should not be repooled, it does not have a valid journal [08:38:28] hosts >= 2009 are not ready to serve user traffic [08:43:11] (03CR) 10Jcrespo: [C: 04-2] "See (potentially) related incident https://phabricator.wikimedia.org/T330258 before proceeding with this. 
The -2 is to mark the important " [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [08:43:42] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1004.eqiad.wmnet with reason: host reimage [08:47:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:47:08] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1004.eqiad.wmnet with reason: host reimage [08:49:05] !log rolling upgrade to HAProxy 2.6.9 in codfw, eqsin, drmrs, esams and eqiad [08:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] (03CR) 10Jelto: [C: 03+1] "looks good to me and useful in combination with after stanza. I have some concerns that some of the 300+ timers we have rely on the old be" [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond) [08:50:28] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [08:50:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:52:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:00:04] hashar and dduvall: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T0900) [09:00:05] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891232 (https://phabricator.wikimedia.org/T325587) [09:00:07] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891232 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [09:00:44] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891232 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [09:03:01] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1004.eqiad.wmnet with OS bullseye [09:07:57] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.24 refs T325587 [09:08:01] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587 [09:08:26] dcausse: ah yes, thanks. fortunately looks like the pool command failed on 2010 anyway [09:09:37] * dcausse loves when a command knows when to fail :) [09:13:36] (03CR) 10Ladsgroup: mariadb: Update grants to use wikiuser@10.% only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [09:14:36] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.24 refs T325587 (duration: 06m 38s) [09:14:40] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587 [09:18:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/891227 (owner: 10Slyngshede) [09:19:15] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Minor fixes found while setting up production env. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/891227 (owner: 10Slyngshede) [09:20:18] (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:44] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [09:20:44] what's that? [09:20:57] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print [09:20:57] page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton [09:20:57] PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 503 Server Error: Backend fetch failed for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [09:21:02] hashar: this is probably related to your deploy [09:21:03] phab down fwiw [09:21:10] Morning! 
Since yesterday wikidata is getting a ton of these "[FIRING:1] DatasourceNoData (kK0KSCJ4z "AlertManager","cxserver" Wikidata kK0KSCJ4z Edits: below 30 per minute (for 3 minutes) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Alerts critical wikidata)" [09:21:22] elukey: this? [09:21:25] checking too [09:21:33] hmmm [09:21:35] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Backend fetch failed https://wikitech.wikimedia.org/wiki/Debmonitor [09:21:43] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [09:21:45] proton, http appservers [09:21:51] varnish [09:21:56] hashar: yeah we got paged for appservers down, may be something else but it smells weird as coincidence [09:21:59] any ideas if the data being fed into alerting suddenly changed a load around 1708 yesterday? [09:22:09] edits at 0 [09:22:10] (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:13] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:22:14] at least there is nothing showing up on the mw error dashboards [09:22:19] godog: major outage? [09:22:23] (oh I'll revise phab down to wikis down) [09:22:25] jynus: looks like it [09:22:29] hashar: okok let's keep in sync thanks [09:23:06] can people confirm no wiki access?
[09:23:20] jynus: yes, here, en.wiki [09:23:20] here if needed [09:23:21] R/O is working here via drmrs [09:23:45] (er, UK, whichever DC that is if helpful) [09:23:46] and I can successfully log-in and browse en.wp.o [09:23:46] +1 for it.wikipedia.org [09:23:52] R/O working from fr, drmrs also [09:23:53] It is app servers [09:23:57] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [09:24:01] (Wikidata Reliability Metrics - Median Payload alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [09:24:06] (JobUnavailable) firing: Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:24:17] TheresNoTime: yes no access or yes access? [09:24:21] Phabricator/Grafana also not accessible for me [09:24:29] phab is working here via drmrs as well [09:24:30] jynus: no access [09:24:36] I am going to rollback to rule out the train [09:24:44] creating an incident [09:25:01] jynus: I've linked the doc [09:25:02] eqiad appservers rpm down 50% [09:25:05] in -security [09:25:15] phabricator is offline too, though, so train is probably not related?
I cannot access phabricator or en.wikipedia.org (from Germany) [09:25:18] (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:25:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:38] https://www.wikimediastatus.net/ [09:26:57] (ProbeDown) firing: (11) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:32] :o [09:28:00] Good I checked here. I was looking something up about a music artist and there's an error :P [09:28:16] it's very minor but the chan topic still says that the status is up [09:28:19] hah [09:28:37] it also says to check the official location for reporting issues 0:-) [09:28:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [09:28:45] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:28:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:29:08] doing the rollback, scap failed to rebuild the images due to Docker receiving a `HTTP status: 503
Backend fetch failed`, filed as https://phabricator.wikimedia.org/T330264 . Might be unrelated to the ongoing issue [09:29:23] PROBLEM - Wikitech and wt-static content in sync on cloudweb1004 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech https://wikitech.wikimedia.org/wiki/Wikitech-static [09:29:41] Docker failed to fetch from the docker-registry, so I am wondering whether we might have a network related issue of some sort [09:29:57] hashar: please come on -sec [09:29:57] perhaps to do with https://sal.toolforge.org/log/4y9PeIYBtR_B8fLx3spz? I can't consult wikitech to see how we use HAProxy :| [09:30:13] (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:30:21] vgutierrez: maybe https://sal.toolforge.org/log/4y9PeIYBtR_B8fLx3spz? [09:30:27] kostajh: https://wikitech-static.wikimedia.org/wiki/Main_Page [09:30:28] kostajh: https://wikitech-static.wikimedia.org/wiki/Main_Page might work ;) [09:30:34] (heh) [09:30:38] Amir1: not related AFAIK [09:30:39] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.40.0-wmf.23" - T325587 [09:31:29] (train rolled back) [09:31:51] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [09:34:38] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=esams,service=ats-be,cluster=cache_text [09:34:45] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=ats-be,cluster=cache_text [09:35:23] RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (1961 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [09:35:27] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [09:35:56] phab back, en.wiki back [09:35:59] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1094 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [09:36:02] Hello, sr.wikipedia and Zuul aren't working. Is there some maintenance or? [09:36:10] Kizule: just fixed [09:36:16] try again please [09:36:18] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1005.eqiad.wmnet with OS bullseye [09:36:19] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:19] Amazing, it works. 
[09:37:14] nice [09:37:18] (ProbeDown) resolved: (11) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:46] I will let things settle and promote group1 again [09:39:00] (Wikidata Reliability Metrics - Median Payload alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [09:39:04] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [09:39:09] (JobUnavailable) resolved: (2) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:40:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:40:18] (ProbeDown) resolved: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:44] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [09:41:43] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [09:42:37] (03PS1) 10Hashar: Revert "group1 wikis to 1.40.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891237 
(https://phabricator.wikimedia.org/T325587) [09:42:39] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.40.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891237 (https://phabricator.wikimedia.org/T325587) (owner: 10Hashar) [09:43:13] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.40.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891237 (https://phabricator.wikimedia.org/T325587) (owner: 10Hashar) [09:43:30] (03CR) 10Slyngshede: [C: 03+2] idm.wikimedia.org CNAME to idm1001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/890891 (owner: 10Slyngshede) [09:43:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [09:43:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:44:22] (03CR) 10Muehlenhoff: [C: 03+2] Fix condition for including haveged [puppet] - 10https://gerrit.wikimedia.org/r/890816 (owner: 10Muehlenhoff) [09:45:02] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891238 (https://phabricator.wikimedia.org/T325587) [09:45:04] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891238 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) [09:45:24] the issue has been figured out and is unrelated to MediaWiki deployment so I am proceeding again [09:45:40] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891238 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot) 
[09:46:23] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2017.codfw.wmnet with OS bullseye [09:47:52] someone update wikimediastatus.net please [09:48:03] doing [09:48:35] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1005.eqiad.wmnet with reason: host reimage [09:51:23] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1005.eqiad.wmnet with reason: host reimage [09:52:30] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (10MoritzMuehlenhoff) Per the bug that should be fixed in the auditd package in Bullseye, we'll be able to confirm when we reimage the doh* servers to Bullseye. [09:52:55] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.24 refs T325587 [09:52:58] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587 [09:54:47] (03CR) 10Jcrespo: mariadb: Update grants to use wikiuser@10.% only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [09:56:06] (03CR) 10Jcrespo: mariadb: Update grants to use wikiuser@10.% only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [09:57:16] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/891241 [09:59:28] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.24 refs T325587 (duration: 06m 33s) [09:59:32] T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587 [10:01:40] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage [10:01:48] RECOVERY - Check if active EventStreams endpoint is delivering messages. 
on alert1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [10:02:15] (03PS1) 10Slyngshede: P:idm move OIDC endpoint to variable. [puppet] - 10https://gerrit.wikimedia.org/r/891242 [10:04:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2018.codfw.wmnet with OS bullseye [10:04:30] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2019.codfw.wmnet with OS bullseye [10:04:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage [10:04:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2020.codfw.wmnet with OS bullseye [10:05:04] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2021.codfw.wmnet with OS bullseye [10:05:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39776/console" [puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede) [10:07:05] (03PS2) 10Slyngshede: P:idm move OIDC endpoint to variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/891242 [10:07:47] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [10:08:21] !log Starting sre.switchdc.mediawiki live test preparation steps [10:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:45] (JobUnavailable) firing: Reduced availability for job swagger_check_eventstreams_internal_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [10:12:49] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/891241 (owner: 10Muehlenhoff) [10:13:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [10:14:29] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10SLyngshede-WMF) @ayounsi doesn't it need an URL as well, for the endpoint? [10:14:58] (KubernetesCalicoDown) firing: (5) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:16:06] (03PS3) 10Slyngshede: P:idm move OIDC endpoint to variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/891242 [10:16:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:17:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39778/console" [puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede) [10:17:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [10:18:47] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2018.codfw.wmnet with reason: host reimage [10:18:59] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage [10:19:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage [10:19:58] (KubernetesCalicoDown) firing: (5) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:20:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [10:20:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage [10:21:08] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
kubernetes2017.codfw.wmnet with OS bullseye [10:21:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2018.codfw.wmnet with reason: host reimage [10:21:47] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) For the record: some doc on {F36864730} as well as https://jnprprod.devportal-aw-us.webmethods.io/portal/apis [10:21:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:22:08] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) 05Open→03In progress [10:22:19] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1005.eqiad.wmnet with OS bullseye [10:24:30] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) >>! In T306238#8636356, @SLyngshede-WMF wrote: > @ayounsi doesn't it need an URL as well, for the endpoint? I guess they will give it to us later on in the onboarding proc... 
[10:24:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage [10:24:58] (KubernetesCalicoDown) resolved: (5) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:25:07] (03CR) 10Nicolas Fraison: [C: 03+2] fix(presto): fix typo from node.enviroment to node.environment [puppet] - 10https://gerrit.wikimedia.org/r/889807 (owner: 10Nicolas Fraison) [10:26:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage [10:27:50] (03PS1) 10Jbond: idp: Add juniper OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/891245 [10:28:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede) [10:28:35] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm move OIDC endpoint to variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede) [10:28:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage [10:28:52] (03CR) 10Jbond: [C: 03+2] idp: Add juniper OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/891245 (owner: 10Jbond) [10:30:53] 10SRE, 10Traffic, 10observability: varnish-frontend-fetcherr sets incorrect level in logstash - https://phabricator.wikimedia.org/T330267 (10Vgutierrez) [10:33:09] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) [10:33:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [10:33:36] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [10:33:44] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [10:35:00] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [10:35:28] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [10:36:11] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) [10:36:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:34] 10SRE, 10Data-Persistence, 10serviceops, 
10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 10:35 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl 10:35 <+logmsgbot> !log cgoubert@cu... [10:38:03] (03PS1) 10Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - 10https://gerrit.wikimedia.org/r/891248 [10:38:53] (03PS1) 10Slyngshede: P:IDM Fix url for OIDC endpoint [puppet] - 10https://gerrit.wikimedia.org/r/891249 [10:39:26] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39779/console" [puppet] - 10https://gerrit.wikimedia.org/r/891248 (owner: 10Nicolas Fraison) [10:39:34] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2018.codfw.wmnet with OS bullseye [10:39:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39780/console" [puppet] - 10https://gerrit.wikimedia.org/r/891249 (owner: 10Slyngshede) [10:40:58] (03CR) 10Jelto: [C: 03+1] Update DNS to switch gitlab-replica (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney) [10:40:59] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2020.codfw.wmnet with OS bullseye [10:41:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/891249 (owner: 10Slyngshede) [10:41:52] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Fix url for OIDC endpoint [puppet] - 10https://gerrit.wikimedia.org/r/891249 (owner: 10Slyngshede) [10:42:36] (03PS2) 10Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - 10https://gerrit.wikimedia.org/r/891248 [10:43:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:45:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2021.codfw.wmnet with OS bullseye [10:45:11] (03PS3) 10Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - 10https://gerrit.wikimedia.org/r/891248 [10:45:58] (03PS2) 10Vgutierrez: varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [10:46:04] (03PS4) 10Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - 10https://gerrit.wikimedia.org/r/891248 [10:46:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:12] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39783/console" [puppet] - 10https://gerrit.wikimedia.org/r/891248 (owner: 10Nicolas Fraison) [10:47:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2019.codfw.wmnet with OS bullseye [10:48:21] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) Skipping `00-optional-warmup-caches` as the node script is broken and [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/890299 |... 
[10:48:44] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] presto: remove - in the cluster name used in node.environment [puppet] - 10https://gerrit.wikimedia.org/r/891248 (owner: 10Nicolas Fraison) [10:48:46] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10Vgutierrez) I think so, I've took the liberty of amending the commit and adding a test for the new header as well [10:49:12] (03CR) 10Vgutierrez: [C: 03+1] "tests seem to be happy:" [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [10:56:42] (03PS1) 10AikoChou: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) [10:59:22] (03CR) 10Elukey: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [11:00:04] claime: OwO what's this, a deployment window?? MediaWiki infrastucture (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1100). nyaa~ [11:00:24] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [11:00:35] jynus, godog, heads up, starting live-test [11:01:01] ok [11:01:04] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [11:01:06] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [11:01:31] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 11:01 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet 11:01 <+logmsgbot> !log cgouber... [11:01:41] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [11:01:49] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [11:02:06] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 11:01 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks 11:01 <+logmsgbot>... [11:02:26] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [11:02:38] (03PS1) 10Slyngshede: P:IDM Strip / in OIDC url [puppet] - 10https://gerrit.wikimedia.org/r/891254 [11:02:42] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [11:03:08] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 11:02 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance 11:02 <+logmsgbot> !log cgoub... 
[11:03:19] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [11:03:19] !log cgoubert@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2023-02-22 11:03:19.149671 [11:03:21] (03CR) 10Slyngshede: [C: 03+2] P:IDM Strip / in OIDC url [puppet] - 10https://gerrit.wikimedia.org/r/891254 (owner: 10Slyngshede) [11:03:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [11:03:41] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [11:04:16] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=99) [11:04:18] [2/3, retrying in 9.00s] Attempt to run 'spicerack.mysql_legacy.MysqlLegacy._check_core_master_in_sync' raised: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1 [11:04:26] Amir1 ? [11:04:58] (03PS2) 10Slyngshede: Switch to built in LogoutView. [software/bitu] - 10https://gerrit.wikimedia.org/r/883113 [11:05:03] at meeting but it doesn't look problematic [11:05:10] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 11:03 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly 11:03 <+logmsgbot> !log cgoubert@... [11:05:20] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` spicerack.mysql_legacy.MysqlLegacyError: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1 ` [11:05:26] claime: https://orchestrator.wikimedia.org/web/cluster/alias/s1 db1118 is the master [11:05:43] and should have the heartbeat in heartbeat db (heartbeat table) [11:05:54] maybe it is trying to get events from the codfw master, and they won't reach unless done for real? 
[11:06:04] (03CR) 10Slyngshede: [V: 03+2] Switch to built in LogoutView. [software/bitu] - 10https://gerrit.wikimedia.org/r/883113 (owner: 10Slyngshede) [11:06:07] possible [11:06:13] or circular replication is setup [11:06:47] is it failing only there, or just happens to be the first checked? [11:07:32] No the others seem good I think [11:08:15] Hmm idk, it's not outputing which servers it's checking at each step [11:08:19] lemme check debug [11:09:05] my guess is it may work, but needs circular replication, with will be setup only a few days before the switchover [11:09:14] it looks good for the other sections [11:09:40] in any case, in an emergency is not a big deal, it should happen anyway [11:10:02] if the other sections work, then there may be a grant issue or something else [11:10:10] Hmm it looks like the same issue with x2 possibly [11:10:16] pyparsing.ParseException: Expected end of text, found ':' (at char 1), (line:1, col:2) [11:10:31] cumin.backends.InvalidQueryError: Unexpected boolean operator 'and' with hosts '' [11:10:41] Yeah, there's an empty cumin query somewhere [11:11:10] I'll log the stacktrace in the task and proceed and we debug later? [11:11:41] sure, but I would do the whole process again later [11:11:49] ack [11:11:52] (when fixed) [11:12:33] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) Error seems to come from cumin query: ` 2023-02-06 12:09:06,872 DRY-RUN cgoubert 2367071 [ERROR _menu.py:261 in run] Exception raised... 
[11:13:03] !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab2002.wikimedia.org with reason: Running failover to gitlab1003 - T329930 [11:13:04] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [11:13:07] T329930: Switchover gitlab-replica (gitlab2002 -> gitlab1003) - https://phabricator.wikimedia.org/T329930 [11:13:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [11:13:18] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab2002.wikimedia.org with reason: Running failover to gitlab1003 - T329930 [11:13:37] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [11:13:39] (03PS2) 10AikoChou: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) [11:13:39] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [11:13:51] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [11:13:51] !log cgoubert@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2023-02-22 11:13:51.466468 [11:13:51] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [11:14:04] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [11:14:06] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [11:14:27] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [11:15:05] (03CR) 10AikoChou: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) 
[11:15:46] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 11:13 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki 11:13 <+logmsgbot> !log cgoub... [11:16:26] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [11:16:53] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [11:17:26] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [11:18:38] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters [11:24:15] how many steps left? [11:24:19] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) [11:24:33] done [11:24:37] with the cookbook steps [11:24:42] that one seemed quite long! [11:24:55] PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (43/43) [05:29<00:00, 7.67s/hosts] [11:24:57] Yeah [11:24:58] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) ` 11:18 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters 11:24 how man... 
[11:25:54] It runs sequentially on quite a few hosts [11:25:59] (all the DB masters) [11:26:20] (03PS1) 10KartikMistry: Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/890863 (https://phabricator.wikimedia.org/T329893) [11:26:22] !log installing git security updates [11:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:32] I see, probably it is one that is not time-sensitive [11:26:45] (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:26:50] jynus: No, we're out of read-only at that point [11:27:03] read only steps are 2-7 [11:27:03] ngix @ codfw ? [11:27:17] wait, we have some issue, is it only monitoring? [11:28:37] (03PS1) 10KartikMistry: Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890864 (https://phabricator.wikimedia.org/T329893) [11:29:00] yeah, it is prometheus [11:29:10] I was confused by the wording "Reduced availability for job nginx" [11:29:17] sudo confctl select 'dc=codfw,service=nginx' get says everything pooled [11:29:43] really meaning "reduced availibility on prometheus scraping job for nginx" [11:30:05] I read it as "reduced availibility for nginx" :-D [11:30:07] yeah, it's confusing [11:30:12] RECOVERY - Wikitech and wt-static content in sync on cloudweb1004 is OK: wikitech-static OK - wikitech and wikitech-static in sync (41139 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [11:30:39] it's fine, not sure if you have to do more operations re test? 
[11:31:02] No I think we're good now [11:31:36] debugging then, it is, but seems like an easy fix if just a cumin query issue [11:31:44] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:58] jynus: about the nginx job, it's gitlab [11:32:05] https://grafana.wikimedia.org/goto/yMtb46JVz?orgId=1 [11:32:31] claime: I'm back from a very long meeting, I haven't read the backlog, anything I can help with? [11:32:31] And that's probably because eoghan and jelto are running a switchover on gitlab [11:32:45] ok, good [11:32:45] volans: Can you help me check out https://phabricator.wikimedia.org/T330271#8636689 ? [11:33:11] It's the only thing that crapped out in the whole live-test [11:33:31] that's a query that does 'foo and bar ...' and foo returns 0 hosts [11:33:36] yep [11:33:43] we are switching the gitlab-replica. That should not have an impact [11:33:57] which cookbook was that? [11:34:01] at least from gitlab replicas being down [11:34:02] jelto: thanks, no issue, just the alerting was confusing to me at first [11:34:07] Cookbook sre.switchdc.mediawiki.03-set-db-readonly ? [11:34:07] volans: sre.switchdc.mediawiki.03-set-db-readonly [11:34:12] heh [11:35:18] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:35:19] I see also another failure in the logs [11:35:19] spicerack.mysql_legacy.MysqlLegacyError: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1 [11:35:37] (later on) [11:36:17] volans: yes, that's the error that bubbled up in the cookbook, the stacktrace is from the spicerack log [11:36:27] x2??? [11:36:32] didn't we ditch it [11:36:56] a deploy defect, maybe? 
[11:37:10] I thought so cgoubert@cumin1001:~/cookbooks$ dpkg -l | grep spicerack [11:37:12] ii spicerack 6.2.1-1+deb11u1 amd64 Automation and orchestration library for WMF, written in Python [11:37:22] that's the right version right volans ? [11:37:28] yes [11:37:32] checking [11:37:35] weird then [11:37:38] ty <3 [11:37:52] jynus: maybe my patch sucked, idk [11:38:11] volans: I'll create a specific task for it [11:38:15] ha ha, don't think so, usually it is the smallest issue [11:39:14] meeting over, do you need anything now claime ? [11:39:45] Amir1: volans is helping debug the 03-set-db-readonly error [11:40:05] okay, I'm around. Ping me if needed [11:40:21] Thanks <3 [11:41:15] claime: that stacktrace is from the 6th... [11:41:25] ffs [11:41:27] sorry [11:41:30] ha ha [11:41:44] let me check the right log then [11:42:17] I am about to leave but -tech has a report of users seeing read-only errors [11:42:40] taavi: which wiki? en? [11:42:44] bn [11:42:48] yeah bn [11:43:26] that's s3 [11:43:27] that's not normal, we should not be changing the RO status in the live DC during the live-test [11:44:09] db1166 errors [11:44:30] I will depool and later debug [11:44:33] ack [11:44:47] claime: the cookbook issue is that on db1118 [11:44:51] SELECT ts FROM heartbeat.heartbeat WHERE datacenter = 'codfw' and shard = 's1' ORDER BY ts DESC LIMIT 1; [11:44:51] yes [11:44:54] Empty set (0.000 sec) [11:45:16] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1166, seen mw errors', diff saved to https://phabricator.wikimedia.org/P44726 and previous config saved to /var/cache/conftool/dbconfig/20230222-114515-jynus.json [11:45:16] so it can't get the heartbeat [11:46:36] we have grant errors, Amir1 [11:46:54] sigh [11:47:07] what is the user [11:47:08] I depooled db1166, but it may be deeper [11:47:24] not sure if grant or network, but authentication is failing at random [11:47:51] where are you seeing this? 
nothing in https://logstash.wikimedia.org/goto/83874c6c6b848b8236a12c8f470be6f8 [11:48:07] network fails at random. It might not be grants [11:48:23] db1166 has not been touched by the cookbook [11:48:23] read-only errors are in https://logstash.wikimedia.org/goto/ae819cf364437054868c6cf56829bee1, and not limited to s3 [11:48:47] it started at 11:03 [11:48:55] so maybe test related [11:49:04] https://orchestrator.wikimedia.org/web/clusters [11:49:07] I am repooling db1166, seems something else [11:49:11] nothing has issues on db side [11:49:24] I think the test might have actually set production to read-only [11:49:27] let me check dbctl [11:49:41] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1166, errors not fixed', diff saved to https://phabricator.wikimedia.org/P44727 and previous config saved to /var/cache/conftool/dbconfig/20230222-114940-jynus.json [11:49:46] the timing matches with the read-only cookbook [11:49:52] wtf [11:50:21] icinga checks for read only look fine [11:50:58] edits are flowing, so it is localized [11:51:32] ok why are mw2* which are in codfw trying to write? [11:51:46] https://www.irccloud.com/pastebin/nSHr1J7j/ [11:52:04] is it ro? [11:53:04] dbctl says it's not ro, everything is rw [11:53:21] all are codfw [11:54:05] There are no A/P mediawiki services pooled in codfw [11:54:07] dbs are ok, so we should not be in a split brain [11:54:24] so why are edits going there? [11:54:52] it's not all edits, these are read views that are giving ro [11:55:12] some can write to the db but it should reach eqiad master [11:55:14] claime: can you check etcd status for primary and active dc for mw? [11:55:49] If it's helpful, users reporting getting read only also report that edits made via API go through [11:55:57] maybe something changed there, I am trying to check for causes [11:56:07] from mw point of view, regardless of dc, the master is eqiad's master [11:56:36] then what could it be? 
[11:56:59] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799685 [11:57:12] cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ confctl --object-type mwconfig select name=ReadOnly get [11:57:14] {"ReadOnly": {"val": "You can't edit now. This is because of maintenance. Copy and save your text and try again in a few minutes."}, "tags": "scope=codfw"} [11:57:16] {"ReadOnly": {"val": false}, "tags": "scope=eqiad"} [11:57:53] !log cgoubert@cumin1001 conftool action : set/val=false; selector: name=ReadOnly,scope=codfw [11:58:19] ok the cookbook doesn't reset the ReadOnly val [11:58:30] you think you found it and fixed it? [11:58:38] Can someone sanity check that false/false is the right status? [11:58:45] this day is getting more interesting by the hour [11:58:46] for mwconfig [11:59:02] yup fixed [11:59:07] claime: Thanks! [11:59:25] Please confirm mwconfig ReadOnly false/false is the right state [11:59:39] (03CR) 10EoghanGaffney: [C: 03+2] Change the active gitlab replica host to be the eqiad instance [puppet] - 10https://gerrit.wikimedia.org/r/890779 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney) [11:59:50] claime: I don't know [11:59:55] me neither [12:00:02] but errors seem to have stopped? [12:00:05] :death: [12:00:05] but the errors are gone [12:00:27] MasterDatacenter is set to eqiad so we should be ok [12:00:48] and dbs are a final protection against split brains [12:00:53] which is a good thing [12:01:00] Adding a big warning to the Switch Datacenter page [12:01:11] let me create a new "mini-incident" [12:01:34] and confirm with reporters things look good [12:01:38] claime: was 07-set-readwrite.py not called? 
[12:02:17] dude my tmux :D [12:02:27] !log installing NSS security updates [12:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:37] 2023-02-22 11:13:51,074 cgoubert 2919708 [INFO] START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [12:02:38] 2023-02-22 11:13:51,076 cgoubert 2919708 [INFO] Set MediaWiki in read-write in eqiad [12:02:40] 2023-02-22 11:13:51,077 cgoubert 2919708 [INFO] Setting val=False for tags: {'scope': 'eqiad', 'name': 'ReadOnly'} [12:02:44] It's only setting it for DC_TO [12:02:50] It's not resetting DC_FROM [12:03:02] I think that's a holdover from before multidc [12:03:12] And it skipped right under my nose when I checked [12:03:19] ahhh got it [12:03:34] I'll fix it [12:04:10] btw, I think we need to check affects of this patch done on multidc before the switchover: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799685/11/wmf-config/etcd.php [12:04:25] claime: also 02-set-readonly.py should have set_reaonly for both [12:04:28] at this point [12:04:40] I am double checking with reporter [12:05:15] volans: will double-chek 02. [12:05:26] jynus: as for the original failure regarding db1118 the cookbook tries to get the heartbeat for codfw, but it's empty. Could it be because the replication is not yet enabled codfw->eqiad? [12:05:28] to be clear, nothing to do with the grants? 
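The gap described above (07-set-readwrite clearing ReadOnly only for DC_TO and leaving DC_FROM untouched) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual cookbook: `FakeConftool` is a hypothetical in-memory stand-in for the real client, and `set_readwrite_all` is an invented helper name. Only the shape of the `set_and_verify("val", False, scope=..., name="ReadOnly")` call is taken from the cookbook line quoted in this log.

```python
class FakeConftool:
    """Hypothetical in-memory stand-in for a conftool client (not the real API)."""

    def __init__(self, state):
        # state maps (name, scope) -> value, e.g. ("ReadOnly", "codfw") -> False
        self.state = dict(state)

    def set_and_verify(self, key, value, *, scope, name):
        # Only the "val" key is modelled in this sketch.
        if key != "val":
            raise KeyError(key)
        self.state[(name, scope)] = value
        # Read back what was stored; a real client would re-query the backend here.
        if self.state[(name, scope)] is not value:
            raise RuntimeError(f"{name} not updated for scope={scope}")


def set_readwrite_all(conftool, datacenters):
    """Clear ReadOnly in every datacenter, not only the switchover target."""
    for dc in datacenters:
        conftool.set_and_verify("val", False, scope=dc, name="ReadOnly")


READ_ONLY_MSG = "You can't edit now. This is because of maintenance."
conftool = FakeConftool({
    ("ReadOnly", "eqiad"): READ_ONLY_MSG,
    ("ReadOnly", "codfw"): READ_ONLY_MSG,
})
set_readwrite_all(conftool, ["eqiad", "codfw"])
print(conftool.state)
```

Looping over all datacenters instead of a single DC_TO is the design choice being illustrated; whether the real cookbook should take a list or derive both scopes itself is left open here.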
[12:05:35] (03PS2) 10EoghanGaffney: Update DNS to switch gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) [12:05:38] Amir1: false positive on my side [12:05:49] althought there may be small issues there [12:06:02] saw some auth errors, but were unrelated [12:06:07] small issues is the least terrible thing about the grants [12:06:29] claime: I get the report from someone that can edit on desktop but show read only on mobile [12:06:46] (03CR) 10EoghanGaffney: [C: 03+2] Update DNS to switch gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney) [12:07:30] in total 700 errors, it eats a lot of error budget but not really an incident I think [12:07:56] jynus: i don´t know what to do with that information :/ [12:08:22] claime: basically I don't have confirmation that it is fully fixed [12:08:38] I am discussing with someone to check if the issue is still ongoing [12:08:48] https://config-master.wikimedia.org/mediawiki.yaml [12:09:13] cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ sudo confctl --object-type mwconfig select name=ReadOnly get [12:09:15] {"ReadOnly": {"val": "false"}, "tags": "scope=codfw"} [12:09:17] {"ReadOnly": {"val": false}, "tags": "scope=eqiad"} [12:09:19] Stale config? [12:11:01] yeah, read-only gets cached to avoid stampede. Maybe an overly aggressive cache? [12:11:17] mobile reaches the same cluster [12:11:34] "commons still locked i see.." [12:11:45] (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:11:47] so not a single user, multiple are seeing the issue [12:12:13] I am going to update the status page at this point [12:12:16] How can we reset cache? 
[12:12:20] jynus: yes, thank you [12:12:22] I'm so sorry :( [12:12:37] I mean cache for that particular value? [12:12:39] manually [12:13:00] claime: we have icinga checks to ensure mw siteinfo returns the latest value in etcd [12:13:03] and those are not firing [12:13:04] why is the other false a string and the other a boolean? [12:13:20] taavi: good catch [12:13:31] !log cgoubert@cumin1001 conftool action : set/val=False; selector: name=ReadOnly,scope=codfw [12:13:44] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [12:13:48] !log cgoubert@cumin1001 conftool action : set/val=no; selector: name=ReadOnly,scope=codfw [12:13:55] WTF [12:14:32] !log cgoubert@cumin1001 conftool action : set/val=false; selector: name=ReadOnly,scope=codfw [12:14:41] it doesn't want to set it to a bool [12:15:13] please advise [12:15:35] who in your team know about this, let's call them [12:15:35] what commands did you try ? [12:15:50] (03CR) 10Ladsgroup: mariadb: Update grants to use wikiuser@10.% only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [12:15:53] sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=false [12:15:57] sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=False [12:16:01] sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=no [12:16:20] !log akosiaris@cumin1001 conftool action : set/val=false; selector: name=ReadOnly [12:16:22] akosiaris@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [12:16:37] akosiaris: still a string [12:16:51] 0? 
[12:16:53] I'll run the set rw cookbook with codfw as dc_to [12:16:57] I went for a string [12:17:06] it was a bool before [12:17:26] well, I went for quotes, it was unquoted before [12:17:30] (03PS1) 10Superpes15: [sysop_itwiki] Change the logo, the favicon, and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891261 (https://phabricator.wikimedia.org/T330279) [12:17:44] https://wikitech.wikimedia.org/wiki/MediaWiki_and_EtcdConfig says to use the edit command [12:17:47] and I do see the quotes now [12:18:07] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [12:18:08] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [12:18:11] !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-02-22 12:18:11.451680 [12:18:11] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [12:18:12] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [12:18:14] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [12:18:36] why is stashbot failing to write? [12:18:39] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [12:18:40] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [12:18:43] the cookbook does [12:18:43] self._conftool.set_and_verify("val", False, scope=datacenter, name="ReadOnly") [12:18:45] !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-02-22 12:18:45.829060 [12:18:46] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [12:18:53] it should hit eqiad anyway, shouldn't it ? 
[12:19:03] cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ sudo confctl --object-type mwconfig select name=ReadOnly get [12:19:05] {"ReadOnly": {"val": false}, "tags": "scope=codfw"} [12:19:07] {"ReadOnly": {"val": false}, "tags": "scope=eqiad"} [12:19:09] There [12:19:19] I had to run it twice with the two orders [12:19:41] I don't get a read-only warning [12:19:57] claime: running mediawiki.set_readwrite('codfw') coyuld have been quicker ;) [12:20:01] *from a spicerack repl [12:20:07] volans: yeah well I panicked ok :D [12:20:17] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:20:28] but the question is why it does allow to set it to the wrong value [12:20:30] I am not sure what happened tbh. The issue was we had eqiad set to a string ? [12:20:44] akosiaris: codfw was originally not reset to rw [12:20:45] I thought the errors were about codfw ? [12:20:59] Then I tried the confctl command to set codfw to ro=false [12:21:05] It didn't work because string [12:21:19] Now it's all bool [12:21:26] so what's the current status, we think it is fixed? 
[12:21:33] to ask reporters to confirm [12:21:50] edit works on en.wiki for me [12:22:03] I can't find my way through other languages unfortunately [12:22:04] this is I think on codfw only [12:22:14] not eqiad [12:22:23] (has been, all errors were mw2xxx) [12:22:25] I asked again to see [12:22:26] !log test [12:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:31] recentchange is picking back up [12:22:35] ah stashbot is fine again [12:22:48] "yep works" I get now [12:22:51] maybe I'm missing something [12:22:52] so seem fixed [12:23:15] ok, I broke eqiad too apparently [12:23:20] I had this on plsource, but it's back now [12:23:43] for a small window of time, around 2m, from 14:16 to 14:18 [12:23:44] akosiaris: claime: I may need help with the timeline, I am quite confused [12:23:48] akosiaris: yeah, you set it to a string because no selector and confctl doing bs [12:23:54] you are not the only one jynus [12:23:59] he he [12:24:06] yeah I have it [12:24:07] I just figured out what happened [12:24:20] or at least pieced enough pieces together [12:24:37] akosiaris: https://docs.google.com/document/d/12QY-N1oXRwY4tPHO0fwrvf2osvZnr-2Vjfl_3pAOjE4 [12:24:54] I suspect MW treats 'false' inconsistently, which explains why I could not reproduce it initially at least [12:24:58] so, set val=value in confctl doesn't set a bool but sets a string, interesting [12:25:17] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:25:22] akosiaris: yeah and that's... bad [12:26:22] akosiaris: shouldn't it adhere to the db_readonly.schema? 
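The string-versus-boolean problem noted above can be illustrated with a short sketch. It assumes a consumer that treats any truthy ReadOnly value as the read-only reason and any falsy value as read-write; `read_only_reason` and `is_stringified_bool` are hypothetical helpers written for this example, not MediaWiki or conftool code. The two JSON payloads mirror the shape of the `confctl ... get` output pasted earlier in the log.

```python
import json


def read_only_reason(val):
    """Return the read-only reason, or None when the value means read-write.

    Hypothetical consumer-side check: a falsy value means read-write,
    anything else is shown to users as the reason.
    """
    return val if val else None


def is_stringified_bool(val):
    """Guard (assumption): flag strings that were probably meant to be booleans."""
    return isinstance(val, str) and val.strip().lower() in {"false", "true", "no", "yes"}


# Same shape as the confctl output in the log: one string, one real boolean.
as_string = json.loads('{"ReadOnly": {"val": "false"}}')["ReadOnly"]["val"]
as_bool = json.loads('{"ReadOnly": {"val": false}}')["ReadOnly"]["val"]

for label, val in (("string", as_string), ("boolean", as_bool)):
    print(label, repr(val), "->", read_only_reason(val), is_stringified_bool(val))
```

Under these assumptions the quoted "false" is a non-empty string, so it is reported as a read-only reason, while the unquoted false is not.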
[12:26:45] (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:27:40] I see, the real issue was the old cookbook assuming that setting codfw read only was ok [12:27:55] at least the initial trigger, right? [12:28:10] and later confd weirdness? [12:28:49] volans: same question from my side [12:29:22] setting status page to resolved unless someone disagrees [12:29:52] SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (jbond) @ayounsi I have taken another look at this. From the steps in the document above I have now configured * Register the Juniper API gateway app in the Customer/Partner's IdP.... [12:30:22] doing now [12:31:45] (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:31:58] akosiaris: here's the culprit [12:31:58] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/files/conftool/json-schema/mediawiki-config/db_readonly.schema [12:32:06] !log installing openssl security updates on buster [12:32:08] jynus: Exactly that, the timeline I constructed in the doc should be explicit [12:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:20] so it can be either a string or boolean false [12:32:38] so too main issue, an unexpected (old) change, and an issue preventing us from manually solving it [12:32:42] *2 [12:32:43] not sure if the confctl CLI allows setting all of those values [12:33:22] I haven't managed to use confctl to set it to a boolean [12:36:39] claime: I think 
'edit' would work [12:36:45] judging from https://sal.toolforge.org/production?p=0&q=%22name%3DReadOnly%22&d= [12:37:01] volans: wasn't aware of edit, will add it to my snippets [12:37:17] Or at least it didn´t come to mind in the heat of the moment [12:37:59] so, 02-set-readonly does set both datacenters RO (except in live-test where it doesn't touch dc_to) [12:38:28] But 07-set-readwrite doesn't revert it for dc_from [12:38:40] That's my conclusion for root cause [12:39:07] I think the string option is for a reason [12:39:07] 10SRE, 10Security-Team, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (10Bawolff) a:05Bawolff→03None Huh. Guess this got deployed to `/wikipedia/(el|fr|ru|it|de|uk|ja|id|he|fi|zh|test)` but never everywh... [12:39:14] yes, because we were going RW/RO -> RO/RO -> RO/RW [12:39:52] yeah, the way mw handles it is either the value is falsey or it's the read only reason (polymorphic variable) [12:39:55] now is RW/RW -> RO/RO -> RW/RW | RO/RW based on how you want eqiad to be after the switch [12:40:02] falsey means it's not RO [12:40:17] volans: well we can't keep eqiad RO in mediawiki terms apparently [12:40:24] which is not great but hey I didn't design it [12:40:35] Since here leaving codfw in that state caused user-facing issues [12:40:56] so it's forcely RW/RW -> RO/RO -> RW/RW [12:40:56] !log rolling restart of FPM on mw canaries [12:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:06] volans: yep, and only the dbs are RO [12:41:12] left RO* [12:41:37] that should wonder the question if ReadOnly should not be anymore per-dc [12:41:50] but a central value for mediawiki (no, not for next week ;) ) [12:42:01] Yeah, I feel like it should be a global setting in the future [12:42:30] I don't see a use in having even the passive DC set RO from a mw point of view if it causes edit outages for users [12:42:36] 
(CR) Jbond: [C: +2] dnsquery: Add dnsquery module to vendor_modules [puppet] - https://gerrit.wikimedia.org/r/890476 (owner: Jbond) [12:42:40] (CR) Jbond: [C: +2] wmflib::dns_lookup: switch to dnsquery::lookup [puppet] - https://gerrit.wikimedia.org/r/890477 (owner: Jbond) [12:42:48] (CR) Jbond: [C: +2] apereo_cas: Add missing docs and fix lint issues [puppet] - https://gerrit.wikimedia.org/r/890483 (owner: Jbond) [12:42:58] (CR) Jbond: [V: +1 C: +2] apereo_cas: update to use dnsquery functions for lookups [puppet] - https://gerrit.wikimedia.org/r/890484 (owner: Jbond) [12:42:59] yeah [12:44:32] SRE: Improve process to add/update keys for pwstore repo - https://phabricator.wikimedia.org/T262393 (MoritzMuehlenhoff) Open→Resolved This has been resolved for quite a while: The PGP keys are now stored within the repo itself, current docs are at https://office.wikimedia.org/wiki/Pwstore [12:44:35] the only possible explanation to me is that the gods of MediaWiki are angry at SREs and require a human sacrifice [12:44:55] otherwise I can't explain this many incidents in a day [12:46:10] I really thought vgutierre.z had made the necessary blood sacrifice [12:46:30] My understanding of the arcane mediawiki magic still leaves a lot to be desired [12:46:36] Amir1: let me know if the new wording is ok for you [12:46:50] SRE: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (MoritzMuehlenhoff) Open→Resolved bpfcc-tools is available for one-off debugging on all servers (it's in Debian starting with Bullseye and for Buster it can be installed from buster-backports). So I think this task a... 
[12:47:09] jynus: thanks looks good [12:47:14] Although I still don't understand why some writes go to codfw [12:47:34] and also IIRC, we have some cookies being sent that if the user is edited, they'd be routed to eqiad for a while [12:48:04] what I mean is some users edited, saw "we are in read only, stop editing" [12:48:10] and they did [12:48:22] Apparently the mobile button for edits was unclickable [12:48:23] so those are not accounted in errors [12:48:31] ^that [12:48:42] Which, wth [12:48:57] I gotta step out and breathe for a minute if that's ok with y'all [12:49:03] that could be an actionable, although I am unsure of which [12:49:19] claime: two reasons: 1- Still some GETs do write, so they can end up in codfw, they'd be just slow 2- Some dbs are writable in codfw (x2 for some ways of caching, PC) 3- the error is just saying "this wiki is read-only", e.g. it shows it to you when you attempt to edit, which is a get in codfw [12:49:26] I need to learn how to count [12:49:56] Amir1: Thanks, I understand now [12:50:03] also off by one errors [12:50:06] I think the last one is the biggest culprit [12:50:13] (03Abandoned) 10Jbond: Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [12:51:18] 10SRE, 10Observability-Logging, 10Traffic: varnish-frontend-fetcherr sets incorrect level in logstash - https://phabricator.wikimedia.org/T330267 (10fgiunchedi) I took a quick look at this and found the following: * the logger program seems to be `modules/varnish/files/varnishfetcherr.py` ran by `modules/va... 
[12:51:52] (PS1) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [12:52:18] Puppet, Infrastructure Security, Infrastructure-Foundations, User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (jbond) [12:53:23] (PS2) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [12:54:21] (PS3) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [12:56:10] (CR) CI reject: [V: -1] sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: Vgutierrez) [12:57:05] jynus: do you have an answer for my previous question a while ago? (regarding the cookbook failure for db1118) [12:57:19] sorry, it got lost with the incident [12:57:20] checking [12:58:10] jynus: reposting: """as for the original failure regarding db1118 the cookbook tries to get the heartbeat for codfw, but it's empty. 
Could it be because the replication is not yet enabled codfw->eqiad?""" [12:58:12] volans: yes and no- circular replication AFAIAA, is not enabled [12:58:23] but heartbeat should be running anyway [12:58:31] but maybe it changed since orchestrator was enabled [12:58:37] that would be new [12:58:46] I queried all s1 in eqiad and they have hearthbeat only for 'eqiad' not for 'codfw' [12:58:46] and if it, it would require more changes [12:58:57] yes, that is expected at the moment [12:58:59] while hosts in codfw have it for both [12:59:03] that's why it failed [12:59:05] ah [12:59:12] then it is just the circular replication [12:59:16] SELECT ts FROM heartbeat.heartbeat WHERE datacenter = 'codfw' and shard = 's1' ORDER BY ts DESC LIMIT 1; [12:59:20] (03Abandoned) 10Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - 10https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: 10Ladsgroup) [12:59:21] just returned empty [12:59:29] it will work when circular is enabled [12:59:38] but then it failed on all hosts, right? [12:59:40] not just s1 [13:00:10] or would have failed is what I mean [13:00:13] it stopped there because it failed [13:00:22] yes I think it would have failed on all eqiad hosts [13:00:27] yeah, so the test has to be done with circual replication enabled [13:00:32] yep [13:00:38] which is ok, it means the check works! 
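The heartbeat check discussed above can be reproduced in miniature. This is only a sketch: it uses an in-memory SQLite table as a stand-in for the MariaDB `heartbeat.heartbeat` table, flattens the name to `heartbeat`, and invents the sample row and the `latest_heartbeat` helper. The query itself follows the one pasted in the log.

```python
import sqlite3


def latest_heartbeat(conn, datacenter, shard):
    """Return the newest heartbeat timestamp for a datacenter/shard, or None."""
    row = conn.execute(
        "SELECT ts FROM heartbeat "
        "WHERE datacenter = ? AND shard = ? "
        "ORDER BY ts DESC LIMIT 1",
        (datacenter, shard),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat (ts TEXT, datacenter TEXT, shard TEXT)")
# Assumed state of an eqiad s1 host: heartbeat rows exist only for 'eqiad',
# because codfw -> eqiad (circular) replication is not enabled yet.
conn.execute(
    "INSERT INTO heartbeat VALUES (?, ?, ?)",
    ("2023-02-22T12:59:00", "eqiad", "s1"),
)

for dc in ("eqiad", "codfw"):
    print(dc, latest_heartbeat(conn, dc, "s1"))
```

With no 'codfw' rows present the lookup comes back empty, which matches the behaviour reported for the eqiad hosts.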
[13:00:46] but that will be done when manuel returns [13:00:48] I don't think orch replaced/changed anything related to heartbeat [13:01:08] yeah, I knew leftovers beats had been removed [13:01:19] but maybe also codfw master ones [13:01:31] but it wasn't, so it is just circular [13:01:50] volans: on my calendar I have 23- no more maintenance [13:02:07] 27th enable codfw-> replication [13:02:11] jouncebot: nowandnext [13:02:11] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [13:02:11] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1400) [13:02:17] can I deploy stuff? [13:02:38] so maybe retry on monday? [13:03:08] claime: ^^^ [13:03:14] ack thx [13:03:31] not a big deal because in an emergency, we don't care about checks, we just switch in whatever state we have [13:04:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10IPv6, 10User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. 
- https://phabricator.wikimedia.org/T102099 (10jbond) a:05jbond→03None [13:04:11] that's a check to make sure eqiad is working after failover [13:04:37] (03PS3) 10Ladsgroup: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) [13:04:46] (03CR) 10Ladsgroup: [C: 03+2] Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [13:04:51] 10SRE, 10Observability-Logging, 10Security, 10User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (10jbond) a:05jbond→03None [13:05:15] * volans grabbing some lunch [13:05:24] (03Merged) 10jenkins-bot: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [13:08:19] (03CR) 10Nicolas Fraison: Add a spark-operator chart and helmfile configuration (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [13:12:19] (03CR) 10Muehlenhoff: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [13:16:35] 10SRE, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) a:05fgiunchedi→03None [13:16:45] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) a:05fgiunchedi→03None [13:16:57] 10SRE, 10Observability-Alerting: Missing 'notify' for some Icinga configuration files - 
https://phabricator.wikimedia.org/T263027 (10fgiunchedi) a:05fgiunchedi→03None [13:17:21] (03CR) 10Volans: [C: 04-1] "Some issues inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert) [13:18:10] (03CR) 10Nicolas Fraison: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [13:21:34] https://www.irccloud.com/pastebin/OhErE4ol/ [13:21:44] should I worry? [13:29:36] scap is realllllllly slow [13:29:52] stuck in rebuilding the mw docker images it seems [13:31:33] akosiaris: maybe you know? P44731 [13:31:39] https://phabricator.wikimedia.org/P44731 [13:34:05] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx on durum [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) [13:35:54] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:46] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:38:11] almost forty minutes now... [13:40:50] (03PS1) 10Jbond: P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287 [13:41:00] (03CR) 10CI reject: [V: 04-1] P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287 (owner: 10Jbond) [13:43:49] (03PS2) 10Jbond: P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287 [13:43:56] Amir1: ouch. 
We've backport deployment in around 17 minutes :/ [13:44:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:45:05] it probably needs to be cancelled to be honest [13:47:08] (03CR) 10Jbond: P:idm move OIDC endpoint to variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede) [13:49:31] (03PS3) 10Jbond: P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287 [13:49:54] :/ [13:51:04] Amir1: Looks like capacity issues in codfw again [13:51:38] I'll send a CR for even less replicas in codfw for mw-* [13:52:05] (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/output/891287/39786/" [puppet] - 10https://gerrit.wikimedia.org/r/891287 (owner: 10Jbond) [13:57:59] (03CR) 10Jbond: "lgtm pending CI and the comment from Moritz" [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [14:00:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1400). nyaa~ [14:00:04] kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:20] * Lucas_WMDE can’t deploy [14:00:27] I can deploy :) [14:00:50] (also I think Amir1’s scap might still be running) [14:01:04] ack [14:01:34] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:59] it is running [14:02:09] so marked the hour [14:02:19] I have two more syncs [14:02:24] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:02:30] Amir1: no problem, let me know when I can proceed [14:02:32] (03PS10) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [14:02:32] otherwise everything breaks [14:02:52] TheresNoTime: currently that means two hours from now but I hope it gets fixed by then [14:02:59] ah! [14:03:34] kart_: it doesn't look like your patches are going to be deployed — can they be rescheduled? [14:03:52] TheresNoTime: Let me withdraw/reschedule it later or tomorrow. [14:03:53] Amir1: Is it still stuck on helmfile ? 
[14:03:58] yup [14:04:09] last line: 13:55:24 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 01m 56s) [14:04:17] ten minutes on it already [14:04:31] kart_: sure thing [14:05:39] (HelmReleaseBadStatus) firing: Helm release mw-api-int/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:06:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:06:26] !log UTC afternoon backport window not done due to in-progress deployment [14:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:42] (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s,thumbor: reduce codfw replicas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891277 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert) [14:07:31] claime: I'm tiny bit confused https://phabricator.wikimedia.org/T330048 is resolved and it's green [14:07:36] (in icinga) [14:08:01] Amir1: akosiaris is deploying the new nodes [14:08:24] Which were held up because of that task + some nodes couldn't be updated and pooled [14:08:44] !log test network connectivity of kubernetes20{17,18,19,21} [14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:59] (03CR) 10David Caro: "Oh, I forgot to "git add" 😮" [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro) [14:09:06] I'll destroy/recreate the failed release for mw-api-int [14:09:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:09:50] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:10:08] wmfdebug ping containers been created [14:10:22] and all of them completed 
successfully [14:10:29] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [14:10:36] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [14:10:41] I need to create a job that does this in a nice mesh way and reports it somewhere [14:10:45] I think it actually already exists [14:10:51] anyway, uncordoning hosts [14:10:54] (03PS2) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881 [14:11:19] !log uncordon kubernetes20{17,18,19,21} T330048 [14:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:23] T330048: asw-a-codfw management interface unreachable - https://phabricator.wikimedia.org/T330048 [14:11:49] done [14:13:00] it's now syncing apaches [14:13:21] (03CR) 10CI reject: [V: 04-1] profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro) [14:13:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond) [14:14:10] (03CR) 10Volans: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [14:14:13] I need to do the next sync and then we can see if it's fixed. 
I can wait until the new nodes are in place [14:15:03] Amir1: the new nodes are uncordoned, you should be good [14:15:20] We'll wait until all backports are done to scale back up [14:15:30] That way you shouldn't hit capacity issues [14:15:34] !log ladsgroup@deploy1002 Synchronized wmf-config/core-Permissions.php: Move all of userrights config out of IS.php to a dedicated file, part I (T308932) (duration: 68m 38s) [14:15:38] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [14:15:39] (HelmReleaseBadStatus) resolved: Helm release mw-api-int/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:15:45] duration: 68m 38s [14:16:48] the next one is starting [14:17:01] on 14:16:42 Started Running helmfile -e codfw --selector name=pinkunicorn apply in /srv/deployment-charts/helmfile.d/services/mw-debug [14:18:02] STATUS: deployed [14:18:04] REVISION: 14 [14:18:08] running helmfile is fast [14:18:20] Yeah, when it can schedule its pods properly, it is [14:20:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:21:49] this alert doesn't look very informative ^ [14:21:53] (03Abandoned) 10Clément Goubert: mw-on-k8s,thumbor: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/891277 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert) [14:22:54] (03PS1) 10Alexandros Kosiaris: Revert "mw-on-k8s: reduce codfw replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) [14:23:08] !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move all of userrights config out of IS.php to 
a dedicated file, part II (T308932) (duration: 07m 01s) [14:23:13] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [14:23:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Don't merge yet, we need first confirmation" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: 10Alexandros Kosiaris) [14:23:16] That was faster :D [14:23:22] Amir1: it does not, I suspect it is related to the grafana 9 upgrade [14:24:25] akosiaris: I think it's fixed now [14:24:31] 7 minutes [14:24:49] \o/ [14:25:46] (03PS3) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881 [14:26:17] godog: :D want me to file a ticket? [14:26:23] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@1a041e2] (releasing): (no justification provided) [14:26:47] Amir1: thank you, I think updating T317887 should be enough, I see Peter reported the same [14:26:48] T317887: Upgrade to Grafana 9 - https://phabricator.wikimedia.org/T317887 [14:27:13] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@1a041e2] (releasing): (no justification provided) (duration: 00m 49s) [14:29:38] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:03] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move all of userrights config out of IS.php to a dedicated file, part III (T308932) (duration: 06m 16s) [14:30:07] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [14:30:26] (03PS2) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 [14:30:45] I'm done [14:31:08] (03CR) 
10Ssingh: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:32:15] (03CR) 10Jbond: redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [14:32:32] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:33:18] (03PS1) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) [14:33:47] (03CR) 10Jbond: systemd::timer: update services to onshot and set RemainAfterExit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond) [14:33:49] (03CR) 10Jbond: [C: 03+2] systemd::timer: update services to onshot and set RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond) [14:34:21] (03PS2) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) [14:34:49] TheresNoTime: kart_ Lucas_WMDE feel free to deploy [14:37:41] (03CR) 10Ottomata: Add a postgresql database and user for airflow_search_platform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [14:38:09] some DNS and BGP (in cr*-ulsfo) incoming; expected [14:38:18] *alerts [14:38:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bullseye [14:38:51] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS 
bullseye [14:39:29] okay, I could deploy now if you want kart_ [14:41:57] (03Abandoned) 10Ladsgroup: Rework DNS entries of wikis in wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup) [14:42:16] PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:38] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-Needs-Improvement: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Ladsgroup) a:05Ladsgroup→03None [14:43:00] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:43:18] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:43:24] ^ expected [14:43:44] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:48] (03CR) 10Elukey: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [14:43:48] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:14] (03PS1) 10Alexandros Kosiaris: eventstreams-internal: Switch to the new way of defining service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/891283 [14:44:41] (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert) [14:44:51] (03CR) 10Elukey: [C: 03+2] ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 
(https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [14:46:26] PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:46:29] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:46:45] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:16] (03CR) 10Jbond: [C: 03+1] "lgtm left some optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro) [14:47:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:48:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:49:07] (03PS4) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [14:49:18] (03CR) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [14:49:56] (03CR) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [14:50:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams-internal: Switch to the new way of defining service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/891283 (owner: 10Alexandros Kosiaris) [14:51:00] (03CR) 10CI reject: [V: 04-1] sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [14:51:02] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for nginx on durum [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:53:06] (03PS5) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [14:53:37] (03PS5) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [14:55:20] (03PS1) 10Elukey: admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261) [14:56:48] RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 71.81 ms [14:56:53] (03Merged) 10jenkins-bot: 
eventstreams-internal: Switch to the new way of defining service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/891283 (owner: 10Alexandros Kosiaris) [14:57:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [14:57:53] (03PS1) 10Urbanecm: growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) [14:58:14] (03CR) 10CI reject: [V: 04-1] growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [14:59:25] (03PS2) 10Urbanecm: growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) [15:00:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [15:00:40] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:00:43] Lucas_WMDE: Sorry, rescheduled it tomorrow already. Stepped out too :/ [15:01:00] Next window is too late for me. [15:01:45] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:50] (03PS1) 10Slyngshede: P:IDM Ensure that social auth can lookup username. 
[puppet] - 10https://gerrit.wikimedia.org/r/891307 [15:01:56] (03PS1) 10Urbanecm: Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) [15:02:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [15:02:34] ok sure [15:02:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks for this (and for cleaning up my mess). FWIW, this appears cleaner, more maintenable and effectively achieves the same goal as far " [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [15:02:55] (03CR) 10Muehlenhoff: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [15:03:50] (03CR) 10Majavah: [C: 03+1] "These look correct, but I'm wondering a bit about the value of having separate prometheus servers in PAWS anymore. 
The primary reason for " [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook) [15:04:04] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [15:04:13] jouncebot: nowandnext [15:04:13] No deployments scheduled for the next 2 hour(s) and 55 minute(s) [15:04:13] In 2 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1800) [15:04:25] (03PS7) 10Urbanecm: [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [15:04:30] (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [15:05:07] (03CR) 10Majavah: [tox] Make running `tox` work (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [15:06:13] (03PS8) 10Urbanecm: [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [15:06:28] taavi: was going to merge it :). thanks for catching. 
[15:06:43] (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [15:08:05] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync [15:08:13] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync [15:08:27] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync [15:08:33] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync [15:09:09] (03CR) 10Vivian Rook: Update dns for paws prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook) [15:09:47] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync [15:10:13] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync [15:11:17] (03PS6) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [15:11:32] (03CR) 10Majavah: [C: 03+1] Update dns for paws prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook) [15:11:45] (JobUnavailable) resolved: (3) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:54] RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:13:11] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream: > ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface - 
https://phabricator.wikimedia.org/T264021 (10CDanis) [15:13:12] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:13:18] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:13:35] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (10Clement_Goubert) [15:13:58] (03CR) 10Jbond: [C: 03+2] P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: 10Jbond) [15:13:59] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (10Clement_Goubert) p:05Triage→03High [15:14:06] (03PS3) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (https://phabricator.wikimedia.org/T330300) [15:14:47] (03PS7) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [15:15:04] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (10taavi) [15:15:11] (03CR) 10Jbond: P:sre::check_user: add support for namely API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: 10Jbond) [15:15:20] (03PS6) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [15:15:24] (03PS1) 10Muehlenhoff: Adjust monitoring 
for KDC processes if worker threads are in use [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) [15:15:29] (03CR) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [15:15:44] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:16:50] (03CR) 10Jbond: [C: 03+2] P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: 10Jbond) [15:17:58] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) [15:18:26] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) p:05Triage→03High [15:19:30] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:19:52] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:20:11] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) [15:20:14] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:17] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - 
https://phabricator.wikimedia.org/T330300 (10Clement_Goubert) [15:20:18] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:27] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) 05Stalled→03In progress [15:20:32] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [15:21:25] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [15:21:33] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:22:18] (03PS1) 10Jbond: sre:check_user: make namley_api_key optional [puppet] - 10https://gerrit.wikimedia.org/r/891311 [15:23:16] (03PS4) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881 [15:23:19] (03CR) 10David Caro: profile.cloudceph: Add some tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro) [15:24:39] (03PS2) 10Jbond: sre:check_user: make namley_api_key optional [puppet] - 10https://gerrit.wikimedia.org/r/891311 [15:24:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one final remark/warning (feel free to ignore!)" [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [15:25:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [15:25:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Sustainability (Incident Followup): Globalize 
mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (10Clement_Goubert) [15:25:29] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (10Clement_Goubert) p:05Triage→03Medium [15:25:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39788/console" [puppet] - 10https://gerrit.wikimedia.org/r/891311 (owner: 10Jbond) [15:26:08] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:34] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:27:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [15:29:23] (03PS3) 10Jbond: sre:check_user: make namley_api_key optional [puppet] - 10https://gerrit.wikimedia.org/r/891311 [15:29:25] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-Needs-Improvement: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10MdsShakil) [15:30:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro) [15:30:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39789/console" [puppet] - 10https://gerrit.wikimedia.org/r/891311 (owner: 10Jbond) [15:30:39] !log update mwdebug2002 to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T323358 [15:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:51] (03PS7) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy 
[cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) [15:31:10] (03CR) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [15:31:18] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/781055 (https://phabricator.wikimedia.org/T306223) [15:31:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:31:51] (03PS4) 10Jbond: sre:check_user: make namley_api_key optional [puppet] - 10https://gerrit.wikimedia.org/r/891311 [15:31:57] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [15:33:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39790/console" [puppet] - 10https://gerrit.wikimedia.org/r/891311 (owner: 10Jbond) [15:33:33] (03CR) 10Jbond: [C: 04-1] "lgtm but missing dollar" [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [15:33:44] (03PS1) 10Ssingh: prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) [15:33:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:33:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [15:34:05] (03CR) 10CI reject: [V: 04-1] prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:34:22] these wdqs/wcqs updater alerts are expected [15:35:46] (03PS1) 10Krinkle: Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891314 [15:35:46] jouncebot: nowandnext [15:35:46] No deployments scheduled for the next 2 hour(s) and 24 minute(s) [15:35:46] In 2 hour(s) and 24 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1800) [15:35:49] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jdforrester-WMF) 05Open→03Resolved I'll mark this as Resolved. No-one wants to confess to knowing what the remain... [15:36:19] (03CR) 10Clément Goubert: [C: 03+1] "I think we're good now that deploys have been running as usual." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: 10Alexandros Kosiaris) [15:36:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39791/console" [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:37:09] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - 10https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [15:37:22] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - 10https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [15:38:42] (03PS2) 10Ssingh: prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) [15:40:21] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39792/console" [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:40:55] (03PS5) 10Jbond: sre:check_user: make namley_api_key optional [puppet] - 10https://gerrit.wikimedia.org/r/891311 [15:41:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39793/console" [puppet] - 10https://gerrit.wikimedia.org/r/891311 (owner: 10Jbond) [15:42:01] (03PS2) 10Muehlenhoff: Adjust monitoring for KDC processes if worker threads are in use [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) [15:42:54] (03Merged) 10jenkins-bot: rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - 
10https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [15:43:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] sre:check_user: make namley_api_key optional [puppet] - 10https://gerrit.wikimedia.org/r/891311 (owner: 10Jbond) [15:45:37] (03PS1) 10Volans: apt: add new module with new AptGetHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 [15:45:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [15:45:48] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:45:51] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:47:04] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:47:17] (03CR) 10Urbanecm: [C: 03+2] [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [15:47:55] (03Merged) 10jenkins-bot: [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [15:50:21] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887830|[tox] Make running `tox` work (T329231)]] [15:50:25] T329231: CI should ensure that wmf-config/logos.php matches logos/config.yaml - https://phabricator.wikimedia.org/T329231 [15:50:34] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez) [15:50:34] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:52:08] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:887830|[tox] Make running `tox` work (T329231)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:52:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:54:32] (03PS1) 10Muehlenhoff: idm::jobs: Adapt auto restart to only run of idm-rq is active/present [puppet] - 10https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797) [15:55:19] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10MdsShakil) >>! In T152882#8637729, @gerritbot wrote: > Change 891289 **abandoned** by MdsShakil: > %%%[operations/dns@master] add missing mobile domain for ombu... 
[15:56:30] (03PS1) 10Alexandros Kosiaris: eventstreams-internal: Fix public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/891319 [15:56:34] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:57:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [15:57:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] Exclude traindev from tests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert) [15:58:16] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887830|[tox] Make running `tox` work (T329231)]] (duration: 07m 54s) [15:58:19] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:58:20] T329231: CI should ensure that wmf-config/logos.php matches logos/config.yaml - https://phabricator.wikimedia.org/T329231 [15:59:39] (03CR) 10Filippo Giunchedi: "See inline, idea LGTM though" [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:59:51] (03CR) 10Volans: [C: 03+1] "Works as it is, couple of possible improvements inline" [cookbooks] - 
10https://gerrit.wikimedia.org/r/891266 (https://phabricator.wikimedia.org/T330300) (owner: 10Clément Goubert) [16:00:09] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Dzahn) I wonder if the traffic team sees any blocker here to just add the missing .m. names to DNS. [16:00:19] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:00:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [16:01:34] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:02:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [16:04:44] (03CR) 10Muehlenhoff: prometheus: update ensure_packages for node_gdnsd (bullseye) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:06:57] (03Abandoned) 10Ilias Sarantopoulos: ml-services: upgrade revertrisk staging to debian bullseye [deployment-charts] - 
10https://gerrit.wikimedia.org/r/890401 (https://phabricator.wikimedia.org/T328439) (owner: 10Ilias Sarantopoulos) [16:08:21] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:08:29] (03CR) 10Ssingh: [V: 03+1] prometheus: update ensure_packages for node_gdnsd (bullseye) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:09:20] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [16:09:32] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:10:22] (03PS3) 10Ssingh: prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) [16:10:30] PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-webserver@search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:11:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39795/console" [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:11:36] (03CR) 10Elukey: [C: 04-1] "Forgot to add the new intermediate pkis, going to do it :)" [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [16:12:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/891307 (owner: 10Slyngshede) [16:12:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams-internal: Fix public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/891319 (owner: 10Alexandros Kosiaris) [16:14:18] (03PS1) 10Ssingh: prometheus: refactor 
prometheus-gdnsd-stats.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) [16:14:22] (03PS3) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) [16:14:24] (03PS1) 10Elukey: profile::pki::root_ca: add new pkis for the DSE k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) [16:15:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39796/console" [puppet] - 10https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:17:35] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:17:43] (03Merged) 10jenkins-bot: eventstreams-internal: Fix public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/891319 (owner: 10Alexandros Kosiaris) [16:21:43] !log hashar@deploy1002 Started deploy [integration/docroot@956dd11]: zuul: Link to report_url if available [16:21:59] !log hashar@deploy1002 Finished deploy [integration/docroot@956dd11]: zuul: Link to report_url if available (duration: 00m 15s) [16:22:59] !log hashar@deploy1002 Started deploy [integration/docroot@b32e023]: doc: Add GrowthExperiments to MediaWiki components - T329034 [16:23:03] T329034: Publish frontend docs to doc.wikimedia.org/ - https://phabricator.wikimedia.org/T329034 [16:23:07] !log hashar@deploy1002 Finished deploy [integration/docroot@b32e023]: doc: Add GrowthExperiments to MediaWiki components - T329034 (duration: 00m 07s) [16:26:48] 10SRE-tools, 10Infrastructure-Foundations: Decide which cookbooks using icinga_hosts.wait_for_optimal() should use skip_acked=True - https://phabricator.wikimedia.org/T330136 (10MoritzMuehlenhoff) I ran into issues with cookbooks running the 
SREBatchRunnerBase class, the ganeti.reboot-vm and ganeti.reboot-sing... [16:29:17] (03PS1) 10Jbond: wmf-update-known-hosts-production: handle multiple algorithems [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891326 [16:34:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [16:35:39] Lucas_WMDE: o/ was looking at https://grafana-rw.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-3h&to=now to see the impact we had on some maintainance on the wdqs updater but seems like metrics stopped being collected a couple hours ago [16:36:15] yeah, we noticed that too :/ but I’m in a meeting right now, haven’t looked into it yet [16:36:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [16:36:45] sure no rush just wanted to let you know :) [16:37:13] (03CR) 10Volans: "couple of nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:39:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891315 (owner: 10Volans) [16:48:00] (03PS2) 10Ssingh: prometheus: refactor prometheus-gdnsd-stats.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) [16:48:56] (03CR) 10Ssingh: prometheus: refactor prometheus-gdnsd-stats.py to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:51:00] dcausse: I think someone™ with the right permissions needs to give wmde-analytics-minutely.service on stat1007 a kick [16:51:24] I have enough permissions to see that it’s “active (exited) since … 14:55”, which is exactly when the stats stopped coming in (I think) [16:51:28] but not enough permissions to restart it [16:52:11] 
I don’t remember who I asked for such requests in the past though :S [16:52:14] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (10BCornwall) Ah, my bad, I thought this *was* affecting bullseye. Oops. Sounds good then. [16:54:19] (03PS1) 10Alexandros Kosiaris: Ship systemd user units, fix a bug [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891330 [16:56:18] filed T330311 for the wikidata stats issue [16:56:18] T330311: wmde-analytics-minutely.service is no longer running on stat1007 - https://phabricator.wikimedia.org/T330311 [16:56:33] if someone with root on stat1007 could look into this I’d be much obliged [16:56:39] not sure what the right phab tags would be [16:56:57] (03PS1) 10Cwhite: logstash: remove SEVERITY_LABEL from syslog messages [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) [16:59:07] Lucas_WMDE: I'd ping someone in #wikimedia-analytics [16:59:21] thanks, will do [17:00:26] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7224 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [17:01:02] PROBLEM - Ensure cert-sync script runs successfully in the active node on acmechief1001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.done is 7258 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [17:05:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:05:59] jbond: cert-sync alert could be related to https://github.com/wikimedia/operations-puppet/commit/0eee4e1b5208815e367b9ff3bf14a810b53edcc1 [17:06:25] On my smartphone right now... 
no real git client at the moment [17:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 10%: T329864', diff saved to https://phabricator.wikimedia.org/P44733 and previous config saved to /var/cache/conftool/dbconfig/20230222-170920-root.json [17:09:26] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [17:09:35] (03CR) 10Ssingh: "Does this have a task by any chance? Looks good otherwise." [dns] - 10https://gerrit.wikimedia.org/r/890908 (owner: 10Zabe) [17:10:15] (03CR) 10Lucas Werkmeister (WMDE): "I think this might have caused T330311 – please take a look?" [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond) [17:10:45] (03PS2) 10Zabe: Add mobile domain for ombuds.wm.o [dns] - 10https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) [17:10:58] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover, 10Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (10Ladsgroup) [17:11:51] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (10Ladsgroup) [17:12:15] (03CR) 10Ssingh: [C: 03+2] Add mobile domain for ombuds.wm.o [dns] - 10https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) (owner: 10Zabe) [17:12:33] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Ladsgroup) [17:13:01] (03PS3) 10Ssingh: Add mobile domain for ombuds.wm.o [dns] - 10https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) (owner: 10Zabe) [17:15:07] (03CR) 10Dzahn: "yea, this very much has a task, since 2016. for some reason they never got added. 
https://phabricator.wikimedia.org/T152882" [dns] - 10https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) (owner: 10Zabe) [17:15:19] (03CR) 10Ottomata: "wgEventStreams is not an EventLogging config. It is for EventStreamConfig extension. EventLogging and EventBus use wgEventStreams via the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [17:16:10] (03CR) 10Ladsgroup: [C: 03+2] Migrate EventLogging config into its own file (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [17:17:33] !log running authdns-update for T152882 / CR 890908 [17:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:38] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [17:18:59] (03CR) 10Ssingh: [C: 03+2] prometheus: refactor prometheus-gdnsd-stats.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:19:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] prometheus: update ensure_packages for node_gdnsd (bullseye) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:24:21] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns4004.wikimedia.org with OS bullseye [17:24:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 25%: T329864', diff saved to https://phabricator.wikimedia.org/P44734 and previous config saved to /var/cache/conftool/dbconfig/20230222-172424-root.json [17:24:29] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [17:24:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye executed with errors: - dns4004 (**FAIL**) - Downtimed o... [17:27:41] (03CR) 10BCornwall: [C: 03+1] varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [17:30:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:30:16] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:31:40] (03PS1) 10Lucas Werkmeister (WMDE): systemd::timer: unset RemainAfterExit again [puppet] - 10https://gerrit.wikimedia.org/r/891340 (https://phabricator.wikimedia.org/T330311) [17:32:33] ^ I could use some SRE-y person taking a look at this puppet change [17:33:23] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Zabe) [17:34:00] (03CR) 10Lucas Werkmeister (WMDE): "CCing people from Iee96d252fa." [puppet] - 10https://gerrit.wikimedia.org/r/891340 (https://phabricator.wikimedia.org/T330311) (owner: 10Lucas Werkmeister (WMDE)) [17:34:49] (03CR) 10Jbond: [C: 03+2] "thanks for the patch ill merge this now as its causing issues and will look again at the cloudbackup issue tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/891340 (https://phabricator.wikimedia.org/T330311) (owner: 10Lucas Werkmeister (WMDE)) [17:34:52] (03CR) 10Klausman: [C: 03+1] profile::pki::root_ca: add new pkis for the DSE k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [17:35:06] jbond: thanks for the quick response! 
[17:35:16] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:35:24] I’m testing the After= behavior locally now [17:35:36] (03CR) 10Klausman: [C: 03+1] admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [17:36:00] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39798/console" [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [17:36:50] (03CR) 10Elukey: [C: 03+2] profile::pki::root_ca: add new pkis for the DSE k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [17:37:23] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39799/console" [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [17:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: T329864', diff saved to https://phabricator.wikimedia.org/P44736 and previous config saved to /var/cache/conftool/dbconfig/20230222-173929-root.json [17:39:35] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [17:40:19] I’ll check again after the puppet change is rolled out, but I suspect the service might also need to be stopped manually [17:40:30] Lucas_WMDE: yes, I think so, and happy to do it fwiw [17:40:32] (and if true, that could apply to who knows how many units 😬) [17:40:37] ok [17:40:42] if so thanks in advance [17:40:58] Lucas_WMDE: no problem im still looking at it seems onshot is causing an issue as well. 
im going to roll that back and take another look at the whole issue tomorrow [17:41:00] I’m looking at `systemctl cat wmde-analytics-minutely.{service,timer}` on stat1007 and right now it still has the RemainAfterExit=yes, so puppet didn’t run yet [17:41:05] jbond: ok [17:41:22] !log force puppet run on stat1007 [17:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] already in progress, should pick up the patch [17:42:35] yeah, now it’s gone from the unit file and the service is still active (exited) [17:43:00] stopped [17:43:02] timer should pick it up now [17:43:18] cool, service is inactive again [17:43:24] cool [17:43:40] looking at list-timers, I think wmf_auto_restart_exim4 might be in the same situation [17:43:48] but I know literally nothing about that service [17:43:52] take that with a grain of salt ^^ [17:43:55] yeah me neither :) [17:45:01] (DatasourceNoData) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:45:09] okay, we have multiple new data points in the wikidata analytics [17:45:18] so I think that part is working – thanks a lot jbond and sukhe! [17:45:37] Lucas_WMDE: i ran the job manually about ~30 mins ago [17:45:58] I’m looking at 4 new points in the last 4 minutes [17:46:09] yes it looks like its rinning [17:46:09] Wed 2023-02-22 17:46:00 UTC 4s left Wed 2023-02-22 17:45:01 UTC 53s ago wmde-analytics-minutely.timer wmde-analytics-minutely.service [17:46:26] do you want to leave the task open to check on other timers, or should I close it? 
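The symptom described above (a timer-activated service sitting in "active (exited)" so the timer never starts it again) matches the warning in systemd.timer(5) that services with RemainAfterExit= set are usually not suitable for repetitive timers. A minimal sketch of a check for that state follows. The function name `is_stuck_exited` is hypothetical, and it assumes its input is the `Key=Value` output of `systemctl show <unit> -p ActiveState -p SubState`.

```shell
#!/bin/sh
# Hypothetical helper, not part of any Wikimedia tooling.
# Reads `systemctl show <unit> -p ActiveState -p SubState` output on stdin
# and succeeds (exit 0) only when the unit is "active (exited)".
is_stuck_exited() {
    awk -F= '
        $1 == "ActiveState" { active = $2 }
        $1 == "SubState"    { substate = $2 }
        END { exit !(active == "active" && substate == "exited") }
    '
}

# Example on a real host (unit name taken from the log above):
#   systemctl show wmde-analytics-minutely.service -p ActiveState -p SubState \
#       | is_stuck_exited && echo "needs a manual stop"
```

The check deliberately ignores the RemainAfterExit property itself, because in the log the service stayed "active (exited)" even after the setting had been removed from the unit file.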
[17:46:40] jbond: Lucas_WMDE we are encountering an alert that i think is related to what you are talking about [17:46:43] no please leave it open [17:46:44] cc mforns [17:46:46] (IIUC it would’ve affected any timer that ran in the past… two hours or so) [17:46:50] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [17:46:50] ok ack [17:46:59] haven't totally grokked [17:47:00] but [17:47:00] Lucas_WMDE: yes i agree [17:47:13] should i just run puppet (since something has been reverted) and see if my timers work again? [17:47:20] ottomata: systemd timers haven't been running for the past few hours is the tl;dr [17:47:37] yes give that a go and ping me if not [17:47:41] phewf that is the symptom i see too, thought something was wrong with jobs [17:47:41] okay [17:48:26] (03PS1) 10Elukey: Add fake intermediate PKI key for DSE k8s [labs/private] - 10https://gerrit.wikimedia.org/r/891344 (https://phabricator.wikimedia.org/T330261) [17:48:38] RECOVERY - Ensure cert-sync script runs successfully in the active node on acmechief1001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.done is 16 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [17:48:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake intermediate PKI key for DSE k8s [labs/private] - 10https://gerrit.wikimedia.org/r/891344 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [17:49:58] RECOVERY - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.status is 96 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [17:50:46] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:51:27] dduvall: I have done a bit of log triage and the train looks quiet ;) [17:51:46] hashar: excellent :) [17:52:02] I am calling it a day. See you all tomorrow! 
[17:52:13] jbond: puppet run on an-launcher1002, i see the RemainAfterExit being removed [17:52:20] i would expect [17:52:21] sudo systemctl start gobblin-webrequest.service [17:52:27] to run the service that the timer does [17:52:28] but [17:52:32] doing so, nothign happens [17:52:36] what does status look like? [17:52:39] (03PS4) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) [17:52:39] ottomata: you need to restart it [17:52:41] (03PS1) 10Elukey: Add K8s DSE intermediate PKI configs and public certs [puppet] - 10https://gerrit.wikimedia.org/r/891346 (https://phabricator.wikimedia.org/T330261) [17:52:42] ottomata: (not jbond) but stop the service first [17:52:47] that worked for us on a few other ones [17:52:50] and then let the timer pick it up again [17:52:50] we need to restart all systemd timer schedule services? [17:53:03] ottomata: yes im writing a cumin now [17:53:24] okay, yeah thank you there are many 10s (maybe 100s?) 
[17:54:06] ott the following one-liner should recover things [17:54:07] systemctl list-timers | awk '/n\/a/ {print $NF}' | while read line ; do echo sudo systemctl restart $line ; done [17:54:25] dcausse: FYI, our edit rate stats are back but I’m not yet seeing any query service lag in the maxlag panel – not sure if that part is working properly [17:54:31] though maybe the ^ ongoing restarts will fix that [17:54:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: T329864', diff saved to https://phabricator.wikimedia.org/P44737 and previous config saved to /var/cache/conftool/dbconfig/20230222-175434-root.json [17:54:39] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [17:55:10] (I don’t know if there’s also a systemd timer somewhere in the query service lag pipeline or not) [17:55:39] (03CR) 10Dzahn: [C: 03+2] site: differentiate between both serviceops teams for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [17:55:47] (03PS3) 10Dzahn: site: differentiate between both serviceops teams for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/890014 [17:56:33] jbond: if i don't want each service to run via that command, can I just stop instead of restart? [17:56:40] and the next scheduled run will pick it up? [17:57:20] ottomata: could you try stopping the service; restarting the timer and see what list-timers shows? [17:57:56] (03PS5) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) [17:58:18] jbond: [17:58:19] e.gh. [17:58:20] e.g. [17:58:21] Wed 2023-02-22 18:05:00 UTC 6min left n/a n/a gobblin-event_default_test.timer gobblin-event_default_test.service [17:58:28] so i think it will schedule it? 
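The "stop the service, restart the timer" variant discussed above can be sketched as a dry run that only prints the commands it would run. This is an illustration, not the command that was actually used on the fleet. The function name `plan_timer_recovery` is hypothetical, and it assumes `systemctl list-timers` rows end in the timer unit followed by the service it activates, as in the sample row quoted in the log.

```shell
#!/bin/sh
# Hypothetical helper: reads `systemctl list-timers` output on stdin and
# prints (does not execute) one stop and one timer restart per affected row.
plan_timer_recovery() {
    awk '
        # Keep rows that mention n/a and end in "<name>.timer <name>.service";
        # this skips the header and the trailing "N timers listed." line.
        NF >= 2 && /n\/a/ && $(NF-1) ~ /\.timer$/ && $NF ~ /\.service$/ {
            print "systemctl stop " $NF
            print "systemctl restart " $(NF-1)
        }
    '
}

# Example on a real host:
#   systemctl list-timers --no-pager | plan_timer_recovery
```

Like the original one-liner, this matches any row containing n/a, so timers that have simply never run yet would be listed too. Reviewing the printed commands before piping them to a shell keeps that from causing surprises.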
[17:58:43] ottomata: what servr is this ill take a quick look [17:58:54] that one is an-test-coord1001.eqiad.wmnet [17:59:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39800/console" [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [17:59:49] Lucas_WMDE: I think wdqs lag -> maxlag is collected via a mw maint script? [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1800) [18:00:10] ottomata: yes that looks good to me [18:00:14] it seems to be some prometheus-based thing in the Wikidata.org extension [18:00:20] ill use that fix for the rest of the fleet as well thanks [18:01:48] dcausse: you’re right, there is a maintenance script too, I thought it was querying prometheus live in the request [18:01:53] so this is probably the timer https://gerrit.wikimedia.org/g/operations/puppet/+/baa0836c8405f3ad110935655e9039b27dd12de7/modules/profile/manifests/mediawiki/maintenance/wikidata.pp#25 [18:01:59] ok thanks jbond, if you are running a ffull fleet fix [18:02:00] that will hopefully be fixed soon [18:02:03] i'll wait for your cumin run [18:03:13] (03PS1) 10Dzahn: Revert "ci::firewall: allow http monitoring from prometheus hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891291 [18:03:22] (03CR) 10CI reject: [V: 04-1] Revert "ci::firewall: allow http monitoring from prometheus hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891291 (owner: 10Dzahn) [18:05:01] (DatasourceNoData) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [18:05:20] 10SRE, 10Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (10BCornwall) 05Open→03Stalled [18:05:53] (03CR) 10Elukey: [V: 03+1] "Better now! 
Ready for a review ;)" [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [18:06:17] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:06:32] 10SRE, 10Traffic-Icebox: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10BCornwall) [18:06:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) 05Open→03In progress [18:06:45] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [18:08:17] jbond: i see changed list-timers output, guessing you have run your command? [18:08:48] !log stop all failed timer servies and restart the corrosponding timer unit [18:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:52] ottomata: yes :) [18:08:55] just ran now [18:08:57] ty! [18:09:29] no thanks every one [18:09:49] thanks a lot! [18:10:00] Lucas_WMDE: sukhe: hopefully th issues are all resolved now [18:10:15] jbond: all good on the issues we noticed so far, yep [18:10:15] thanks! 
[18:10:28] sukhe: specifically for on call its possible that there may be some other issues that buble up fomr this so just keep it in mind [18:10:29] I’m watching https://www.wikidata.org/w/api.php?action=query&format=json&maxlag=-1, hopefully the type will change from db to wikibase-queryservice soon :) [18:10:34] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [18:10:41] and ill be around for the next few hours so please ping if needed [18:10:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'depool db1112', diff saved to https://phabricator.wikimedia.org/P44738 and previous config saved to /var/cache/conftool/dbconfig/20230222-181046-ladsgroup.json [18:10:52] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump-s4.service,cirrussearch-dump-s8.service,wikidatajson-dump.service,wikidatardf-all-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:42] 10SRE, 10Infrastructure-Foundations: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (10ssingh) [18:11:52] PROBLEM - Check systemd state on poolcounter1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:10] jbond: thanks, brett and mutante ^ please be aware [18:12:34] ack [18:12:49] (here too, if that matters) [18:13:18] it does matter :* [18:13:24] haha [18:15:48] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [18:16:45] ok, I dont know what exactly is happening but that looks like the puppetmaster is down? [18:18:39] yeah the https check is failing? 
[18:18:40] jbond: seems like that would qualify for the "ping if needed"? [18:19:28] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 6.644 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [18:19:38] cool, probably the service restarts ) [18:20:08] mutante: https://puppetboard.wikimedia.org/report/puppetmaster2001.codfw.wmnet/e39e9d93d54be5a1f259f4d7a59d11c69775e0b8 [18:20:09] ok, I don't know what restarts but this looks much better [18:20:19] we might see a few other such alerts too [18:20:23] I also have a service that didn’t restart, but puppetmaster down sounds more important [18:20:41] Lucas_WMDE: not down but a monitoring check that was failing [18:21:11] ok, I see the recovery now [18:21:29] in that case – mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002 is still active (exited) [18:21:30] what's up with the systemd timers? I have several cloud vps instances which don't seem to be running the puppet timer and am wondering if that's related [18:21:33] cc jbond [18:21:39] sukhe: mutante: i think that the puppet masters may be getting a bit or extra load do to restarting the timers [18:21:45] ill check on it in 30 mins [18:21:54] taavi: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/d7a044a76158c80d333ae590d37cb3b6542c184d revert for that [18:22:13] taavi: you might need to manually stop the services activated by the timers (after the puppet change is rolled out) [18:22:22] if systemctl status says it’s “active (exited)” [18:22:27] hm. that fix is not being applied because puppet is not running due to that bug :/ [18:22:33] hmm. ok that is worse [18:22:52] yeah, puppet-agent-timer.service says 'active (exited)' [18:23:09] so we would need to cumin 'systemctl stop puppet-agent-timer.service' or something similar? 
(or run puppet via cumin) [18:23:17] er [18:23:37] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:23:59] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 2.629 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [18:24:29] taavi: this is what i did on production https://phabricator.wikimedia.org/P44739. i can look at cloud in a bit but still some fall out in production [18:24:29] taavi: which host is that? [18:24:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [18:24:51] for the prod hosts at least, I see no issues on a few random ones I tried [18:24:55] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:24:58] sukhe: any WMCS instance I log in to [18:25:51] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01036 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:26:50] guess I was wrong on the failures [18:27:01] but then this is expected because it seems like the hosts can't reach puppetmasters [18:27:08] sukhe: i think that is tiggered from when the puppetmasters where down [18:27:10] yep [18:27:14] running puppet on failed nodes now [18:27:18] request https://puppet:8140//puppet/v3/file_metadatas/modules/admin/home/... 
timed out after 60.263 seconds [18:28:33] taavi: I guess you’d also need to cumin something like `sed -i '/RemainAfterExit/d' /lib/systemd/system/puppet-agent-timer.service` [18:28:44] (but probably don’t just take my word for it) [18:30:40] Lucas_WMDE: mwmaint looks good to me [18:30:46] jbond: I’m now happy with mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002, thank you :) [18:30:49] jinx ^^ [18:30:57] dcausse: mediawiki should have query service maxlag information again [18:31:07] i think i'll run the two fix commands above again in 30 mins once puppet has run everywhere to make sure we get any slackers [18:31:11] :) [18:31:17] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:33:14] !log planet* - stopping and restarting all the timers for the various languages, commands from https://phabricator.wikimedia.org/P44739 [18:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:10] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10Ottomata) Also fine with my work being licensed at Apache 2.0. Thank you! [18:42:37] RECOVERY - Check systemd state on poolcounter1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:47] 10SRE, 10noc.wikimedia.org, 10serviceops: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (10Dzahn) a:05Dzahn→03None removing assignee based on automated mail from Andre pointing out it has been assigned... 
[18:46:29] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44740 and previous config saved to /var/cache/conftool/dbconfig/20230222-184908-root.json [18:49:55] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) Sharing Shopify's latest update below. If anyone has any ideas, please send them my way since I still have no clue what to do! I made a few tests and found the issue. The subd... [18:50:15] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10Jclark-ctr) @MatthewVernon Can you advise when and what Server you would like to test in [18:50:21] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-discovery-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:35] PROBLEM - Check systemd state on restbase1033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:39] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s2.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s6.service,mediawiki_job_growthexperiments-refreshLinkRecommendati [18:50:39] 
ervice,mediawiki_job_growthexperiments-updateMenteeData-s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:32] (03CR) 10Slyngshede: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [18:52:07] PROBLEM - Check systemd state on registry1004 is CRITICAL: CRITICAL - degraded: The following units failed: build-homepage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:13] PROBLEM - Check systemd state on cp6004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:19] (03CR) 10Gergő Tisza: [C: 03+1] Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [18:53:20] (03CR) 10Gergő Tisza: [C: 03+1] growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [18:54:12] (03CR) 10Slyngshede: [C: 03+2] P:IDM Ensure that social auth can lookup username. [puppet] - 10https://gerrit.wikimedia.org/r/891307 (owner: 10Slyngshede) [18:58:12] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[19-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001 [19:00:05] hashar and dduvall: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1900). nyaa~ [19:00:05] hashar and dduvall: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1900). 
[19:03:21] RECOVERY - Check systemd state on restbase1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44741 and previous config saved to /var/cache/conftool/dbconfig/20230222-190413-root.json [19:09:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [19:11:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:11:49] (03CR) 10Ladsgroup: [C: 03+2] "I updated OMG and nothing has grant on 10.64 and 10.192 anymore. All are 10.%" [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [19:11:54] (03PS2) 10Ladsgroup: mariadb: Update grants to use wikiuser@10.% only [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) [19:11:58] (03CR) 10Ladsgroup: [V: 03+2] mariadb: Update grants to use wikiuser@10.% only [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup) [19:14:30] (03CR) 10Ottomata: [C: 03+2] "We'll also need to add an-airflow1005 to the list of refinery scap targets." 
[puppet] - 10https://gerrit.wikimedia.org/r/890906 (https://phabricator.wikimedia.org/T329870) (owner: 10Ebernhardson) [19:19:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44742 and previous config saved to /var/cache/conftool/dbconfig/20230222-191918-root.json [19:32:52] (03Abandoned) 10Ottomata: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [19:34:07] !log restarted the following an-launcher1002 timers, which seemed stuck (next run = n/a): gobblin-webrequest.timer, reportupdater-browser.timer, reportupdater-reference-previews.timer, refine_event.timer, refine_eventlogging_legacy.timer [19:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44743 and previous config saved to /var/cache/conftool/dbconfig/20230222-193422-root.json [19:38:59] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01036 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:48:47] RECOVERY - Check systemd state on registry1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:38] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [20:07:36] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:08:31] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:08:36] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:14:26] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase10[19-27].eqiad.wmnet: Replace expiring 
keys/certs - eevans@cumin1001 [20:17:32] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[26-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001 [20:22:39] (03PS49) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:23:01] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:30:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) ms-be1072. A4 U27 cableid 20220021 port 42 ms-be1073. B4. U10 cableid 5018 port 12 ms-be1074. E3. U5 cableid 20220227 Port 5 ms-be1075. F3. U1 cableid 20... [20:30:32] (03PS50) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:30:55] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:33:21] (03PS51) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:33:43] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:35:07] (03PS52) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 
10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:35:28] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:36:27] (03PS53) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:36:49] (03CR) 10CI reject: [V: 04-1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:37:27] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[26-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001 [20:47:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Rmaung) [20:51:57] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:31] PROBLEM - puppet last run on puppetdb2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:56:56] (03PS1) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890365 [20:57:33] (03CR) 10CI reject: [V: 04-1] Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890365 (owner: 10Zabe) [20:57:35] (03Abandoned) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890365 (owner: 10Zabe) [20:58:04] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable 
HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) Hi, @SHust. We appear to be running in circles here! What we're after has nothing to do with DNS/domain names/CNAME/A records, etc. This is entirely about adjusting a secur... [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:00:12] * urbanecm waves [21:00:24] jouncebot's right [21:00:29] but i'll deploy few things anyway [21:00:49] PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:01:33] (03PS2) 10Urbanecm: Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) [21:01:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [21:02:19] (03Merged) 10jenkins-bot: Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [21:02:46] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:891308|Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis (T322444)]] [21:02:51] T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444 [21:03:06] (03PS1) 10Urbanecm: Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891293 
(https://phabricator.wikimedia.org/T322444) [21:03:15] (03CR) 10Urbanecm: [C: 03+2] Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [21:03:33] (03PS1) 10Zabe: admin: Update zabe's .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/891360 [21:03:37] (03CR) 10Urbanecm: [C: 03+2] Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [21:10:20] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:891308|Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis (T322444)]] (duration: 07m 33s) [21:10:25] T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444 [21:15:12] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) >>! In T128559#8638381, @SHust wrote: > Sharing Shopify's latest update below. If anyone has any ideas, please send them my way since I still have no clue what to do! Hi, tha... [21:15:35] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:27] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:27] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) P.S. Yea, just listen to what @BCornwall said above. That is going to make it less confusing. And thanks for doing this! 
[21:23:34] (03Merged) 10jenkins-bot: Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm) [21:24:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:891293|Build backend for PersonalizedPraise (T322444)]] [21:24:21] T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444 [21:31:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:891293|Build backend for PersonalizedPraise (T322444)]] (duration: 07m 22s) [21:31:44] T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444 [21:36:10] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:39:24] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:40:16] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:40:22] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:45:34] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330343 (10phaultfinder) [22:11:40] (03CR) 10JHathaway: [C: 03+2] admin: Update zabe's .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/891360 (owner: 10Zabe) [22:25:23] 10SRE, 10Traffic: create a puppetized abstraction for haproxy blocklist hysteresis - https://phabricator.wikimedia.org/T329331 (10BCornwall) p:05Triage→03Low [22:27:28] 10SRE, 10Analytics-Radar, 10Data-Engineering-Icebox, 10Traffic: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015 (10BCornwall) p:05Medium→03Triage [22:29:26] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 
(10BCornwall) p:05Triage→03Low [22:38:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) [22:39:13] (03PS1) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891386 [22:40:14] (03PS2) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891386 [22:40:37] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330343 (10phaultfinder) [22:40:38] (03CR) 10Zabe: [C: 03+2] Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891386 (owner: 10Zabe) [22:41:21] (03Merged) 10jenkins-bot: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891386 (owner: 10Zabe) [22:46:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:48:04] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891387 (https://phabricator.wikimedia.org/T230382) [22:48:06] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891387 (https://phabricator.wikimedia.org/T230382) (owner: 10Zabe) [22:48:49] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891387 (https://phabricator.wikimedia.org/T230382) (owner: 10Zabe) [22:56:08] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10BCornwall) Looks like `interpret_wildcard()` in [[ 
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/media... [22:56:31] !log zabe@deploy1002 Synchronized wmf-config/interwiki.php: T230382 (duration: 07m 06s) [22:56:36] T230382: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 [22:57:24] (03PS1) 10Dzahn: switch planet from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) [22:57:30] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite) [23:00:14] (03CR) 10Dzahn: "openssl x509 -noout -text -in planet.discovery.wmnet.crt | grep DNS" [dns] - 10https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:07:03] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:07:54] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [23:08:48] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:10:33] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:10:57] (03CR) 10Dzahn: [C: 03+1] "double checked and there is no (more) concept of an "active server" for planet. the timers and updates simply run in both DCs all this tim" [dns] - 10https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:27:13] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10BCornwall) Looking into it further, it seems this is a very possible change! nginx mappings/site names support wildcards. Pulling back a bit, does anyth... 
[23:41:14] (03PS2) 10Dzahn: switch planet from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) [23:42:03] (03PS3) 10Dzahn: switch planet from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) [23:55:58] (03PS1) 10Zabe: Initial configuration for azwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015) [23:59:07] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Rename deployment-cache-(text|upload)0x to deployment-cp0x - https://phabricator.wikimedia.org/T280393 (10BCornwall) cp hosts have now been updated to bullseye, FYI