[00:00:07] <wikibugs>	 (PS2) Dzahn: ci::firewall: allow http monitoring from prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972)
[00:06:07] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/890920/39774/contint1002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[00:06:09] <wikibugs>	 (CR) Dzahn: [C: +2] ci::firewall: allow http monitoring from prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[00:13:39] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[00:20:30] <wikibugs>	 (PS1) Dzahn: ci: set port 1443 for https monitoring [puppet] - https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972)
[00:21:55] <wikibugs>	 (PS2) Dzahn: ci: set port 1443 for https monitoring [puppet] - https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972)
[00:23:39] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/890921/39775/contint1002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[00:30:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:34:02] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "the "connection refused" errors in logstash stopped after puppet ran on prometheus hosts as well" [puppet] - https://gerrit.wikimedia.org/r/890921 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[00:37:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:58:35] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[01:10:46] <wikibugs>	 (Abandoned) Jforrester: Reduce height of the article toolbar [mediawiki-config] - https://gerrit.wikimedia.org/r/890140 (https://phabricator.wikimedia.org/T316950) (owner: Sushrith Bogi)
[01:28:40] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[01:29:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:31:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bullseye
[01:32:02] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye
[01:33:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: host reimage
[01:35:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: host reimage
[01:44:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:52:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:52:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2186.codfw.wmnet with OS bullseye
[01:53:05] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye completed: - db2186 (**PASS**)   - Dow...
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:16] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:33:44] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 56 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:36:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:37:34] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (Papaul)
[02:38:01] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (Papaul) Open→Resolved complete
[02:39:34] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 22 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:43:39] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[02:47:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[02:52:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:27:13] <wikibugs>	 (PS1) KartikMistry: Content Translation: Set MT threshold to 45% for Kurdish WP [mediawiki-config] - https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941)
[04:07:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:12:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:31:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:02:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:29:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:18:34] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[06:36:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T0700)
[07:14:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[07:14:22] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi) Open→Resolved a:ayounsi
[07:23:34] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[07:35:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (ayounsi) `asw-b-codfw> show virtual-chassis vc-port statistics extensive | match "FPC|Port|CRC alignment errors" fpc2: Port: vcp-255/0/48     CRC alignment errors:      5642     ` shows that there are indeed errors  ` > show vir...
[07:38:41] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, I'd leave it to Arnold to deploy this" [puppet] - https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) (owner: Muehlenhoff)
[07:39:30] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, I like the naming" [puppet] - https://gerrit.wikimedia.org/r/890014 (owner: Dzahn)
[07:43:31] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) From T311999#8594766 I see that there is progress! Yay!  @jbond / @MoritzMuehlenhoff  In Juniper's form the only information requested when selecting OIDC is `ID token (Ope...
[07:51:21] <wikibugs>	 (CR) Jcrespo: "Blocked on the review & adaptation/deploy of 868392, a patch from *December*." [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[07:52:59] <wikibugs>	 (CR) Nikerabbit: [C: +1] Content Translation: Set MT threshold to 45% for Kurdish WP [mediawiki-config] - https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) (owner: KartikMistry)
[07:54:35] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/890891 (owner: Slyngshede)
[08:00:06] <jouncebot>	 Amir1 and Urbanecm: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T0800).
[08:00:06] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:51] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P44724 and previous config saved to /var/cache/conftool/dbconfig/20230222-080050-jynus.json
[08:02:03] <kart_>	 OK. I'm here and will go ahead with deployment..
[08:04:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) (owner: KartikMistry)
[08:05:40] <wikibugs>	 (Merged) jenkins-bot: Content Translation: Set MT threshold to 45% for Kurdish WP [mediawiki-config] - https://gerrit.wikimedia.org/r/890947 (https://phabricator.wikimedia.org/T324941) (owner: KartikMistry)
[08:09:40] <kart_>	 RoanKattouw: seems commit 7dd89d8f05b4b78626206db74ff7e9506c5d6608 is merged, but not deployed? Shows `There were unexpected commits pulled from origin for /srv/mediawiki-staging.`
[08:11:43] <kart_>	 I looked into bug (T315621) and changes, seems OK to go ahead.
[08:11:43] <stashbot>	 T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621
[08:12:14] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:890947|Content Translation: Set MT threshold to 45% for Kurdish WP (T324941)]]
[08:12:19] <stashbot>	 T324941: Make the Machine translation stricter by 45% in Kurdish Wikipedia - https://phabricator.wikimedia.org/T324941
[08:14:10] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:890947|Content Translation: Set MT threshold to 45% for Kurdish WP (T324941)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:17:55] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1004.eqiad.wmnet with OS bullseye
[08:18:09] <wikibugs>	 (PS1) Slyngshede: Minor fixes found while setting up production env. [software/bitu] - https://gerrit.wikimedia.org/r/891227
[08:22:56] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:890947|Content Translation: Set MT threshold to 45% for Kurdish WP (T324941)]] (duration: 10m 41s)
[08:23:00] <stashbot>	 T324941: Make the Machine translation stricter by 45% in Kurdish Wikipedia - https://phabricator.wikimedia.org/T324941
[08:27:47] <wikibugs>	 (CR) Volans: "reply to previous question" [software/spicerack] - https://gerrit.wikimedia.org/r/857783 (owner: Jbond)
[08:31:01] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:idm::deployment missing comma in Ferm rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890807 (owner: Slyngshede)
[08:36:17] <ryankemper>	 !log [WDQS] Repooled `wdqs20[05,07,10]`
[08:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:47] <dcausse>	 ryankemper: o/ wdqs2010 should not be repooled, it does not have a valid journal 
[08:38:28] <dcausse>	 hosts >= 2009 are not ready to serve user traffic 
[08:43:11] <wikibugs>	 (CR) Jcrespo: [C: -2] "See (potentially) related incident https://phabricator.wikimedia.org/T330258 before proceeding with this. The -2 is to mark the important " [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[08:43:42] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1004.eqiad.wmnet with reason: host reimage
[08:47:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[08:47:08] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1004.eqiad.wmnet with reason: host reimage
[08:49:05] <vgutierrez>	 !log rolling upgrade to HAProxy 2.6.9 in codfw, eqsin, drmrs, esams and eqiad
[08:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:12] <wikibugs>	 (CR) Jelto: [C: +1] "looks good to me and useful in combination with after stanza. I have some concerns that some of the 300+ timers we have rely on the old be" [puppet] - https://gerrit.wikimedia.org/r/890843 (owner: Jbond)
[08:50:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[08:50:29] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[08:52:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[09:00:04] <jouncebot>	 hashar and dduvall: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T0900)
[09:00:05] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/891232 (https://phabricator.wikimedia.org/T325587)
[09:00:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/891232 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[09:00:44] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/891232 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[09:03:01] <logmsgbot>	 !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1004.eqiad.wmnet with OS bullseye
[09:07:57] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.24  refs T325587
[09:08:01] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[09:08:26] <ryankemper>	 dcausse: ah yes, thanks. fortunately looks like the pool command failed on 2010 anyway
[09:09:37] * dcausse loves when a command knows when to fail :)
[09:13:36] <wikibugs>	 (CR) Ladsgroup: mariadb: Update grants to use wikiuser@10.% only (2 comments) [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[09:14:36] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.24  refs T325587 (duration: 06m 38s)
[09:14:40] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[09:18:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/891227 (owner: Slyngshede)
[09:19:15] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Minor fixes found while setting up production env. [software/bitu] - https://gerrit.wikimedia.org/r/891227 (owner: Slyngshede)
[09:20:18] <jinxer-wm>	 (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:20:44] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[09:20:44] <jynus>	 what's that?
[09:20:57] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print 
[09:20:57] <icinga-wm>	 page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton
[09:20:57] <icinga-wm>	 PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 503 Server Error: Backend fetch failed for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[09:21:02] <elukey>	 hashar: this is probably related to your deploy
[09:21:03] <TheresNoTime>	 phab down fwiw
[09:21:10] <tarrow>	 Morning! Since yesterday wikidata is getting a ton of these "[FIRING:1] DatasourceNoData (kK0KSCJ4z "AlertManager","cxserver" Wikidata kK0KSCJ4z Edits: below 30 per minute (for 3 minutes) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Alerts critical wikidata)"
[09:21:22] <hashar>	 elukey: this?
[09:21:25] <godog>	 checking too
[09:21:33] <vgutierrez>	 hmmm
[09:21:35] <icinga-wm>	 PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Backend fetch failed https://wikitech.wikimedia.org/wiki/Debmonitor
[09:21:43] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[09:21:45] <jynus>	 proton, http appservers
[09:21:51] <jynus>	 varnish
[09:21:56] <elukey>	 hashar: yeah we got paged for appservers down, may be something else but it smells weird as coincidence
[09:21:59] <tarrow>	 any ideas if the data being fed into alerting changed a load suddenly around 1708 yesterday?
[09:22:09] <jynus>	 edits at 0
[09:22:10] <jinxer-wm>	 (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:13] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[09:22:14] <hashar>	 at least there is nothing showing up on the mw error dashboards
[09:22:19] <jynus>	 godog: major outage?
[09:22:23] <TheresNoTime>	 (oh I'll revise phab down to wikis down)
[09:22:25] <godog>	 jynus: looks like it
[09:22:29] <elukey>	 hashar: okok let's keep in sync thanks
[09:23:06] <jynus>	 can people confirm no wiki access?
[09:23:20] <TheresNoTime>	 jynus: yes, here, en.wiki
[09:23:20] <claime>	 here if needed
[09:23:21] <vgutierrez>	 R/O is working here via drmrs
[09:23:45] <TheresNoTime>	 (er, UK, whichever DC that is if helpful)
[09:23:46] <vgutierrez>	 and I can successfully log-in and browse en.wp.o
[09:23:46] <elukey>	 +1 for it.wikipedia.org
[09:23:52] <claime>	 R/O working from fr, drmrs also
[09:23:53] <jynus>	 It is app servers
[09:23:57] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:24:01] <jinxer-wm>	 (Wikidata Reliability Metrics - Median Payload alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert
[09:24:06] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:17] <jynus>	 TheresNoTime: yes no access or yes access?
[09:24:21] <hoo>	 Phabricator/Grafana also not accessible for me
[09:24:29] <vgutierrez>	 phab is working here via drmrs as well
[09:24:30] <TheresNoTime>	 jynus: no access
[09:24:36] <hashar>	 I am going to rollback to rule out the train
[09:24:44] <jynus>	 creating an incident
[09:25:01] <godog>	 jynus: I've linked the doc 
[09:25:02] <claime>	 eqiad appservers rpm down 50%
[09:25:05] <godog>	 in -security
[09:25:15] <kostajh>	 phabricator is offline too, though, so train is probably not related? I cannot access phabricator or en.wikipedia.org (from Germany)
[09:25:18] <jinxer-wm>	 (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:31] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:25:38] <jynus>	 https://www.wikimediastatus.net/
[09:26:57] <jinxer-wm>	 (ProbeDown) firing: (11) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:27:32] <Bsadowski1>	 :o
[09:28:00] <Bsadowski1>	 Good I checked here. I was looking something up about a music artist and there's an error :P
[09:28:16] <giraffe>	 it's very minor but the chan topic still says that the status is up
[09:28:19] <giraffe>	 hah
[09:28:37] <jynus>	 it also says to check the official location for reporting issues 0:-)
[09:28:38] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:28:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:28:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:29:08] <hashar>	 doing the rollback, scap failed to rebuild the images due to Docker receiving a `HTTP status: 503 Backend fetch failed`, filed as https://phabricator.wikimedia.org/T330264  . Might be unrelated to the ongoing issue
[09:29:23] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on cloudweb1004 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:29:41] <hashar>	 Docker failed to fetch from the docker-registry, so I am wondering whether we might have a network related issue of some sort
[09:29:57] <claime>	 hashar: please come on -sec
[09:29:57] <kostajh>	 perhaps to do with https://sal.toolforge.org/log/4y9PeIYBtR_B8fLx3spz? I can't consult wikitech to see how we use HAProxy :|
[09:30:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:30:21] <Amir1>	 vgutierrez: maybe https://sal.toolforge.org/log/4y9PeIYBtR_B8fLx3spz?
[09:30:27] <TheresNoTime>	 kostajh: https://wikitech-static.wikimedia.org/wiki/Main_Page
[09:30:28] <hashar>	 kostajh: https://wikitech-static.wikimedia.org/wiki/Main_Page might work ;)
[09:30:34] <TheresNoTime>	 (heh)
[09:30:38] <vgutierrez>	 Amir1: not related AFAIK
[09:30:39] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.40.0-wmf.23" - T325587
[09:31:29] <hashar>	 (train rolled back)
[09:31:51] <icinga-wm>	 PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[09:34:38] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=esams,service=ats-be,cluster=cache_text
[09:34:45] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=ats-be,cluster=cache_text
[09:35:23] <icinga-wm>	 RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (1961 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[09:35:27] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:35:56] <TheresNoTime>	 phab back, en.wiki back
[09:35:59] <icinga-wm>	 RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1094 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[09:36:02] <Kizule>	 Hello, sr.wikipedia and Zuul aren't working. Is there some maintenance or?
[09:36:10] <akosiaris>	 Kizule: just fixed 
[09:36:16] <akosiaris>	 try again please
[09:36:18] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1005.eqiad.wmnet with OS bullseye
[09:36:19] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:19] <Kizule>	 Amazing, it works.
[09:37:14] <jynus>	 nice
[09:37:18] <jinxer-wm>	 (ProbeDown) resolved: (11) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:46] <hashar>	 I will let things settle and promote group1 again
[09:39:00] <jinxer-wm>	 (Wikidata Reliability Metrics - Median Payload alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert
[09:39:04] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:39:09] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:40:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[09:40:18] <jinxer-wm>	 (ProbeDown) resolved: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:44] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[09:41:43] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[09:42:37] <wikibugs>	 (PS1) Hashar: Revert "group1 wikis to 1.40.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/891237 (https://phabricator.wikimedia.org/T325587)
[09:42:39] <wikibugs>	 (CR) Hashar: [C: +2] Revert "group1 wikis to 1.40.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/891237 (https://phabricator.wikimedia.org/T325587) (owner: Hashar)
[09:43:13] <wikibugs>	 (Merged) jenkins-bot: Revert "group1 wikis to 1.40.0-wmf.24" [mediawiki-config] - https://gerrit.wikimedia.org/r/891237 (https://phabricator.wikimedia.org/T325587) (owner: Hashar)
[09:43:30] <wikibugs>	 (CR) Slyngshede: [C: +2] idm.wikimedia.org CNAME to idm1001.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/890891 (owner: Slyngshede)
[09:43:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:43:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:44:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix condition for including haveged [puppet] - https://gerrit.wikimedia.org/r/890816 (owner: Muehlenhoff)
[09:45:02] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/891238 (https://phabricator.wikimedia.org/T325587)
[09:45:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/891238 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[09:45:24] <hashar>	 the issue has been figured out and is unrelated to MediaWiki deployment so I am proceeding again
[09:45:40] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/891238 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[09:46:23] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2017.codfw.wmnet with OS bullseye
[09:47:52] <akosiaris>	 someone update wikimediastatus.net please
[09:48:03] <jynus>	 doing
[09:48:35] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1005.eqiad.wmnet with reason: host reimage
[09:51:23] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1005.eqiad.wmnet with reason: host reimage
[09:52:30] <wikibugs>	 SRE, Traffic, User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (MoritzMuehlenhoff) Per the bug that should be fixed in the auditd package in Bullseye, we'll be able to confirm when we reimage the doh* servers to Bullseye.
[09:52:55] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.24  refs T325587
[09:52:58] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[09:54:47] <wikibugs>	 (CR) Jcrespo: mariadb: Update grants to use wikiuser@10.% only (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[09:56:06] <wikibugs>	 (CR) Jcrespo: mariadb: Update grants to use wikiuser@10.% only (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[09:57:16] <wikibugs>	 (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/891241
[09:59:28] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.24  refs T325587 (duration: 06m 33s)
[09:59:32] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[10:01:40] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage
[10:01:48] <icinga-wm>	 RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[10:02:15] <wikibugs>	 (PS1) Slyngshede: P:idm move OIDC endpoint to variable. [puppet] - https://gerrit.wikimedia.org/r/891242
[10:04:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2018.codfw.wmnet with OS bullseye
[10:04:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2019.codfw.wmnet with OS bullseye
[10:04:42] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage
[10:04:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2020.codfw.wmnet with OS bullseye
[10:05:04] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2021.codfw.wmnet with OS bullseye
[10:05:50] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39776/console" [puppet] - https://gerrit.wikimedia.org/r/891242 (owner: Slyngshede)
[10:07:05] <wikibugs>	 (PS2) Slyngshede: P:idm move OIDC endpoint to variable. [puppet] - https://gerrit.wikimedia.org/r/891242
[10:07:47] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[10:08:21] <claime>	 !log Starting sre.switchdc.mediawiki live test preparation steps
[10:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job swagger_check_eventstreams_internal_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:12:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[10:12:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/891241 (owner: Muehlenhoff)
[10:13:23] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[10:14:29] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (SLyngshede-WMF) @ayounsi doesn't it need an URL as well, for the endpoint?
[10:14:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (5) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:16:06] <wikibugs>	 (PS3) Slyngshede: P:idm move OIDC endpoint to variable. [puppet] - https://gerrit.wikimedia.org/r/891242
[10:16:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:17:25] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39778/console" [puppet] - https://gerrit.wikimedia.org/r/891242 (owner: Slyngshede)
[10:17:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[10:18:47] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2018.codfw.wmnet with reason: host reimage
[10:18:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage
[10:19:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage
[10:19:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (5) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:20:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[10:20:25] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage
[10:21:08] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2017.codfw.wmnet with OS bullseye
[10:21:33] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2018.codfw.wmnet with reason: host reimage
[10:21:47] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) For the record: some doc on {F36864730} as well as https://jnprprod.devportal-aw-us.webmethods.io/portal/apis
[10:21:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:22:08] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (jcrespo) Open→In progress
[10:22:19] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1005.eqiad.wmnet with OS bullseye
[10:24:30] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) >>! In T306238#8636356, @SLyngshede-WMF wrote: > @ayounsi doesn't it need an URL as well, for the endpoint?  I guess they will give it to us later on in the onboarding proc...
[10:24:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage
[10:24:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (5) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:25:07] <wikibugs>	 (CR) Nicolas Fraison: [C: +2] fix(presto): fix typo from node.enviroment to node.environment [puppet] - https://gerrit.wikimedia.org/r/889807 (owner: Nicolas Fraison)
[10:26:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage
[10:27:50] <wikibugs>	 (PS1) Jbond: idp: Add juniper OIDC service [puppet] - https://gerrit.wikimedia.org/r/891245
[10:28:17] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/891242 (owner: Slyngshede)
[10:28:35] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:idm move OIDC endpoint to variable. [puppet] - https://gerrit.wikimedia.org/r/891242 (owner: Slyngshede)
[10:28:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage
[10:28:52] <wikibugs>	 (CR) Jbond: [C: +2] idp: Add juniper OIDC service [puppet] - https://gerrit.wikimedia.org/r/891245 (owner: Jbond)
[10:30:53] <wikibugs>	 SRE, Traffic, observability: varnish-frontend-fetcherr sets incorrect level in logstash - https://phabricator.wikimedia.org/T330267 (Vgutierrez)
[10:33:09] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert)
[10:33:28] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Open→In progress p:Triage→High
[10:33:36] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[10:33:44] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[10:35:00] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[10:35:28] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[10:36:11] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert)
[10:36:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:36:34] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 10:35 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl 10:35 <+logmsgbot> !log cgoubert@cu...
[10:38:03] <wikibugs>	 (PS1) Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - https://gerrit.wikimedia.org/r/891248
[10:38:53] <wikibugs>	 (PS1) Slyngshede: P:IDM Fix url for OIDC endpoint [puppet] - https://gerrit.wikimedia.org/r/891249
[10:39:26] <wikibugs>	 (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39779/console" [puppet] - https://gerrit.wikimedia.org/r/891248 (owner: Nicolas Fraison)
[10:39:34] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2018.codfw.wmnet with OS bullseye
[10:39:56] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39780/console" [puppet] - https://gerrit.wikimedia.org/r/891249 (owner: Slyngshede)
[10:40:58] <wikibugs>	 (CR) Jelto: [C: +1] Update DNS to switch gitlab-replica (1 comment) [dns] - https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: EoghanGaffney)
[10:40:59] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2020.codfw.wmnet with OS bullseye
[10:41:47] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/891249 (owner: Slyngshede)
[10:41:52] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:IDM Fix url for OIDC endpoint [puppet] - https://gerrit.wikimedia.org/r/891249 (owner: Slyngshede)
[10:42:36] <wikibugs>	 (PS2) Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - https://gerrit.wikimedia.org/r/891248
[10:43:30] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:43:50] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:45:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2021.codfw.wmnet with OS bullseye
[10:45:11] <wikibugs>	 (PS3) Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - https://gerrit.wikimedia.org/r/891248
[10:45:58] <wikibugs>	 (PS2) Vgutierrez: varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[10:46:04] <wikibugs>	 (PS4) Nicolas Fraison: presto: remove - in the cluster name used in node.environment [puppet] - https://gerrit.wikimedia.org/r/891248
[10:46:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:12] <wikibugs>	 (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39783/console" [puppet] - https://gerrit.wikimedia.org/r/891248 (owner: Nicolas Fraison)
[10:47:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2019.codfw.wmnet with OS bullseye
[10:48:21] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Skipping `00-optional-warmup-caches` as the node script is broken and [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/890299 |...
[10:48:44] <wikibugs>	 (CR) Nicolas Fraison: [V: +1 C: +2] presto: remove - in the cluster name used in node.environment [puppet] - https://gerrit.wikimedia.org/r/891248 (owner: Nicolas Fraison)
[10:48:46] <wikibugs>	 SRE, MediaWiki-File-management, Traffic, Patch-For-Review, Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (Vgutierrez) I think so, I've taken the liberty of amending the commit and adding a test for the new header as well
[10:49:12] <wikibugs>	 (CR) Vgutierrez: [C: +1] "tests seem to be happy:" [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[10:56:42] <wikibugs>	 (PS1) AikoChou: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw [deployment-charts] - https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218)
[10:59:22] <wikibugs>	 (CR) Elukey: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[11:00:04] <jouncebot>	 claime: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1100). nyaa~
[11:00:24] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[11:00:35] <claime>	 jynus, godog, heads up, starting live-test
[11:01:01] <jynus>	 ok
[11:01:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[11:01:06] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[11:01:31] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:01 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet 11:01 <+logmsgbot> !log cgouber...
[11:01:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks
[11:01:49] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0)
[11:02:06] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:01 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks 11:01 <+logmsgbot>...
[11:02:26] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[11:02:38] <wikibugs>	 (PS1) Slyngshede: P:IDM Strip / in OIDC url [puppet] - https://gerrit.wikimedia.org/r/891254
[11:02:42] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0)
[11:03:08] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:02 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance 11:02 <+logmsgbot> !log cgoub...
[11:03:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
[11:03:19] <logmsgbot>	 !log cgoubert@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2023-02-22 11:03:19.149671
[11:03:21] <wikibugs>	 (CR) Slyngshede: [C: +2] P:IDM Strip / in OIDC url [puppet] - https://gerrit.wikimedia.org/r/891254 (owner: Slyngshede)
[11:03:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0)
[11:03:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[11:04:16] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=99)
[11:04:18] <claime>	 [2/3, retrying in 9.00s] Attempt to run 'spicerack.mysql_legacy.MysqlLegacy._check_core_master_in_sync' raised: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1
[11:04:26] <claime>	 Amir1 ?
[11:04:58] <wikibugs>	 (PS2) Slyngshede: Switch to built in LogoutView. [software/bitu] - https://gerrit.wikimedia.org/r/883113
[11:05:03] <Amir1>	 at meeting but it doesn't look problematic
[11:05:10] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:03 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly 11:03 <+logmsgbot> !log cgoubert@...
[11:05:20] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` spicerack.mysql_legacy.MysqlLegacyError: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1 `
[11:05:26] <Amir1>	 claime: https://orchestrator.wikimedia.org/web/cluster/alias/s1 db1118 is the master
[11:05:43] <Amir1>	 and should have the heartbeat in heartbeat db (heartbeat table)
[11:05:54] <jynus>	 maybe it is trying to get events from the codfw master, and they won't reach unless done for real?
[11:06:04] <wikibugs>	 (CR) Slyngshede: [V: +2] Switch to built in LogoutView. [software/bitu] - https://gerrit.wikimedia.org/r/883113 (owner: Slyngshede)
[11:06:07] <claime>	 possible
[11:06:13] <jynus>	 or circular replication is set up
[11:06:47] <jynus>	 is it failing only there, or just happens to be the first checked?
[11:07:32] <claime>	 No the others seem good I think
[11:08:15] <claime>	 Hmm idk, it's not outputting which servers it's checking at each step
[11:08:19] <claime>	 lemme check debug
[11:09:05] <jynus>	 my guess is it may work, but needs circular replication, which will be set up only a few days before the switchover
[11:09:14] <claime>	 it looks good for the other sections
[11:09:40] <jynus>	 in any case, in an emergency it is not a big deal, it should happen anyway
[11:10:02] <jynus>	 if the other sections work, then there may be a grant issue or something else
[11:10:10] <claime>	 Hmm it looks like the same issue with x2 possibly
[11:10:16] <claime>	 pyparsing.ParseException: Expected end of text, found ':'  (at char 1), (line:1, col:2)
[11:10:31] <claime>	 cumin.backends.InvalidQueryError: Unexpected boolean operator 'and' with hosts ''
[11:10:41] <claime>	 Yeah, there's an empty cumin query somewhere
[11:11:10] <claime>	 I'll log the stacktrace in the task and proceed and we debug later?
[11:11:41] <jynus>	 sure, but I would do the whole process again later
[11:11:49] <claime>	 ack
[11:11:52] <jynus>	 (when fixed)
[11:12:33] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Error seems to come from cumin query: ` 2023-02-06 12:09:06,872 DRY-RUN cgoubert 2367071 [ERROR _menu.py:261 in run] Exception raised...
[11:13:03] <logmsgbot>	 !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab2002.wikimedia.org with reason: Running failover to gitlab1003 - T329930
[11:13:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki
[11:13:07] <stashbot>	 T329930: Switchover gitlab-replica (gitlab2002 -> gitlab1003) - https://phabricator.wikimedia.org/T329930
[11:13:14] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0)
[11:13:18] <logmsgbot>	 !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab2002.wikimedia.org with reason: Running failover to gitlab1003 - T329930
[11:13:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[11:13:39] <wikibugs>	 (PS2) AikoChou: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw [deployment-charts] - https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218)
[11:13:39] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[11:13:51] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[11:13:51] <logmsgbot>	 !log cgoubert@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2023-02-22 11:13:51.466468
[11:13:51] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[11:14:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners
[11:14:06] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0)
[11:14:27] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[11:15:05] <wikibugs>	 (CR) AikoChou: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: AikoChou)
[11:15:46] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:13 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki 11:13 <+logmsgbot> !log cgoub...
[11:16:26] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[11:16:53] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl
[11:17:26] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0)
[11:18:38] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters
[11:24:15] <jynus>	 how many steps left?
[11:24:19] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0)
[11:24:33] <claime>	 done
[11:24:37] <claime>	 with the cookbook steps
[11:24:42] <jynus>	 that one seemed quite long!
[11:24:55] <claime>	 PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (43/43) [05:29<00:00,  7.67s/hosts]
[11:24:57] <claime>	 Yeah
[11:24:58] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:18 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters 11:24 <jynus> how man...
[11:25:54] <claime>	 It runs sequentially on quite a few hosts
[11:25:59] <claime>	 (all the DB masters)
[11:26:20] <wikibugs>	 (PS1) KartikMistry: Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890863 (https://phabricator.wikimedia.org/T329893)
[11:26:22] <moritzm>	 !log installing git security updates
[11:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:32] <jynus>	 I see, probably it is one that is not time-sensitive
[11:26:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:26:50] <claime>	 jynus: No, we're out of read-only at that point
[11:27:03] <claime>	 read only steps are 2-7
[11:27:03] <jynus>	 nginx @ codfw?
[11:27:17] <jynus>	 wait, we have some issue, is it only monitoring?
[11:28:37] <wikibugs>	 (PS1) KartikMistry: Fix contribution menu entrypoint in vector-2022 skin [extensions/ContentTranslation] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/890864 (https://phabricator.wikimedia.org/T329893)
[11:29:00] <jynus>	 yeah, it is prometheus
[11:29:10] <jynus>	 I was confused by the wording "Reduced availability for job nginx"
[11:29:17] <claime>	 sudo confctl  select 'dc=codfw,service=nginx' get says everything pooled
[11:29:43] <jynus>	 really meaning "reduced availability on prometheus scraping job for nginx"
[11:30:05] <jynus>	 I read it as "reduced availability for nginx" :-D
[11:30:07] <claime>	 yeah, it's confusing
[11:30:12] <icinga-wm>	 RECOVERY - Wikitech and wt-static content in sync on cloudweb1004 is OK: wikitech-static OK - wikitech and wikitech-static in sync (41139 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:30:39] <jynus>	 it's fine, not sure if you have to do more operations re test?
[11:31:02] <claime>	 No I think we're good now
[11:31:36] <jynus>	 debugging then, it is, but seems like an easy fix if just a cumin query issue
[11:31:44] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:31:58] <claime>	 jynus: about the nginx job, it's gitlab
[11:32:05] <claime>	 https://grafana.wikimedia.org/goto/yMtb46JVz?orgId=1
[11:32:31] <volans>	 claime: I'm back from a very long meeting, I haven't read the backlog, anything I can help with?
[11:32:31] <claime>	 And that's probably because eoghan and jelto are running a switchover on gitlab 
[11:32:45] <jynus>	 ok, good
[11:32:45] <claime>	 volans: Can you help me check out https://phabricator.wikimedia.org/T330271#8636689 ?
[11:33:11] <claime>	 It's the only thing that crapped out in the whole live-test
[11:33:31] <volans>	 that's a query that does 'foo and bar ...' and foo returns 0 hosts
[11:33:36] <claime>	 yep
[11:33:43] <jelto>	 we are switching the gitlab-replica. That should not have an impact
[11:33:57] <volans>	 which cookbook was that?
[11:34:01] <jelto>	 at least from gitlab replicas being down 
[11:34:02] <jynus>	 jelto: thanks, no issue, just the alerting was confusing to me at first
[11:34:07] <volans>	 Cookbook sre.switchdc.mediawiki.03-set-db-readonly ?
[11:34:07] <claime>	 volans: sre.switchdc.mediawiki.03-set-db-readonly
[11:34:12] <claime>	 heh
[11:35:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:35:19] <volans>	 I also see another failure in the logs
[11:35:19] <volans>	 spicerack.mysql_legacy.MysqlLegacyError: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1
[11:35:37] <volans>	 (later on)
[11:36:17] <claime>	 volans: yes, that's the error that bubbled up in the cookbook, the stacktrace is from the spicerack log
[11:36:27] <volans>	 x2???
[11:36:32] <volans>	 didn't we ditch it
[11:36:56] <jynus>	 a deploy defect, maybe?
[11:37:10] <claime>	 I thought so: cgoubert@cumin1001:~/cookbooks$ dpkg -l | grep spicerack
[11:37:12] <claime>	 ii  spicerack                            6.2.1-1+deb11u1                amd64        Automation and orchestration library for WMF, written in Python
[11:37:22] <claime>	 that's the right version right volans ?
[11:37:28] <volans>	 yes
[11:37:32] <volans>	 checking
[11:37:35] <jynus>	 weird then
[11:37:38] <claime>	 ty <3
[11:37:52] <claime>	 jynus: may be my patch sucked, idk
[11:38:11] <claime>	 volans: I'll create a specific task for it
[11:38:15] <jynus>	 ha ha, don't think so, usually it is the smallest issue
[11:39:14] <Amir1>	 meeting over, do you need anything now claime ?
[11:39:45] <claime>	 Amir1: volans is helping debug the 03-set-db-readonly error
[11:40:05] <Amir1>	 okay, I'm around. Ping me if needed
[11:40:21] <claime>	 Thanks <3
[11:41:15] <volans>	 claime: that stacktrace is from the 6th...
[11:41:25] <claime>	 ffs
[11:41:27] <claime>	 sorry
[11:41:30] <jynus>	 ha ha
[11:41:44] <claime>	 let me check the right log then
[11:42:17] <taavi>	 I am about to leave but -tech has a report of users seeing read-only errors
[11:42:40] <jynus>	 taavi: which wiki? en?
[11:42:44] <taavi>	 bn
[11:42:48] <Bsadowski1>	 yeah bn
[11:43:26] <jynus>	 that's s3
[11:43:27] <claime>	 that's not normal, we should not be changing the RO status in the live DC during the live-test
[11:44:09] <jynus>	 db1166 errors
[11:44:30] <jynus>	 I will depool and later debug
[11:44:33] <claime>	 ack
[11:44:47] <volans>	 claime: the cookbook issue is that on db1118
[11:44:51] <volans>	 SELECT ts FROM heartbeat.heartbeat WHERE datacenter = 'codfw' and shard = 's1' ORDER BY ts DESC LIMIT 1;
[11:44:51] <claime>	 yes
[11:44:54] <volans>	 Empty set (0.000 sec)
[11:45:16] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1166, seen mw errors', diff saved to https://phabricator.wikimedia.org/P44726 and previous config saved to /var/cache/conftool/dbconfig/20230222-114515-jynus.json
[11:45:16] <volans>	 so it can't get the heartbeat
[11:46:36] <jynus>	 we have grant errors, Amir1
[11:46:54] <Amir1>	 sigh
[11:47:07] <Amir1>	 what is the user
[11:47:08] <jynus>	 I depooled db1166, but it may be deeper
[11:47:24] <jynus>	 not sure if grant or network, but authentication is failing at random
[11:47:51] <Amir1>	 where are you seeing this? nothing in https://logstash.wikimedia.org/goto/83874c6c6b848b8236a12c8f470be6f8
[11:48:07] <Amir1>	 network fails at random. It might not be grants
[11:48:23] <claime>	 db1166 has not been touched by the cookbook
[11:48:23] <taavi>	 read-only errors are in https://logstash.wikimedia.org/goto/ae819cf364437054868c6cf56829bee1, and not limited to s3
[11:48:47] <jynus>	 it started at 11:03
[11:48:55] <jynus>	 so maybe test related
[11:49:04] <Amir1>	 https://orchestrator.wikimedia.org/web/clusters
[11:49:07] <jynus>	 I am repooling db1166, seems something else
[11:49:11] <Amir1>	 nothing has issues on db side
[11:49:24] <Amir1>	 I think the test might have actually set production to read-only
[11:49:27] <Amir1>	 let me check dbctl
[11:49:41] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1166, errors not fixed', diff saved to https://phabricator.wikimedia.org/P44727 and previous config saved to /var/cache/conftool/dbconfig/20230222-114940-jynus.json
[11:49:46] <taavi>	 the timing matches with the read-only cookbook
[11:49:52] <claime>	 wtf
[11:50:21] <jynus>	 icinga checks for read only look fine
[11:50:58] <jynus>	 edits are flowing, so it is localized
[11:51:32] <claime>	 ok why are mw2* which are in codfw trying to write?
[11:51:46] <Amir1>	 https://www.irccloud.com/pastebin/nSHr1J7j/
[11:52:04] <Amir1>	 is it ro?
[11:53:04] <Amir1>	 dbctl says it's not ro, everything is rw
[11:53:21] <Amir1>	 all are codfw
[11:54:05] <claime>	 There are no A/P mediawiki services pooled in codfw
[11:54:07] <jynus>	 dbs are ok, so we should not be in a split brain
[11:54:24] <claime>	 so why are edits going there?
[11:54:52] <Amir1>	 it's not all edits, these are read views that are giving ro
[11:55:12] <Amir1>	 some can write to the db but it should reach eqiad master
[11:55:14] <jynus>	 claime: can you check etcd status for primary and active dc for mw?
[11:55:49] <bawolff>	 If it's helpful, users reporting getting read only also report that edits made via API go through
[11:55:57] <jynus>	 maybe something changed there, I am trying to check for causes
[11:56:07] <Amir1>	 from mw point of view, regardless of dc, the master is eqiad's master
[11:56:36] <jynus>	 then what could it be?
[11:56:59] <Amir1>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799685
[11:57:12] <claime>	 cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ confctl --object-type mwconfig select name=ReadOnly get
[11:57:14] <claime>	 {"ReadOnly": {"val": "You can't edit now. This is because of maintenance. Copy and save your text and try again in a few minutes."}, "tags": "scope=codfw"}
[11:57:16] <claime>	 {"ReadOnly": {"val": false}, "tags": "scope=eqiad"}
[11:57:53] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/val=false; selector: name=ReadOnly,scope=codfw
[11:58:19] <claime>	 ok the cookbook doesn't reset the ReadOnly val
[11:58:30] <jynus>	 you think you found it and fixed it?
[11:58:38] <claime>	 Can someone sanity check that false/false is the right status?
[11:58:45] <Amir1>	 this day is getting more interesting by the hour
[11:58:46] <claime>	 for mwconfig
[11:59:02] <Amir1>	 yup fixed
[11:59:07] <Amir1>	 claime: Thanks!
[11:59:25] <claime>	 Please confirm mwconfig ReadOnly false/false is the right state
[11:59:39] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Change the active gitlab replica host to be the eqiad instance [puppet] - 10https://gerrit.wikimedia.org/r/890779 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney)
[11:59:50] <jynus>	 claime: I don't know
[11:59:55] <Amir1>	 me neither
[12:00:02] <jynus>	 but errors seem to have stopped?
[12:00:05] <claime>	 :death:
[12:00:05] <Amir1>	 but the errors are gone
[12:00:27] <claime>	 MasterDatacenter is set to eqiad so we should be ok
[12:00:48] <jynus>	 and dbs are a final protection against split brains
[12:00:53] <jynus>	 which is a good thing
[12:01:00] <claime>	 Adding a big warning to the Switch Datacenter page
[12:01:11] <jynus>	 let me create a new "mini-incident"
[12:01:34] <jynus>	 and confirm with reporters things look good
[12:01:38] <volans>	 claime: was 07-set-readwrite.py not called?
[12:02:17] <claime>	 dude my tmux :D
[12:02:27] <moritzm>	 !log installing NSS security updates
[12:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:37] <claime>	 2023-02-22 11:13:51,074 cgoubert 2919708 [INFO] START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[12:02:38] <claime>	 2023-02-22 11:13:51,076 cgoubert 2919708 [INFO] Set MediaWiki in read-write in eqiad
[12:02:40] <claime>	 2023-02-22 11:13:51,077 cgoubert 2919708 [INFO] Setting val=False for tags: {'scope': 'eqiad', 'name': 'ReadOnly'}
[12:02:44] <claime>	 It's only setting it for DC_TO
[12:02:50] <claime>	 It's not resetting DC_FROM
[12:03:02] <claime>	 I think that's a holdover from before multidc
[12:03:12] <claime>	 And it skipped right under my nose when I checked
[12:03:19] <volans>	 ahhh got it
[12:03:34] <claime>	 I'll fix it
[12:04:10] <Amir1>	 btw, I think we need to check effects of this patch done on multidc before the switchover: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799685/11/wmf-config/etcd.php
[12:04:25] <volans>	 claime: also 02-set-readonly.py should have set_readonly for both
[12:04:28] <volans>	 at this point
[12:04:40] <jynus>	 I am double checking with reporter
[12:05:15] <claime>	 volans: will double-check 02.
[12:05:26] <volans>	 jynus: as for the original failure regarding db1118 the cookbook tries to get the heartbeat for codfw, but it's empty. Could it be because the replication is not yet enabled codfw->eqiad?
[12:05:28] <Amir1>	 to be clear, nothing to do with the grants?
[12:05:35] <wikibugs>	 (03PS2) 10EoghanGaffney: Update DNS to switch gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930)
[12:05:38] <jynus>	 Amir1: false positive on my side
[12:05:49] <jynus>	 although there may be small issues there
[12:06:02] <jynus>	 saw some auth errors, but were unrelated
[12:06:07] <Amir1>	 small issues is the least terrible thing about the grants
[12:06:29] <jynus>	 claime: I got a report from someone who can edit on desktop but is shown read only on mobile
[12:06:46] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Update DNS to switch gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney)
[12:07:30] <Amir1>	 in total 700 errors, it eats a lot of error budget but not really an incident I think
[12:07:56] <claime>	 jynus: I don't know what to do with that information :/
[12:08:22] <jynus>	 claime: basically I don't have confirmation that it is fully fixed
[12:08:38] <jynus>	 I am discussing with someone to check if the issue is still ongoing
[12:08:48] <volans>	 https://config-master.wikimedia.org/mediawiki.yaml
[12:09:13] <claime>	 cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ sudo confctl --object-type mwconfig select name=ReadOnly get
[12:09:15] <claime>	 {"ReadOnly": {"val": "false"}, "tags": "scope=codfw"}
[12:09:17] <claime>	 {"ReadOnly": {"val": false}, "tags": "scope=eqiad"}
[12:09:19] <claime>	 Stale config?
[12:11:01] <Amir1>	 yeah, read-only gets cached to avoid stampede. Maybe an overly aggressive cache? 
[12:11:17] <Amir1>	 mobile reaches the same cluster
[12:11:34] <jynus>	 "commons still locked i see.."
[12:11:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:11:47] <jynus>	 so not a single user, multiple are seeing the issue
[12:12:13] <jynus>	 I am going to update the status page at this point
[12:12:16] <claime>	 How can we reset cache?
[12:12:20] <claime>	 jynus: yes, thank you
[12:12:22] <claime>	 I'm so sorry :(
[12:12:37] <claime>	 I mean cache for that particular value?
[12:12:39] <claime>	 manually
[12:13:00] <volans>	 claime: we have icinga checks to ensure mw siteinfo returns the latest value in etcd
[12:13:03] <volans>	 and those are not firing
[12:13:04] <taavi>	 why is the other false a string and the other a boolean?
[12:13:20] <claime>	 taavi: good catch
[12:13:31] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/val=False; selector: name=ReadOnly,scope=codfw
[12:13:44] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder)
[12:13:48] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/val=no; selector: name=ReadOnly,scope=codfw
[12:13:55] <claime>	 WTF
[12:14:32] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/val=false; selector: name=ReadOnly,scope=codfw
[12:14:41] <claime>	 it doesn't want to set it to a bool
[12:15:13] <claime>	 please advise
[12:15:35] <jynus>	 who in your team knows about this, let's call them
[12:15:35] <akosiaris>	 what commands did you try ? 
[12:15:50] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: Update grants to use wikiuser@10.% only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup)
[12:15:53] <claime>	 sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=false
[12:15:57] <claime>	 sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=False
[12:16:01] <claime>	 sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=no
[12:16:20] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/val=false; selector: name=ReadOnly
[12:16:22] <stashbot>	 akosiaris@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[12:16:37] <claime>	 akosiaris: still a string
[12:16:51] <jynus>	 0?
[12:16:53] <claime>	 I'll run the set rw cookbook with codfw as dc_to
[12:16:57] <akosiaris>	 I went for a string
[12:17:06] <akosiaris>	 it was a bool before
[12:17:26] <akosiaris>	 well, I went for quotes, it was unquoted before
[12:17:30] <wikibugs>	 (03PS1) 10Superpes15: [sysop_itwiki] Change the logo, the favicon, and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891261 (https://phabricator.wikimedia.org/T330279)
[12:17:44] <taavi>	 https://wikitech.wikimedia.org/wiki/MediaWiki_and_EtcdConfig says to use the edit command
[12:17:47] <akosiaris>	 and I do see the quotes now 
[12:18:07] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[12:18:08] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[12:18:11] <logmsgbot>	 !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-02-22 12:18:11.451680
[12:18:11] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[12:18:12] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[12:18:14] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[12:18:36] <akosiaris>	 why is stashbot failing to write? 
[12:18:39] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[12:18:40] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[12:18:43] <volans>	 the cookbook does
[12:18:43] <volans>	 self._conftool.set_and_verify("val", False, scope=datacenter, name="ReadOnly")
[12:18:45] <logmsgbot>	 !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-02-22 12:18:45.829060
[12:18:46] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[12:18:53] <akosiaris>	 it should hit eqiad anyway, shouldn't it ? 
[12:19:03] <claime>	 cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ sudo confctl --object-type mwconfig select name=ReadOnly get
[12:19:05] <claime>	 {"ReadOnly": {"val": false}, "tags": "scope=codfw"}
[12:19:07] <claime>	 {"ReadOnly": {"val": false}, "tags": "scope=eqiad"}
[12:19:09] <claime>	 There
[12:19:19] <claime>	 I had to run it twice with the two orders 
[12:19:41] <akosiaris>	 I don't get a read-only warning 
[12:19:57] <volans>	 claime: running  mediawiki.set_readwrite('codfw') could have been quicker ;)
[12:20:01] <volans>	 *from a spicerack repl
[12:20:07] <claime>	 volans: yeah well I panicked ok :D
[12:20:17] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:20:28] <volans>	 but the question is why it allows setting it to the wrong value
[12:20:30] <akosiaris>	 I am not sure what happened tbh. The issue was we had eqiad set to a string  ?
[12:20:44] <claime>	 akosiaris: codfw was originally not reset to rw
[12:20:45] <akosiaris>	 I thought the errors were about codfw ?
[12:20:59] <claime>	 Then I tried the confctl command to set codfw to ro=false
[12:21:05] <claime>	 It didn't work because string
[12:21:19] <claime>	 Now it's all bool
[12:21:26] <jynus>	 so what's the current status, we think it is fixed?
[12:21:33] <jynus>	 to ask reporters to confirm
[12:21:50] <claime>	 edit works on en.wiki for me
[12:22:03] <claime>	 I can't find my way through other languages unfortunately
[12:22:04] <Amir1>	 this is I think on codfw only
[12:22:14] <Amir1>	 not eqiad
[12:22:23] <Amir1>	 (has been, all errors were mw2xxx)
[12:22:25] <jynus>	 I asked again to see
[12:22:26] <akosiaris>	 !log test
[12:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:31] <claime>	 recentchange is picking back up
[12:22:35] <akosiaris>	 ah stashbot is fine again
[12:22:48] <jynus>	 "yep works" I get now
[12:22:51] <Amir1>	 maybe I'm missing something
[12:22:52] <jynus>	 so seems fixed
[12:23:15] <akosiaris>	 ok, I broke eqiad too apparently
[12:23:20] <odder>	 I had this on plsource, but it's back now
[12:23:43] <akosiaris>	 for a small window of time, around 2m, from 14:16 to 14:18 
[12:23:44] <jynus>	 akosiaris:  claime: I may need help with the timeline, I am quite confused
[12:23:48] <claime>	 akosiaris: yeah, you set it to a string because no selector and confctl doing bs
[12:23:54] <akosiaris>	 you are not the only one jynus
[12:23:59] <jynus>	 he he
[12:24:06] <claime>	 yeah I have it
[12:24:07] <akosiaris>	 I just figured out what happened 
[12:24:20] <akosiaris>	 or at least pieced enough pieces together
[12:24:37] <jynus>	 akosiaris: https://docs.google.com/document/d/12QY-N1oXRwY4tPHO0fwrvf2osvZnr-2Vjfl_3pAOjE4
[12:24:54] <taavi>	 I suspect MW treats 'false' inconsistently, which explains why I could not reproduce it initially at least
[12:24:58] <akosiaris>	 so, set val=value in confctl doesn't set a bool but sets a string, interesting
[12:25:17] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:25:22] <claime>	 akosiaris: yeah and that's... bad
[12:26:22] <volans>	 akosiaris: shouldn't it adhere to the db_readonly.schema?
[12:26:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:27:40] <jynus>	 I see, the real issue was the old cookbook assuming that setting codfw read only was ok
[12:27:55] <jynus>	 at least the initial trigger, right?
[12:28:10] <jynus>	 and later confd weirdness?
[12:28:49] <akosiaris>	 volans: same question from my side 
[12:29:22] <jynus>	  setting status page to resolved unless someone disagrees
[12:29:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond) @ayounsi i have took another look at this.  from the steps in the document above i have now configured   * Register the Juniper API gateway app in the Customer/Partner's IdP....
[12:30:22] <jynus>	 doing now
[12:31:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:31:58] <volans>	 akosiaris: here's the culprit
[12:31:58] <volans>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/files/conftool/json-schema/mediawiki-config/db_readonly.schema
[12:32:06] <moritzm>	 !log installing openssl security updates on buster
[12:32:08] <claime>	 jynus: Exactly that, the timeline I constructed in the doc should be explicit
[12:32:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:20] <volans>	 so it can be either a string or boolean false
[12:32:38] <jynus>	 so too main issue, an unexpected (old) change, and an issue preventing us from manually solving it
[12:32:42] <jynus>	 *2
[12:32:43] <volans>	 not sure if confctl CLI allows setting all of those values
[12:33:22] <claime>	 I haven't managed to use confctl to set it to a boolean
[12:36:39] <volans>	 claime: I think 'edit' would work
[12:36:45] <volans>	 judging from https://sal.toolforge.org/production?p=0&q=%22name%3DReadOnly%22&d=
[12:37:01] <claime>	 volans: wasn't aware of edit, will add it to my snippets
[12:37:17] <claime>	 Or at least it didn't come to mind in the heat of the moment
[12:37:59] <claime>	 so, 02-set-readonly does set both datacenters RO (except in live-test where it doesn't touch dc_to)
[12:38:28] <claime>	 But 07-set-readwrite doesn't revert it for dc_from
[12:38:40] <claime>	 That's my conclusion for root cause
[12:39:07] <taavi>	 I think the string option is for a reason
[12:39:07] <wikibugs>	 10SRE, 10Security-Team, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (10Bawolff) a:05Bawolff→03None Huh. Guess this got deployed to `/wikipedia/(el|fr|ru|it|de|uk|ja|id|he|fi|zh|test)` but never everywh...
[12:39:14] <volans>	 yes, because we were going RW/RO -> RO/RO -> RO/RW
[12:39:52] <Amir1>	 yeah, the way mw handles it is either the value is falsey or it's the read only reason (polymorphic variable)
[12:39:55] <volans>	 now is RW/RW -> RO/RO -> RW/RW | RO/RW based on how you want eqiad to be after the switch
[12:40:02] <Amir1>	 falsey means it's not RO
[12:40:17] <claime>	 volans: well we can't keep eqiad RO in mediawiki terms apparently
[12:40:24] <Amir1>	 which is not great but hey I didn't design it
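Editor's sketch, not part of the channel log: a minimal Python illustration of the "falsey or reason string" behaviour Amir1 describes above. The helper name `read_only_reason` is hypothetical, and Python truthiness stands in for MediaWiki's own check, which is an assumption. Under that assumption it shows why a quoted "false" could behave differently from boolean false.

```python
from typing import Optional, Union


def read_only_reason(val: Union[bool, str, None]) -> Optional[str]:
    """Return the read-only reason, or None when the wiki is writable.

    Hypothetical helper (not MediaWiki code): any falsey value means
    "not read-only"; anything else is taken as the reason shown to users.
    """
    if not val:
        return None
    return str(val)


# Boolean false is falsey, so the wiki counts as writable.
print(read_only_reason(False))
# The string "false" is non-empty, hence truthy, so it is taken as a reason.
print(read_only_reason("false"))
```

If the real check works this way, setting the value to the string "false" would leave the data centre read-only with "false" as the message, which would fit the behaviour seen before the value became a real boolean again.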
[12:40:35] <claime>	 Since here leaving codfw in that state caused user-facing issues
[12:40:56] <volans>	 so it's necessarily RW/RW -> RO/RO -> RW/RW
[12:40:56] <moritzm>	 !log rolling restart of FPM on mw canaries
[12:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:06] <claime>	 volans: yep, and only the dbs are RO
[12:41:12] <claime>	 left RO*
[12:41:37] <volans>	 that should raise the question of whether ReadOnly should no longer be per-dc
[12:41:50] <volans>	 but a central value for mediawiki (no, not for next week ;) )
[12:42:01] <claime>	 Yeah, I feel like it should be a global setting in the future
[12:42:30] <claime>	 I don't see a use in having even the passive DC set RO from a mw point of view if it causes edit outages for users
[12:42:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] dnsquery: Add dnsquery module to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/890476 (owner: 10Jbond)
[12:42:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::dns_lookup: switch to dnsquery::lookup [puppet] - 10https://gerrit.wikimedia.org/r/890477 (owner: 10Jbond)
[12:42:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apereo_cas: Add missing docs and fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/890483 (owner: 10Jbond)
[12:42:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] apereo_cas: update to use dnsquery functions for lookups [puppet] - 10https://gerrit.wikimedia.org/r/890484 (owner: 10Jbond)
[12:42:59] <volans>	 yeah
[12:44:32] <wikibugs>	 10SRE: Improve process to add/update keys for pwstore repo - https://phabricator.wikimedia.org/T262393 (10MoritzMuehlenhoff) 05Open→03Resolved This has been resolved for quite a while: The PGP keys are now stored within the repo itself, current docs are at https://office.wikimedia.org/wiki/Pwstore
[12:44:35] <Amir1>	 the only possible explanation to me is that the gods of MediaWiki are angry at SREs and require a human sacrifice 
[12:44:55] <Amir1>	 otherwise I can't explain this many incidents in a day
[12:46:10] <claime>	 I really thought vgutierre.z had made the necessary blood sacrifice
[12:46:30] <claime>	 My understanding of the arcane mediawiki magic still leaves much to be desired
[12:46:36] <jynus>	 Amir1: let me know if the new wording is ok for you
[12:46:50] <wikibugs>	 10SRE: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (10MoritzMuehlenhoff) 05Open→03Resolved bpfcc-tools is available for one-off debuging on all servers (it's in debian startig with Bullseye and for Buster it can be installed from buster-backports). So I think this task a...
[12:47:09] <Amir1>	 jynus: thanks looks good
[12:47:14] <claime>	 Although I still don't understand why some writes go to codfw
[12:47:34] <Amir1>	 and also IIRC, we have some cookies being sent so that if the user has edited, they'd be routed to eqiad for a while
[12:48:04] <jynus>	 what I mean is some users edited, saw "we are in read only, stop editing"
[12:48:10] <jynus>	 and they did
[12:48:22] <claime>	 Apparently the mobile button for edits was unclickable
[12:48:23] <jynus>	 so those are not accounted in errors
[12:48:31] <jynus>	 ^that
[12:48:42] <claime>	 Which, wth
[12:48:57] <claime>	 I gotta step out and breathe for a minute if that's ok with y'all
[12:49:03] <jynus>	 that could be an actionable, although I am unsure of which
[12:49:19] <Amir1>	 claime: two reasons: 1- Still some GETs do write, so they can end up in codfw, they'd be just slow 2- Some dbs are writable in codfw (x2 for some ways of caching, PC) 3- the error is just saying "this wiki is read-only", e.g. it shows it to you when you attempt to edit, which is a get in codfw
[12:49:26] <Amir1>	 I need to learn how to count
[12:49:56] <claime>	 Amir1: Thanks, I understand now
[12:50:03] <claime>	 also off by one errors
[12:50:06] <Amir1>	 I think the last one is the biggest culprit
[12:50:13] <wikibugs>	 (03Abandoned) 10Jbond: Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond)
[12:51:18] <wikibugs>	 10SRE, 10Observability-Logging, 10Traffic: varnish-frontend-fetcherr sets incorrect level in logstash - https://phabricator.wikimedia.org/T330267 (10fgiunchedi) I took a quick look at this and found the following:  * the logger program seems to be `modules/varnish/files/varnishfetcherr.py` ran by `modules/va...
[12:51:52] <wikibugs>	 (03PS1) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[12:52:18] <wikibugs>	 10Puppet, 10Infrastructure Security, 10Infrastructure-Foundations, 10User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10jbond)
[12:53:23] <wikibugs>	 (03PS2) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[12:54:21] <wikibugs>	 (03PS3) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[12:56:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[12:57:05] <volans>	 jynus: do you have an answer for my previous question a while ago? (regarding the cookbook failure for db1118)
[12:57:19] <jynus>	 sorry, it got lost with the incident
[12:57:20] <jynus>	 checking
[12:58:10] <volans>	 jynus: reposting: """as for the original failure regarding db1118 the cookbook tries to get the heartbeat for codfw, but it's empty. Could it be because the replication is not yet enabled codfw->eqiad?"""
[12:58:12] <jynus>	 volans: yes and no- circular replication AFAIAA, is not enabled
[12:58:23] <jynus>	 but heartbeat should be running anyway
[12:58:31] <jynus>	 but maybe it changed since orchestrator was enabled
[12:58:37] <jynus>	 that would be new
[12:58:46] <volans>	 I queried all s1 in eqiad and they have heartbeat only for 'eqiad' not for 'codfw'
[12:58:46] <jynus>	 and if it did, it would require more changes
[12:58:57] <jynus>	 yes, that is expected at the moment
[12:58:59] <volans>	 while hosts in codfw have it for both
[12:59:03] <volans>	 that's why it failed
[12:59:05] <jynus>	 ah
[12:59:12] <jynus>	 then it is just the circular replication
[12:59:16] <volans>	 SELECT ts FROM heartbeat.heartbeat WHERE datacenter = 'codfw' and shard = 's1' ORDER BY ts DESC LIMIT 1;
[12:59:20] <wikibugs>	 (03Abandoned) 10Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - 10https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: 10Ladsgroup)
[12:59:21] <volans>	 just returned empty
[12:59:29] <jynus>	 it will work when circular is enabled
[12:59:38] <jynus>	 but then it failed on all hosts, right?
[12:59:40] <jynus>	 not just s1
[13:00:10] <jynus>	 or would have failed is what I mean
[13:00:13] <volans>	 it stopped there because it failed
[13:00:22] <volans>	 yes I think it would have failed on all eqiad hosts
[13:00:27] <jynus>	 yeah, so the test has to be done with circular replication enabled
[13:00:32] <volans>	 yep
[13:00:38] <jynus>	 which is ok, it means the check works!
[13:00:46] <jynus>	 but that will be done when manuel returns
[13:00:48] <Amir1>	 I don't think orch replaced/changed anything related to heartbeat
[13:01:08] <jynus>	 yeah, I knew leftovers beats had been removed
[13:01:19] <jynus>	 but maybe also codfw master ones
[13:01:31] <jynus>	 but it wasn't, so it is just circular
[13:01:50] <jynus>	 volans: on my calendar I have 23- no more maintenance
[13:02:07] <jynus>	 27th enable codfw->eqiad replication
[13:02:11] <Amir1>	 jouncebot: nowandnext
[13:02:11] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[13:02:11] <jouncebot>	 In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1400)
[13:02:17] <Amir1>	 can I deploy stuff?
[13:02:38] <jynus>	 so maybe retry on monday?
[13:03:08] <volans>	 claime: ^^^
[13:03:14] <volans>	 ack thx
[13:03:31] <jynus>	 not a big deal because in an emergency, we don't care about checks, we just switch in whatever state we have
[13:04:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10IPv6, 10User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) a:05jbond→03None
[13:04:11] <jynus>	 that's a check to make sure eqiad is working after failover
[13:04:37] <wikibugs>	 (03PS3) 10Ladsgroup: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932)
[13:04:46] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup)
[13:04:51] <wikibugs>	 10SRE, 10Observability-Logging, 10Security, 10User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (10jbond) a:05jbond→03None
[13:05:15] * volans grabbing some lunch
[13:05:24] <wikibugs>	 (03Merged) 10jenkins-bot: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup)
[13:08:19] <wikibugs>	 (03CR) 10Nicolas Fraison: Add a spark-operator chart and helmfile configuration (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[13:12:19] <wikibugs>	 (03CR) 10Muehlenhoff: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[13:16:35] <wikibugs>	 10SRE, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) a:05fgiunchedi→03None
[13:16:45] <wikibugs>	 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) a:05fgiunchedi→03None
[13:16:57] <wikibugs>	 10SRE, 10Observability-Alerting: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) a:05fgiunchedi→03None
[13:17:21] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Some issues inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert)
[13:18:10] <wikibugs>	 (03CR) 10Nicolas Fraison: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[13:21:34] <Amir1>	 https://www.irccloud.com/pastebin/OhErE4ol/
[13:21:44] <Amir1>	 should I worry?
[13:29:36] <Amir1>	 scap is realllllllly slow
[13:29:52] <Amir1>	 stuck in rebuilding the mw docker images it seems
[13:31:33] <Amir1>	 akosiaris: maybe you know? P44731
[13:31:39] <Amir1>	 https://phabricator.wikimedia.org/P44731
[13:34:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx on durum [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991)
[13:35:54] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:38:11] <Amir1>	 almost forty minutes now...
[13:40:50] <wikibugs>	 (03PS1) 10Jbond: P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287
[13:41:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287 (owner: 10Jbond)
[13:43:49] <wikibugs>	 (03PS2) 10Jbond: P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287
[13:43:56] <kart_>	 Amir1: ouch. We have a backport deployment in around 17 minutes :/
[13:44:49] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:45:05] <Amir1>	 it probably needs to be cancelled to be honest
[13:47:08] <wikibugs>	 (03CR) 10Jbond: P:idm move OIDC endpoint to variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede)
[13:49:31] <wikibugs>	 (03PS3) 10Jbond: P:idp: use production variable [puppet] - 10https://gerrit.wikimedia.org/r/891287
[13:49:54] <kart_>	 :/
[13:51:04] <claime>	 Amir1: Looks like capacity issues in codfw again
[13:51:38] <claime>	 I'll send a CR for even fewer replicas in codfw for mw-*
[13:52:05] <wikibugs>	 (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/output/891287/39786/" [puppet] - 10https://gerrit.wikimedia.org/r/891287 (owner: 10Jbond)
[13:57:59] <wikibugs>	 (03CR) 10Jbond: "lgtm pending CI and the comment from Moritz" [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[14:00:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1400). nyaa~
[14:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:20] * Lucas_WMDE can’t deploy
[14:00:27] <TheresNoTime>	 I can deploy :)
[14:00:50] <Lucas_WMDE>	 (also I think Amir1’s scap might still be running)
[14:01:04] <TheresNoTime>	 ack
[14:01:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:59] <Amir1>	 it is running
[14:02:09] <Amir1>	 so marked the hour
[14:02:19] <Amir1>	 I have two more syncs 
[14:02:24] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:02:30] <TheresNoTime>	 Amir1: no problem, let me know when I can proceed
[14:02:32] <wikibugs>	 (03PS10) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783
[14:02:32] <Amir1>	 otherwise everything breaks
[14:02:52] <Amir1>	 TheresNoTime: currently that means two hours from now but I hope it gets fixed by then
[14:02:59] <TheresNoTime>	 ah!
[14:03:34] <TheresNoTime>	 kart_: it doesn't look like your patches are going to be deployed — can they be rescheduled?
[14:03:52] <kart_>	 TheresNoTime: Let me withdraw/reschedule it later or tomorrow.
[14:03:53] <claime>	 Amir1: Is it still stuck on helmfile ?
[14:03:58] <Amir1>	 yup
[14:04:09] <Amir1>	 last line: 13:55:24 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 01m 56s)
[14:04:17] <Amir1>	 ten minutes on it already
[14:04:31] <TheresNoTime>	 kart_: sure thing
[14:05:39] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release mw-api-int/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:06:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:06:26] <TheresNoTime>	 !log UTC afternoon backport window not done due to in-progress deployment
[14:06:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:42] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s,thumbor: reduce codfw replicas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891277 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert)
[14:07:31] <Amir1>	 claime: I'm a tiny bit confused, https://phabricator.wikimedia.org/T330048 is resolved and it's green
[14:07:36] <Amir1>	 (in icinga)
[14:08:01] <claime>	 Amir1: akosiaris is deploying the new nodes
[14:08:24] <claime>	 Which were held up because of that task + some nodes couldn't be updated and pooled
[14:08:44] <akosiaris>	 !log test network connectivity of kubernetes20{17,18,19,21}
[14:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:59] <wikibugs>	 (03CR) 10David Caro: "Oh, I forgot to "git add" 😮" [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro)
[14:09:06] <claime>	 I'll destroy/recreate the failed release for mw-api-int
[14:09:46] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[14:09:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[14:10:08] <akosiaris>	 wmfdebug ping containers have been created
[14:10:22] <akosiaris>	 and all of them completed successfully
[14:10:29] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync
[14:10:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync
[14:10:41] <akosiaris>	 I need to create a job that does this in a nice mesh way and reports it somewhere
[14:10:45] <akosiaris>	 I think it actually already exists
[14:10:51] <akosiaris>	 anyway, uncordoning hosts
[14:10:54] <wikibugs>	 (03PS2) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881
[14:11:19] <akosiaris>	 !log uncordon kubernetes20{17,18,19,21} T330048
[14:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:23] <stashbot>	 T330048: asw-a-codfw management interface unreachable - https://phabricator.wikimedia.org/T330048
[14:11:49] <akosiaris>	 done
[14:13:00] <Amir1>	 it's now syncing apaches
[14:13:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro)
[14:13:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond)
[14:14:10] <wikibugs>	 (03CR) 10Volans: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[14:14:13] <Amir1>	 I need to do the next sync and then we can see if it's fixed. I can wait until the new nodes are in place
[14:15:03] <claime>	 Amir1: the new nodes are uncordoned, you should be good
[14:15:20] <claime>	 We'll wait until all backports are done to scale back up
[14:15:30] <claime>	 That way you shouldn't hit capacity issues
[14:15:34] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/core-Permissions.php: Move all of userrights config out of IS.php to a dedicated file, part I (T308932) (duration: 68m 38s)
[14:15:38] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[14:15:39] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release mw-api-int/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:15:45] <Amir1>	 duration: 68m 38s
[14:16:48] <Amir1>	 the next one is starting
[14:17:01] <Amir1>	 on 14:16:42 Started Running helmfile -e codfw --selector name=pinkunicorn apply in /srv/deployment-charts/helmfile.d/services/mw-debug
[14:18:02] <claime>	 STATUS: deployed
[14:18:04] <claime>	 REVISION: 14
[14:18:08] <Amir1>	 running helmfile is fast
[14:18:20] <claime>	 Yeah, when it can schedule its pods properly, it is
[14:20:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:21:49] <Amir1>	 this alert doesn't look very informative ^
[14:21:53] <wikibugs>	 (03Abandoned) 10Clément Goubert: mw-on-k8s,thumbor: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/891277 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert)
[14:22:54] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Revert "mw-on-k8s: reduce codfw replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048)
[14:23:08] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move all of userrights config out of IS.php to a dedicated file, part II (T308932) (duration: 07m 01s)
[14:23:13] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[14:23:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Don't merge yet, we need first confirmation" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: 10Alexandros Kosiaris)
[14:23:16] <claime>	 That was faster :D
[14:23:22] <godog>	 Amir1: it does not, I suspect it is related to the grafana 9 upgrade
[14:24:25] <Amir1>	 akosiaris: I think it's fixed now
[14:24:31] <Amir1>	 7 minutes
[14:24:49] <akosiaris>	 \o/
[14:25:46] <wikibugs>	 (03PS3) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881
[14:26:17] <Amir1>	 godog: :D want me to file a ticket?
[14:26:23] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@1a041e2] (releasing): (no justification provided)
[14:26:47] <godog>	 Amir1: thank you, I think updating T317887 should be enough, I see Peter reported the same
[14:26:48] <stashbot>	 T317887: Upgrade to Grafana 9 - https://phabricator.wikimedia.org/T317887
[14:27:13] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@1a041e2] (releasing): (no justification provided) (duration: 00m 49s)
[14:29:38] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:03] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move all of userrights config out of IS.php to a dedicated file, part III (T308932) (duration: 06m 16s)
[14:30:07] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[14:30:26] <wikibugs>	 (03PS2) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - 10https://gerrit.wikimedia.org/r/891266
[14:30:45] <Amir1>	 I'm done
[14:31:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:32:15] <wikibugs>	 (03CR) 10Jbond: redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond)
[14:32:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:33:18] <wikibugs>	 (03PS1) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261)
[14:33:47] <wikibugs>	 (03CR) 10Jbond: systemd::timer: update services to onshot and set RemainAfterExit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond)
[14:33:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] systemd::timer: update services to onshot and set RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/890843 (owner: 10Jbond)
[14:34:21] <wikibugs>	 (03PS2) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261)
[14:34:49] <Amir1>	 TheresNoTime: kart_ Lucas_WMDE feel free to deploy
[14:37:41] <wikibugs>	 (03CR) 10Ottomata: Add a postgresql database and user for airflow_search_platform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis)
[14:38:09] <sukhe>	 some DNS and BGP (in cr*-ulsfo) incoming; expected 
[14:38:18] <sukhe>	 *alerts
[14:38:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bullseye
[14:38:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye
[14:39:29] <Lucas_WMDE>	 okay, I could deploy now if you want kart_ 
[14:41:57] <wikibugs>	 (03Abandoned) 10Ladsgroup: Rework DNS entries of wikis in wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: 10Ladsgroup)
[14:42:16] <icinga-wm>	 PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: PING CRITICAL - Packet loss = 100%
[14:42:38] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-Needs-Improvement: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Ladsgroup) a:05Ladsgroup→03None
[14:43:00] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:43:18] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:43:24] <sukhe>	 ^ expected
[14:43:44] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:43:48] <wikibugs>	 (03CR) 10Elukey: ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou)
[14:43:48] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:14] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: eventstreams-internal: Switch to the new way of defining service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/891283
[14:44:41] <wikibugs>	 (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/891266 (owner: 10Clément Goubert)
[14:44:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/891252 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou)
[14:46:26] <icinga-wm>	 PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[14:46:29] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:46:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:47:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm left some optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/890881 (owner: 10David Caro)
[14:47:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:48:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:49:07] <wikibugs>	 (03PS4) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[14:49:18] <wikibugs>	 (03CR) 10Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[14:49:56] <wikibugs>	 (03CR) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[14:50:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams-internal: Switch to the new way of defining service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/891283 (owner: 10Alexandros Kosiaris)
[14:51:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[14:51:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for nginx on durum [puppet] - 10https://gerrit.wikimedia.org/r/891271 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:53:06] <wikibugs>	 (03PS5) 10Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[14:53:37] <wikibugs>	 (03PS5) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750)
[14:55:20] <wikibugs>	 (03PS1) 10Elukey: admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261)
[14:56:48] <icinga-wm>	 RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 71.81 ms
[14:56:53] <wikibugs>	 (03Merged) 10jenkins-bot: eventstreams-internal: Switch to the new way of defining service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/891283 (owner: 10Alexandros Kosiaris)
[14:57:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[14:57:53] <wikibugs>	 (03PS1) 10Urbanecm: growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444)
[14:58:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm)
[14:59:25] <wikibugs>	 (03PS2) 10Urbanecm: growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444)
[15:00:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[15:00:40] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[15:00:43] <kart_>	 Lucas_WMDE: Sorry, rescheduled it for tomorrow already. Stepped out too :/
[15:01:00] <kart_>	 Next window is too late for me.
[15:01:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:50] <wikibugs>	 (03PS1) 10Slyngshede: P:IDM Ensure that social auth can lookup username. [puppet] - 10https://gerrit.wikimedia.org/r/891307
[15:01:56] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444)
[15:02:07] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[15:02:34] <Lucas_WMDE>	 ok sure
[15:02:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks for this (and for cleaning up my mess). FWIW, this appears cleaner, more maintainable and effectively achieves the same goal as far " [puppet] - 10https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: 10Ladsgroup)
[15:02:55] <wikibugs>	 (03CR) 10Muehlenhoff: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: 10Vgutierrez)
[15:03:50] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "These look correct, but I'm wondering a bit about the value of having separate prometheus servers in PAWS anymore. The primary reason for " [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook)
[15:04:04] <wikibugs>	 (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: 10Urbanecm)
[15:04:13] <urbanecm>	 jouncebot: nowandnext
[15:04:13] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 55 minute(s)
[15:04:13] <jouncebot>	 In 2 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1800)
[15:04:25] <wikibugs>	 (03PS7) 10Urbanecm: [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231)
[15:04:30] <wikibugs>	 (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm)
[15:05:07] <wikibugs>	 (03CR) 10Majavah: [tox] Make running `tox` work (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm)
[15:06:13] <wikibugs>	 (03PS8) 10Urbanecm: [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231)
[15:06:28] <urbanecm>	 taavi: was going to merge it :). thanks for catching.
[15:06:43] <wikibugs>	 (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm)
[15:08:05] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync
[15:08:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync
[15:08:27] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync
[15:08:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync
[15:09:09] <wikibugs>	 (03CR) 10Vivian Rook: Update dns for paws prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook)
[15:09:47] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync
[15:10:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync
[15:11:17] <wikibugs>	 (03PS6) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750)
[15:11:32] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] Update dns for paws prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) (owner: 10Vivian Rook)
[15:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (3) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:54] <icinga-wm>	 RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[15:13:11] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform Value Stream: > ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10CDanis)
[15:13:12] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[15:13:18] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:13:35] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert)
[15:13:58] <wikibugs>	 (CR) Jbond: [C: +2] P:sre::check_user: add support for namely API [puppet] - https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: Jbond)
[15:13:59] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert) p:Triage→High
[15:14:06] <wikibugs>	 (PS3) Clément Goubert: sre.switchdc.mediawiki: Set both datacenters to rw [cookbooks] - https://gerrit.wikimedia.org/r/891266 (https://phabricator.wikimedia.org/T330300)
[15:14:47] <wikibugs>	 (PS7) Jbond: P:sre::check_user: add support for namely API [puppet] - https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750)
[15:15:04] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (taavi)
[15:15:11] <wikibugs>	 (CR) Jbond: P:sre::check_user: add support for namely API (2 comments) [puppet] - https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: Jbond)
[15:15:20] <wikibugs>	 (PS6) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[15:15:24] <wikibugs>	 (PS1) Muehlenhoff: Adjust monitoring for KDC processes if worker threads are in use [puppet] - https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831)
[15:15:29] <wikibugs>	 (CR) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: Vgutierrez)
[15:15:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:16:50] <wikibugs>	 (CR) Jbond: [C: +2] P:sre::check_user: add support for namely API [puppet] - https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: Jbond)
[15:17:58] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (Clement_Goubert)
[15:18:26] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (Clement_Goubert) p:Triage→High
[15:19:30] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:19:52] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:20:11] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert)
[15:20:14] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:20:17] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert)
[15:20:18] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:20:27] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert) Stalled→In progress
[15:20:32] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[15:21:25] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[15:21:33] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[15:22:18] <wikibugs>	 (PS1) Jbond: sre:check_user: make namley_api_key optional [puppet] - https://gerrit.wikimedia.org/r/891311
[15:23:16] <wikibugs>	 (PS4) David Caro: profile.cloudceph: Add some tests [puppet] - https://gerrit.wikimedia.org/r/890881
[15:23:19] <wikibugs>	 (CR) David Caro: profile.cloudceph: Add some tests (2 comments) [puppet] - https://gerrit.wikimedia.org/r/890881 (owner: David Caro)
[15:24:39] <wikibugs>	 (PS2) Jbond: sre:check_user: make namley_api_key optional [puppet] - https://gerrit.wikimedia.org/r/891311
[15:24:46] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one final remark/warning (feel free to ignore!)" [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: Vgutierrez)
[15:25:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[15:25:12] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Clement_Goubert)
[15:25:29] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Clement_Goubert) p:Triage→Medium
[15:25:52] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39788/console" [puppet] - https://gerrit.wikimedia.org/r/891311 (owner: Jbond)
[15:26:08] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:27:57] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: Muehlenhoff)
[15:29:23] <wikibugs>	 (PS3) Jbond: sre:check_user: make namley_api_key optional [puppet] - https://gerrit.wikimedia.org/r/891311
[15:29:25] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, Patch-Needs-Improvement: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (MdsShakil)
[15:30:01] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/890881 (owner: David Caro)
[15:30:31] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39789/console" [puppet] - https://gerrit.wikimedia.org/r/891311 (owner: Jbond)
[15:30:39] <moritzm>	 !log update mwdebug2002 to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2  T323358
[15:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:51] <wikibugs>	 (PS7) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272)
[15:31:10] <wikibugs>	 (CR) Vgutierrez: sre.cdn.roll-upgrade-haproxy: Add cookbook to upgrade HAProxy (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: Vgutierrez)
[15:31:18] <wikibugs>	 (PS2) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/781055 (https://phabricator.wikimedia.org/T306223)
[15:31:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[15:31:51] <wikibugs>	 (PS4) Jbond: sre:check_user: make namley_api_key optional [puppet] - https://gerrit.wikimedia.org/r/891311
[15:31:57] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Ship it :-)" [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: Vgutierrez)
[15:33:03] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39790/console" [puppet] - https://gerrit.wikimedia.org/r/891311 (owner: Jbond)
[15:33:33] <wikibugs>	 (CR) Jbond: [C: -1] "lgtm but missing dollar" [puppet] - https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: Muehlenhoff)
[15:33:44] <wikibugs>	 (PS1) Ssingh: prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309)
[15:33:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:33:49] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[15:34:05] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[15:34:22] <dcausse>	 these wdqs/wcqs updater alerts are expected
[15:35:46] <wikibugs>	 (PS1) Krinkle: Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - https://gerrit.wikimedia.org/r/891314
[15:35:46] <claime>	 jouncebot: nowandnext
[15:35:46] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 24 minute(s)
[15:35:46] <jouncebot>	 In 2 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1800)
[15:35:49] <wikibugs>	 SRE, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF) Open→Resolved I'll mark this as Resolved. No-one wants to confess to knowing what the remain...
[15:36:19] <wikibugs>	 (CR) Clément Goubert: [C: +1] "I think we're good now that deploys have been running as usual." [deployment-charts] - https://gerrit.wikimedia.org/r/891288 (https://phabricator.wikimedia.org/T330048) (owner: Alexandros Kosiaris)
[15:36:49] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39791/console" [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[15:37:09] <wikibugs>	 (CR) DCausse: [C: +1] rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) (owner: Bking)
[15:37:22] <wikibugs>	 (CR) Bking: [C: +2] rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) (owner: Bking)
[15:38:42] <wikibugs>	 (PS2) Ssingh: prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309)
[15:40:21] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39792/console" [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[15:40:55] <wikibugs>	 (PS5) Jbond: sre:check_user: make namley_api_key optional [puppet] - https://gerrit.wikimedia.org/r/891311
[15:41:57] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39793/console" [puppet] - https://gerrit.wikimedia.org/r/891311 (owner: Jbond)
[15:42:01] <wikibugs>	 (PS2) Muehlenhoff: Adjust monitoring for KDC processes if worker threads are in use [puppet] - https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831)
[15:42:54] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) (owner: Bking)
[15:43:00] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] sre:check_user: make namley_api_key optional [puppet] - https://gerrit.wikimedia.org/r/891311 (owner: Jbond)
[15:45:37] <wikibugs>	 (PS1) Volans: apt: add new module with new AptGetHosts class [software/spicerack] - https://gerrit.wikimedia.org/r/891315
[15:45:45] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: Muehlenhoff)
[15:45:48] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[15:45:51] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[15:47:04] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:47:17] <wikibugs>	 (CR) Urbanecm: [C: +2] [tox] Make running `tox` work [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[15:47:55] <wikibugs>	 (Merged) jenkins-bot: [tox] Make running `tox` work [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[15:50:21] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887830|[tox] Make running `tox` work (T329231)]]
[15:50:25] <stashbot>	 T329231: CI should ensure that wmf-config/logos.php matches logos/config.yaml - https://phabricator.wikimedia.org/T329231
[15:50:34] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/891267 (https://phabricator.wikimedia.org/T330272) (owner: Vgutierrez)
[15:50:34] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:52:08] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:887830|[tox] Make running `tox` work (T329231)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[15:52:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:54:32] <wikibugs>	 (PS1) Muehlenhoff: idm::jobs: Adapt auto restart to only run of idm-rq is active/present [puppet] - https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797)
[15:55:19] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (MdsShakil) >>! In T152882#8637729, @gerritbot wrote: > Change 891289 **abandoned** by MdsShakil: > %%%[operations/dns@master] add missing mobile domain for ombu...
[15:56:30] <wikibugs>	 (PS1) Alexandros Kosiaris: eventstreams-internal: Fix public_port [deployment-charts] - https://gerrit.wikimedia.org/r/891319
[15:56:34] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:57:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[15:57:54] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] Exclude traindev from tests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/888227 (owner: Clément Goubert)
[15:58:16] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887830|[tox] Make running `tox` work (T329231)]] (duration: 07m 54s)
[15:58:19] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[15:58:20] <stashbot>	 T329231: CI should ensure that wmf-config/logos.php matches logos/config.yaml - https://phabricator.wikimedia.org/T329231
[15:59:39] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, idea LGTM though" [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[15:59:51] <wikibugs>	 (CR) Volans: [C: +1] "Works as it is, couple of possible improvements inline" [cookbooks] - https://gerrit.wikimedia.org/r/891266 (https://phabricator.wikimedia.org/T330300) (owner: Clément Goubert)
[16:00:09] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn) I wonder if the traffic team sees any blocker here to just add the missing .m. names to DNS.
[16:00:19] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[16:00:39] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797) (owner: Muehlenhoff)
[16:01:34] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:02:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[16:04:44] <wikibugs>	 (CR) Muehlenhoff: prometheus: update ensure_packages for node_gdnsd (bullseye) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:06:57] <wikibugs>	 (Abandoned) Ilias Sarantopoulos: ml-services: upgrade revertrisk staging to debian bullseye [deployment-charts] - https://gerrit.wikimedia.org/r/890401 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[16:08:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[16:08:29] <wikibugs>	 (CR) Ssingh: [V: +1] prometheus: update ensure_packages for node_gdnsd (bullseye) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:09:20] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[16:09:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[16:10:22] <wikibugs>	 (PS3) Ssingh: prometheus: update ensure_packages for node_gdnsd (bullseye) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309)
[16:10:30] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-webserver@search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:21] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:11:28] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39795/console" [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:11:36] <wikibugs>	 (CR) Elukey: [C: -1] "Forgot to add the new intermediate pkis, going to do it :)" [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[16:12:07] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/891307 (owner: Slyngshede)
[16:12:33] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] eventstreams-internal: Fix public_port [deployment-charts] - https://gerrit.wikimedia.org/r/891319 (owner: Alexandros Kosiaris)
[16:14:18] <wikibugs>	 (PS1) Ssingh: prometheus: refactor prometheus-gdnsd-stats.py to Python 3 [puppet] - https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309)
[16:14:22] <wikibugs>	 (PS3) Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261)
[16:14:24] <wikibugs>	 (PS1) Elukey: profile::pki::root_ca: add new pkis for the DSE k8s cluster [puppet] - https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261)
[16:15:17] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39796/console" [puppet] - https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:17:35] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:17:43] <wikibugs>	 (Merged) jenkins-bot: eventstreams-internal: Fix public_port [deployment-charts] - https://gerrit.wikimedia.org/r/891319 (owner: Alexandros Kosiaris)
[16:21:43] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@956dd11]: zuul: Link to report_url if available
[16:21:59] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@956dd11]: zuul: Link to report_url if available (duration: 00m 15s)
[16:22:59] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@b32e023]: doc: Add GrowthExperiments to MediaWiki components - T329034
[16:23:03] <stashbot>	 T329034: Publish frontend docs to doc.wikimedia.org/ - https://phabricator.wikimedia.org/T329034
[16:23:07] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@b32e023]: doc: Add GrowthExperiments to MediaWiki components - T329034 (duration: 00m 07s)
[16:26:48] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Decide which cookbooks using icinga_hosts.wait_for_optimal() should use skip_acked=True - https://phabricator.wikimedia.org/T330136 (MoritzMuehlenhoff) I ran into issues with cookbooks running the SREBatchRunnerBase class, the ganeti.reboot-vm and ganeti.reboot-sing...
[16:29:17] <wikibugs>	 (PS1) Jbond: wmf-update-known-hosts-production: handle multiple algorithems [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/891326
[16:34:59] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/891310 (https://phabricator.wikimedia.org/T329831) (owner: Muehlenhoff)
[16:35:39] <dcausse>	 Lucas_WMDE: o/ was looking at https://grafana-rw.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-3h&to=now to see the impact we had on some maintenance on the wdqs updater but seems like metrics stopped being collected a couple hours ago
[16:36:15] <Lucas_WMDE>	 yeah, we noticed that too :/ but I’m in a meeting right now, haven’t looked into it yet
[16:36:27] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[16:36:45] <dcausse>	 sure no rush just wanted to let you know :)
[16:37:13] <wikibugs>	 (CR) Volans: "couple of nits inline" [puppet] - https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:39:35] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/891315 (owner: Volans)
[16:48:00] <wikibugs>	 (PS2) Ssingh: prometheus: refactor prometheus-gdnsd-stats.py to Python 3 [puppet] - https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309)
[16:48:56] <wikibugs>	 (CR) Ssingh: prometheus: refactor prometheus-gdnsd-stats.py to Python 3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[16:51:00] <Lucas_WMDE>	 dcausse: I think someone™ with the right permissions needs to give wmde-analytics-minutely.service on stat1007 a kick
[16:51:24] <Lucas_WMDE>	 I have enough permissions to see that it’s “active (exited) since … 14:55”, which is exactly when the stats stopped coming in (I think)
[16:51:28] <Lucas_WMDE>	 but not enough permissions to restart it
[16:52:11] <Lucas_WMDE>	 I don’t remember who I asked for such requests in the past though :S
[16:52:14] <wikibugs>	 SRE, Traffic, User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (BCornwall) Ah, my bad, I thought this *was* affecting bullseye. Oops. Sounds good then.
[16:54:19] <wikibugs>	 (PS1) Alexandros Kosiaris: Ship systemd user units, fix a bug [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/891330
[16:56:18] <Lucas_WMDE>	 filed T330311 for the wikidata stats issue
[16:56:18] <stashbot>	 T330311: wmde-analytics-minutely.service is no longer running on stat1007 - https://phabricator.wikimedia.org/T330311
[16:56:33] <Lucas_WMDE>	 if someone with root on stat1007 could look into this I’d be much obliged
[16:56:39] <Lucas_WMDE>	 not sure what the right phab tags would be
[16:56:57] <wikibugs>	 (PS1) Cwhite: logstash: remove SEVERITY_LABEL from syslog messages [puppet] - https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267)
[16:59:07] <dcausse>	 Lucas_WMDE: I'd ping someone in #wikimedia-analytics
[16:59:21] <Lucas_WMDE>	 thanks, will do
[17:00:26] <icinga-wm>	 PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7224 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[17:01:02] <icinga-wm>	 PROBLEM - Ensure cert-sync script runs successfully in the active node on acmechief1001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.done is 7258 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[17:05:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:05:59] <vgutierrez>	 jbond: cert-sync alert could be related to https://github.com/wikimedia/operations-puppet/commit/0eee4e1b5208815e367b9ff3bf14a810b53edcc1
[17:06:25] <vgutierrez>	 On my smartphone right now... no real git client at the moment
[17:09:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 10%: T329864', diff saved to https://phabricator.wikimedia.org/P44733 and previous config saved to /var/cache/conftool/dbconfig/20230222-170920-root.json
[17:09:26] <stashbot>	 T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864
[17:09:35] <wikibugs>	 (CR) Ssingh: "Does this have a task by any chance? Looks good otherwise." [dns] - https://gerrit.wikimedia.org/r/890908 (owner: Zabe)
[17:10:15] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "I think this might have caused T330311 – please take a look?" [puppet] - https://gerrit.wikimedia.org/r/890843 (owner: Jbond)
[17:10:45] <wikibugs>	 (PS2) Zabe: Add mobile domain for ombuds.wm.o [dns] - https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882)
[17:10:58] <wikibugs>	 SRE, Data-Persistence (work done), serviceops, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Ladsgroup)
[17:11:51] <wikibugs>	 SRE, Data-Persistence (work done), serviceops, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Ladsgroup)
[17:12:15] <wikibugs>	 (CR) Ssingh: [C: +2] Add mobile domain for ombuds.wm.o [dns] - https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) (owner: Zabe)
[17:12:33] <wikibugs>	 SRE, Data-Persistence (work done), serviceops, Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (Ladsgroup)
[17:13:01] <wikibugs>	 (PS3) Ssingh: Add mobile domain for ombuds.wm.o [dns] - https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) (owner: Zabe)
[17:15:07] <wikibugs>	 (CR) Dzahn: "yea, this very much has a task, since 2016. for some reason they never got added. https://phabricator.wikimedia.org/T152882" [dns] - https://gerrit.wikimedia.org/r/890908 (https://phabricator.wikimedia.org/T152882) (owner: Zabe)
[17:15:19] <wikibugs>	 (CR) Ottomata: "wgEventStreams is not an EventLogging config. It is for EventStreamConfig extension.  EventLogging and EventBus use wgEventStreams via the" [mediawiki-config] - https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[17:16:10] <wikibugs>	 (CR) Ladsgroup: [C: +2] Migrate EventLogging config into its own file (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[17:17:33] <sukhe>	 !log running authdns-update for T152882 / CR 890908
[17:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:38] <stashbot>	 T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882
[17:18:59] <wikibugs>	 (CR) Ssingh: [C: +2] prometheus: refactor prometheus-gdnsd-stats.py to Python 3 [puppet] - https://gerrit.wikimedia.org/r/891320 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[17:19:09] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] prometheus: update ensure_packages for node_gdnsd (bullseye) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/891312 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[17:24:21] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns4004.wikimedia.org with OS bullseye
[17:24:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 25%: T329864', diff saved to https://phabricator.wikimedia.org/P44734 and previous config saved to /var/cache/conftool/dbconfig/20230222-172424-root.json
[17:24:29] <stashbot>	 T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864
[17:24:30] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye executed with errors: - dns4004 (**FAIL**)   - Downtimed o...
[17:27:41] <wikibugs>	 (CR) BCornwall: [C: +1] varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[17:30:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:30:16] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:31:40] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): systemd::timer: unset RemainAfterExit again [puppet] - https://gerrit.wikimedia.org/r/891340 (https://phabricator.wikimedia.org/T330311)
[17:32:33] <Lucas_WMDE>	 ^ I could use some SRE-y person taking a look at this puppet change
[17:33:23] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Zabe)
[17:34:00] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "CCing people from Iee96d252fa." [puppet] - https://gerrit.wikimedia.org/r/891340 (https://phabricator.wikimedia.org/T330311) (owner: Lucas Werkmeister (WMDE))
[17:34:49] <wikibugs>	 (CR) Jbond: [C: +2] "thanks for the patch ill merge this now as its causing issues and will look again at the cloudbackup issue tomorrow" [puppet] - https://gerrit.wikimedia.org/r/891340 (https://phabricator.wikimedia.org/T330311) (owner: Lucas Werkmeister (WMDE))
[17:34:52] <wikibugs>	 (CR) Klausman: [C: +1] profile::pki::root_ca: add new pkis for the DSE k8s cluster [puppet] - https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[17:35:06] <Lucas_WMDE>	 jbond: thanks for the quick response!
[17:35:16] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:35:24] <Lucas_WMDE>	 I’m testing the After= behavior locally now
[17:35:36] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[17:36:00] <wikibugs>	 (CR) BCornwall: [V: +1 C: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39798/console" [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[17:36:50] <wikibugs>	 (CR) Elukey: [C: +2] profile::pki::root_ca: add new pkis for the DSE k8s cluster [puppet] - https://gerrit.wikimedia.org/r/891321 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[17:37:23] <wikibugs>	 (CR) BCornwall: [V: +1 C: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39799/console" [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[17:39:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: T329864', diff saved to https://phabricator.wikimedia.org/P44736 and previous config saved to /var/cache/conftool/dbconfig/20230222-173929-root.json
[17:39:35] <stashbot>	 T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864
[17:40:19] <Lucas_WMDE>	 I’ll check again after the puppet change is rolled out, but I suspect the service might also need to be stopped manually
[17:40:30] <sukhe>	 Lucas_WMDE: yes, I think so, and happy to do it fwiw
[17:40:32] <Lucas_WMDE>	 (and if true, that could apply to who knows how many units 😬)
[17:40:37] <Lucas_WMDE>	 ok
[17:40:42] <Lucas_WMDE>	 if so thanks in advance
[17:40:58] <jbond>	 Lucas_WMDE: no problem, i'm still looking at it; seems oneshot is causing an issue as well.  i'm going to roll that back and take another look at the whole issue tomorrow
[17:41:00] <Lucas_WMDE>	 I’m looking at `systemctl cat wmde-analytics-minutely.{service,timer}` on stat1007 and right now it still has the RemainAfterExit=yes, so puppet didn’t run yet
[17:41:05] <Lucas_WMDE>	 jbond: ok
[17:41:22] <sukhe>	 !log force puppet run on stat1007
[17:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:35] <sukhe>	 already in progress, should pick up the patch
[17:42:35] <Lucas_WMDE>	 yeah, now it’s gone from the unit file and the service is still active (exited)
[17:43:00] <sukhe>	 stopped
[17:43:02] <sukhe>	 timer should pick it up now
[17:43:18] <sukhe>	 cool, service is inactive again
[17:43:24] <Lucas_WMDE>	 cool
[17:43:40] <Lucas_WMDE>	 looking at list-timers, I think wmf_auto_restart_exim4 might be in the same situation
[17:43:48] <Lucas_WMDE>	 but I know literally nothing about that service
[17:43:52] <Lucas_WMDE>	 take that with a grain of salt ^^
[17:43:55] <sukhe>	 yeah me neither :) 
[17:45:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:45:09] <Lucas_WMDE>	 okay, we have multiple new data points in the wikidata analytics
[17:45:18] <Lucas_WMDE>	 so I think that part is working – thanks a lot jbond and sukhe!
[17:45:37] <jbond>	 Lucas_WMDE: i ran the job manually about ~30 mins ago
[17:45:58] <Lucas_WMDE>	 I’m looking at 4 new points in the last 4 minutes
[17:46:09] <jbond>	 yes it looks like it's running
[17:46:09] <jbond>	 Wed 2023-02-22 17:46:00 UTC  4s left             Wed 2023-02-22 17:45:01 UTC  53s ago            wmde-analytics-minutely.timer                   wmde-analytics-minutely.service
[17:46:26] <Lucas_WMDE>	 do you want to leave the task open to check on other timers, or should I close it?
[17:46:40] <ottomata>	 jbond: Lucas_WMDE we are encountering an alert that i think is related to what you are talking about
[17:46:43] <jbond>	 no please leave it open
[17:46:44] <ottomata>	 cc mforns 
[17:46:46] <Lucas_WMDE>	 (IIUC it would’ve affected any timer that ran in the past… two hours or so)
[17:46:50] <ottomata>	 https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[17:46:50] <Lucas_WMDE>	 ok ack
[17:46:59] <ottomata>	 haven't totally grokked
[17:47:00] <ottomata>	 but
[17:47:00] <jbond>	 Lucas_WMDE: yes i agree
[17:47:13] <ottomata>	 should i just run puppet (since something has been reverted) and see if my timers work again?
[17:47:20] <jbond>	 ottomata: systemd timers haven't been running for the past few hours is the tl;dr
[17:47:37] <jbond>	 yes give that a go and ping me if not
[17:47:41] <ottomata>	 phewf that is the symptom i see too, thought something was wrong with jobs
[17:47:41] <ottomata>	 okay
[17:48:26] <wikibugs>	 (PS1) Elukey: Add fake intermediate PKI key for DSE k8s [labs/private] - https://gerrit.wikimedia.org/r/891344 (https://phabricator.wikimedia.org/T330261)
[17:48:38] <icinga-wm>	 RECOVERY - Ensure cert-sync script runs successfully in the active node on acmechief1001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.done is 16 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[17:48:48] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Add fake intermediate PKI key for DSE k8s [labs/private] - https://gerrit.wikimedia.org/r/891344 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[17:49:58] <icinga-wm>	 RECOVERY - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.status is 96 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[17:50:46] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[17:51:27] <hashar>	 dduvall: I have done a bit of log triage and the train looks quiet ;)
[17:51:46] <dduvall>	 hashar: excellent :)
[17:52:02] <hashar>	 I am calling it a day.  See you all tomorrow!
[17:52:13] <ottomata>	 jbond: puppet run on an-launcher1002, i see the RemainAfterExit being removed
[17:52:20] <ottomata>	 i would expect 
[17:52:21] <ottomata>	 sudo systemctl start gobblin-webrequest.service
[17:52:27] <ottomata>	 to run the service that the timer does
[17:52:28] <ottomata>	 but
[17:52:32] <ottomata>	 doing so, nothing happens
[17:52:36] <Lucas_WMDE>	 what does status look like?
[17:52:39] <wikibugs>	 (PS4) Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261)
[17:52:39] <jbond>	 ottomata: you need to restart it
[17:52:41] <wikibugs>	 (PS1) Elukey: Add K8s DSE intermediate PKI configs and public certs [puppet] - https://gerrit.wikimedia.org/r/891346 (https://phabricator.wikimedia.org/T330261)
[17:52:42] <sukhe>	 ottomata: (not jbond) but stop the service first
[17:52:47] <sukhe>	 that worked for us on a few other ones
[17:52:50] <sukhe>	 and then let the timer pick it up again
[17:52:50] <ottomata>	 we need to restart all systemd timer schedule services?
[17:53:03] <jbond>	 ottomata: yes im writing a cumin now
[17:53:24] <ottomata>	 okay, yeah thank you there are many 10s (maybe 100s?)
[17:54:06] <jbond>	 ottomata: the following one-liner should recover things
[17:54:07] <jbond>	 systemctl list-timers | awk '/n\/a/ {print $NF}' | while read line ; do echo sudo systemctl restart  $line ; done 
[17:54:25] <Lucas_WMDE>	 dcausse: FYI, our edit rate stats are back but I’m not yet seeing any query service lag in the maxlag panel – not sure if that part is working properly
[17:54:31] <Lucas_WMDE>	 though maybe the ^ ongoing restarts will fix that
[17:54:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: T329864', diff saved to https://phabricator.wikimedia.org/P44737 and previous config saved to /var/cache/conftool/dbconfig/20230222-175434-root.json
[17:54:39] <stashbot>	 T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864
[17:55:10] <Lucas_WMDE>	 (I don’t know if there’s also a systemd timer somewhere in the query service lag pipeline or not)
[17:55:39] <wikibugs>	 (CR) Dzahn: [C: +2] site: differentiate between both serviceops teams for insetup roles [puppet] - https://gerrit.wikimedia.org/r/890014 (owner: Dzahn)
[17:55:47] <wikibugs>	 (PS3) Dzahn: site: differentiate between both serviceops teams for insetup roles [puppet] - https://gerrit.wikimedia.org/r/890014
[17:56:33] <ottomata>	 jbond: if i don't want each service to run via that command, can I just stop instead of restart?
[17:56:40] <ottomata>	 and the next scheduled run will pick it up?
[17:57:20] <jbond>	 ottomata: could you try stopping the service, restarting the timer, and see what list-timers shows?
[17:57:56] <wikibugs>	 (PS5) Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261)
[17:58:18] <ottomata>	 jbond: 
[17:58:19] <ottomata>	 e.gh.
[17:58:20] <ottomata>	 e.g.
[17:58:21] <ottomata>	 Wed 2023-02-22 18:05:00 UTC  6min left     n/a                          n/a           gobblin-event_default_test.timer                              gobblin-event_default_test.service
[17:58:28] <ottomata>	 so i think it will schedule it?
[17:58:43] <jbond>	 ottomata: what server is this? i'll take a quick look
[17:58:54] <ottomata>	 that one is an-test-coord1001.eqiad.wmnet
[17:59:04] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39800/console" [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[17:59:49] <dcausse>	 Lucas_WMDE: I think wdqs lag -> maxlag is collected via a mw maint script?
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1800)
[18:00:10] <jbond>	 ottomata: yes that looks good to me
[18:00:14] <Lucas_WMDE>	 it seems to be some prometheus-based thing in the Wikidata.org extension
[18:00:20] <jbond>	 i'll use that fix for the rest of the fleet as well, thanks
[18:01:48] <Lucas_WMDE>	 dcausse: you’re right, there is a maintenance script too, I thought it was querying prometheus live in the request
[18:01:53] <Lucas_WMDE>	 so this is probably the timer https://gerrit.wikimedia.org/g/operations/puppet/+/baa0836c8405f3ad110935655e9039b27dd12de7/modules/profile/manifests/mediawiki/maintenance/wikidata.pp#25
[18:01:59] <ottomata>	 ok thanks jbond, if you are running a full fleet fix
[18:02:00] <Lucas_WMDE>	 that will hopefully be fixed soon
[18:02:03] <ottomata>	 i'll wait for your cumin run
[18:03:13] <wikibugs>	 (PS1) Dzahn: Revert "ci::firewall: allow http monitoring from prometheus hosts" [puppet] - https://gerrit.wikimedia.org/r/891291
[18:03:22] <wikibugs>	 (CR) CI reject: [V: -1] Revert "ci::firewall: allow http monitoring from prometheus hosts" [puppet] - https://gerrit.wikimedia.org/r/891291 (owner: Dzahn)
[18:05:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[18:05:20] <wikibugs>	 SRE, Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (BCornwall) Open→Stalled
[18:05:53] <wikibugs>	 (CR) Elukey: [V: +1] "Better now! Ready for a review ;)" [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[18:06:17] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:06:32] <wikibugs>	 SRE, Traffic-Icebox: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (BCornwall)
[18:06:35] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall) Open→In progress
[18:06:45] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (phaultfinder)
[18:08:17] <ottomata>	 jbond:  i see changed list-timers output, guessing you have run your command?
[18:08:48] <jbond>	 !log stop all failed timer services and restart the corresponding timer unit
[18:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:52] <jbond>	 ottomata: yes :)
[18:08:55] <jbond>	 just ran now
[18:08:57] <ottomata>	 ty!
[18:09:29] <jbond>	 no, thanks everyone
[18:09:49] <Lucas_WMDE>	 thanks a lot!
[18:10:00] <jbond>	 Lucas_WMDE: sukhe: hopefully the issues are all resolved now
[18:10:15] <sukhe>	 jbond: all good on the issues we noticed so far, yep
[18:10:15] <sukhe>	 thanks!
[18:10:28] <jbond>	 sukhe: specifically for on call, it's possible that there may be some other issues that bubble up from this so just keep it in mind
[18:10:29] <Lucas_WMDE>	 I’m watching https://www.wikidata.org/w/api.php?action=query&format=json&maxlag=-1, hopefully the type will change from db to wikibase-queryservice soon :)
[18:10:34] <icinga-wm>	 PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[18:10:41] <jbond>	 and i'll be around for the next few hours so please ping if needed
[18:10:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'depool db1112', diff saved to https://phabricator.wikimedia.org/P44738 and previous config saved to /var/cache/conftool/dbconfig/20230222-181046-ladsgroup.json
[18:10:52] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump-s4.service,cirrussearch-dump-s8.service,wikidatajson-dump.service,wikidatardf-all-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:11:42] <wikibugs>	 SRE, Infrastructure-Foundations: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run - https://phabricator.wikimedia.org/T330318 (ssingh)
[18:11:52] <icinga-wm>	 PROBLEM - Check systemd state on poolcounter1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:10] <sukhe>	 jbond: thanks, brett and mutante ^ please be aware 
[18:12:34] <brett>	 ack
[18:12:49] <sukhe>	 (here too, if that matters)
[18:13:18] <brett>	 it does matter :*
[18:13:24] <sukhe>	 haha
[18:15:48] <icinga-wm>	 PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[18:16:45] <mutante>	 ok, I dont know what exactly is happening but that looks like the puppetmaster is down?
[18:18:39] <sukhe>	 yeah the https check is failing?
[18:18:40] <mutante>	 jbond: seems like that would qualify for the "ping if needed"?
[18:19:28] <icinga-wm>	 RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 6.644 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[18:19:38] <sukhe>	 cool, probably the service restarts )
[18:20:08] <sukhe>	 mutante: https://puppetboard.wikimedia.org/report/puppetmaster2001.codfw.wmnet/e39e9d93d54be5a1f259f4d7a59d11c69775e0b8
[18:20:09] <mutante>	 ok, I don't know what restarts but this looks much better
[18:20:19] <sukhe>	 we might see a few other such alerts too
[18:20:23] <Lucas_WMDE>	 I also have a service that didn’t restart, but puppetmaster down sounds more important
[18:20:41] <sukhe>	 Lucas_WMDE: not down but a monitoring check that was failing
[18:21:11] <Lucas_WMDE>	 ok, I see the recovery now
[18:21:29] <Lucas_WMDE>	 in that case – mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002 is still active (exited)
[18:21:30] <taavi>	 what's up with the systemd timers? I have several cloud vps instances which don't seem to be running the puppet timer and am wondering if that's related
[18:21:33] <Lucas_WMDE>	 cc jbond
[18:21:39] <jbond>	 sukhe: mutante: i think that the puppet masters may be getting a bit of extra load due to restarting the timers
[18:21:45] <jbond>	 i'll check on it in 30 mins
[18:21:54] <sukhe>	 taavi: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/d7a044a76158c80d333ae590d37cb3b6542c184d revert for that
[18:22:13] <Lucas_WMDE>	 taavi: you might need to manually stop the services activated by the timers (after the puppet change is rolled out)
[18:22:22] <Lucas_WMDE>	 if systemctl status says it’s “active (exited)”
[18:22:27] <taavi>	 hm. that fix is not being applied because puppet is not running due to that bug :/
[18:22:33] <Lucas_WMDE>	 hmm. ok that is worse
[18:22:52] <taavi>	 yeah, puppet-agent-timer.service says 'active (exited)'
[18:23:09] <taavi>	 so we would need to cumin 'systemctl stop puppet-agent-timer.service' or something similar? (or run puppet via cumin)
[18:23:17] <sukhe>	 er
[18:23:37] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:23:59] <icinga-wm>	 RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 2.629 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[18:24:29] <jbond>	 taavi: this is what i did on production https://phabricator.wikimedia.org/P44739.  i can look at cloud in a bit but still some fall out in production
[18:24:29] <sukhe>	 taavi: which host is that?
[18:24:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[18:24:51] <sukhe>	 for the prod hosts at least, I see no issues on a few random ones I tried
[18:24:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:24:58] <taavi>	 sukhe: any WMCS instance I log in to
[18:25:51] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01036 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:26:50] <sukhe>	 guess I was wrong on the failures
[18:27:01] <sukhe>	 but then this is expected because it seems like the hosts can't reach puppetmasters
[18:27:08] <jbond>	 sukhe: i think that is triggered from when the puppetmasters were down
[18:27:10] <sukhe>	 yep
[18:27:14] <jbond>	 running puppet on failed nodes now
[18:27:18] <sukhe>	 request https://puppet:8140//puppet/v3/file_metadatas/modules/admin/home/... timed out after 60.263 seconds
[18:28:33] <Lucas_WMDE>	 taavi: I guess you’d also need to cumin something like `sed -i '/RemainAfterExit/d' /lib/systemd/system/puppet-agent-timer.service`
[18:28:44] <Lucas_WMDE>	 (but probably don’t just take my word for it)
[18:30:40] <jbond>	 Lucas_WMDE: mwmaint looks good to me
[18:30:46] <Lucas_WMDE>	 jbond: I’m now happy with mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002, thank you :)
[18:30:49] <Lucas_WMDE>	 jinx ^^
[18:30:57] <Lucas_WMDE>	 dcausse: mediawiki should have query service maxlag information again
[18:31:07] <jbond>	 i think i'll run the two fix commands above again in 30 mins once puppet has run everywhere to make sure we get any slackers
[18:31:11] <jbond>	 :)
[18:31:17] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:33:14] <mutante>	 !log planet* - stopping and restarting all the timers for the various languages, commands from https://phabricator.wikimedia.org/P44739
[18:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:10] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review, Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (Ottomata) Also fine with my work being licensed at Apache 2.0.  Thank you!
[18:42:37] <icinga-wm>	 RECOVERY - Check systemd state on poolcounter1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:47] <wikibugs>	 SRE, noc.wikimedia.org, serviceops: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Dzahn) a:Dzahn→None removing assignee based on automated mail from Andre pointing out it has been assigned...
[18:46:29] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44740 and previous config saved to /var/cache/conftool/dbconfig/20230222-184908-root.json
[18:49:55] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (SHust) Sharing Shopify's latest update below. If anyone has any ideas, please send them my way since I still have no clue what to do!  I made a few tests and found the issue. The subd...
[18:50:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (Jclark-ctr) @MatthewVernon  Can you advise when and what Server you would like to test in
[18:50:21] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-discovery-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:35] <icinga-wm>	 PROBLEM - Check systemd state on restbase1033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:39] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s2.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s6.service,mediawiki_job_growthexperiments-refreshLinkRecommendati
[18:50:39] <icinga-wm>	 ervice,mediawiki_job_growthexperiments-updateMenteeData-s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:51:32] <wikibugs>	 (CR) Slyngshede: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/891318 (https://phabricator.wikimedia.org/T320797) (owner: Muehlenhoff)
[18:52:07] <icinga-wm>	 PROBLEM - Check systemd state on registry1004 is CRITICAL: CRITICAL - degraded: The following units failed: build-homepage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:52:13] <icinga-wm>	 PROBLEM - Check systemd state on cp6004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:52:19] <wikibugs>	 (CR) Gergő Tisza: [C: +1] Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[18:53:20] <wikibugs>	 (CR) Gergő Tisza: [C: +1] growthexperiments: Run refreshPraiseworthyMentees daily [puppet] - https://gerrit.wikimedia.org/r/891285 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[18:54:12] <wikibugs>	 (CR) Slyngshede: [C: +2] P:IDM Ensure that social auth can lookup username. [puppet] - https://gerrit.wikimedia.org/r/891307 (owner: Slyngshede)
[18:58:12] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[19-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001
[19:00:05] <jouncebot>	 hashar and dduvall: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1900). nyaa~
[19:00:05] <jouncebot>	 hashar and dduvall: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T1900).
[19:03:21] <icinga-wm>	 RECOVERY - Check systemd state on restbase1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44741 and previous config saved to /var/cache/conftool/dbconfig/20230222-190413-root.json
[19:09:33] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[19:11:23] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:11:49] <wikibugs>	 (CR) Ladsgroup: [C: +2] "I updated OMG and nothing has grant on 10.64 and 10.192 anymore. All are 10.%" [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[19:11:54] <wikibugs>	 (PS2) Ladsgroup: mariadb: Update grants to use wikiuser@10.% only [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185)
[19:11:58] <wikibugs>	 (CR) Ladsgroup: [V: +2] mariadb: Update grants to use wikiuser@10.% only [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185) (owner: Ladsgroup)
[19:14:30] <wikibugs>	 (CR) Ottomata: [C: +2] "We'll also need to add an-airflow1005 to the list of refinery scap targets." [puppet] - https://gerrit.wikimedia.org/r/890906 (https://phabricator.wikimedia.org/T329870) (owner: Ebernhardson)
[19:19:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44742 and previous config saved to /var/cache/conftool/dbconfig/20230222-191918-root.json
[19:32:52] <wikibugs>	 (Abandoned) Ottomata: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: Aqu)
[19:34:07] <mforns>	 !log restarted the following an-launcher1002 timers, which seemed stuck (next run = n/a): gobblin-webrequest.timer, reportupdater-browser.timer, reportupdater-reference-previews.timer, refine_event.timer, refine_eventlogging_legacy.timer
[19:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44743 and previous config saved to /var/cache/conftool/dbconfig/20230222-193422-root.json
[19:38:59] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01036 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[19:48:47] <icinga-wm>	 RECOVERY - Check systemd state on registry1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:38] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:07:36] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:08:31] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[20:08:36] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:14:26] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase10[19-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001
[20:17:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[26-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001
[20:22:39] <wikibugs>	 (PS49) Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:23:01] <wikibugs>	 (CR) CI reject: [V: -1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:30:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Jclark-ctr) ms-be1072. A4 U27  cableid 20220021 port 42 ms-be1073. B4. U10 cableid 5018         port 12 ms-be1074. E3. U5 cableid 20220227 Port 5 ms-be1075. F3. U1 cableid 20...
[20:30:32] <wikibugs>	 (PS50) Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:30:55] <wikibugs>	 (CR) CI reject: [V: -1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:33:21] <wikibugs>	 (PS51) Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:33:43] <wikibugs>	 (CR) CI reject: [V: -1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:35:07] <wikibugs>	 (PS52) Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:35:28] <wikibugs>	 (CR) CI reject: [V: -1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:36:27] <wikibugs>	 (PS53) Ottomata: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:36:49] <wikibugs>	 (CR) CI reject: [V: -1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[20:37:27] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[26-27].eqiad.wmnet: Replace expiring keys/certs - eevans@cumin1001
[20:47:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (Rmaung)
[20:51:57] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:31] <icinga-wm>	 PROBLEM - puppet last run on puppetdb2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:56:56] <wikibugs>	 (PS1) Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/890365
[20:57:33] <wikibugs>	 (CR) CI reject: [V: -1] Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/890365 (owner: Zabe)
[20:57:35] <wikibugs>	 (Abandoned) Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/890365 (owner: Zabe)
[20:58:04] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (BCornwall) Hi, @SHust. We appear to be running in circles here! What we're after has nothing to do with DNS/domain names/CNAME/A records, etc. This is entirely about adjusting a secur...
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230222T2100).
[21:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:00:12] * urbanecm waves
[21:00:24] <urbanecm>	 jouncebot's right
[21:00:29] <urbanecm>	 but i'll deploy a few things anyway
[21:00:49] <icinga-wm>	 PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:01:33] <wikibugs>	 (PS2) Urbanecm: Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444)
[21:01:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[21:02:19] <wikibugs>	 (Merged) jenkins-bot: Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/891308 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[21:02:46] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:891308|Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis (T322444)]]
[21:02:51] <stashbot>	 T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444
[21:03:06] <wikibugs>	 (PS1) Urbanecm: Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444)
[21:03:15] <wikibugs>	 (CR) Urbanecm: [C: +2] Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[21:03:33] <wikibugs>	 (PS1) Zabe: admin: Update zabe's .gitconfig [puppet] - https://gerrit.wikimedia.org/r/891360
[21:03:37] <wikibugs>	 (CR) Urbanecm: [C: +2] Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[21:10:20] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:891308|Growth: Set GEPersonalizedPraiseBackendEnabled to true on pilot wikis (T322444)]] (duration: 07m 33s)
[21:10:25] <stashbot>	 T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444
[21:15:12] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (Dzahn) >>! In T128559#8638381, @SHust wrote: > Sharing Shopify's latest update below. If anyone has any ideas, please send them my way since I still have no clue what to do!   Hi, tha...
[21:15:35] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:27] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:27] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (Dzahn) P.S. Yea, just listen to what @BCornwall said above. That is going to make it less confusing. And thanks for doing this!
[21:23:34] <wikibugs>	 (Merged) jenkins-bot: Build backend for PersonalizedPraise [extensions/GrowthExperiments] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/891293 (https://phabricator.wikimedia.org/T322444) (owner: Urbanecm)
[21:24:16] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:891293|Build backend for PersonalizedPraise (T322444)]]
[21:24:21] <stashbot>	 T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444
[21:31:39] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:891293|Build backend for PersonalizedPraise (T322444)]] (duration: 07m 22s)
[21:31:44] <stashbot>	 T322444: Personalized praise: backend data and logic for the new mentor dashboard module - https://phabricator.wikimedia.org/T322444
[21:36:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:39:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:40:16] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[21:40:22] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:45:34] <wikibugs>	 ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330343 (phaultfinder)
[22:11:40] <wikibugs>	 (CR) JHathaway: [C: +2] admin: Update zabe's .gitconfig [puppet] - https://gerrit.wikimedia.org/r/891360 (owner: Zabe)
[22:25:23] <wikibugs>	 SRE, Traffic: create a puppetized abstraction for haproxy blocklist hysteresis - https://phabricator.wikimedia.org/T329331 (BCornwall) p:Triage→Low
[22:27:28] <wikibugs>	 SRE, Analytics-Radar, Data-Engineering-Icebox, Traffic: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015 (BCornwall) p:Medium→Triage
[22:29:26] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q4), Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (BCornwall) p:Triage→Low
[22:38:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Jclark-ctr)
[22:39:13] <wikibugs>	 (PS1) Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/891386
[22:40:14] <wikibugs>	 (PS2) Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/891386
[22:40:37] <wikibugs>	 ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330343 (phaultfinder)
[22:40:38] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/891386 (owner: Zabe)
[22:41:21] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/891386 (owner: Zabe)
[22:46:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:48:04] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/891387 (https://phabricator.wikimedia.org/T230382)
[22:48:06] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/891387 (https://phabricator.wikimedia.org/T230382) (owner: Zabe)
[22:48:49] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/891387 (https://phabricator.wikimedia.org/T230382) (owner: Zabe)
[22:56:08] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q4), Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (BCornwall) Looks like `interpret_wildcard()` in [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/media...
[22:56:31] <logmsgbot>	 !log zabe@deploy1002 Synchronized wmf-config/interwiki.php: T230382 (duration: 07m 06s)
[22:56:36] <stashbot>	 T230382: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382
[22:57:24] <wikibugs>	 (PS1) Dzahn: switch planet from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091)
[22:57:30] <wikibugs>	 SRE, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
[23:00:14] <wikibugs>	 (CR) Dzahn: "openssl x509 -noout -text -in planet.discovery.wmnet.crt | grep DNS" [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:07:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[23:07:54] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[23:08:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[23:10:33] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:10:57] <wikibugs>	 (CR) Dzahn: [C: +1] "double checked and there is no (more) concept of an "active server" for planet. the timers and updates simply run in both DCs all this tim" [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:27:13] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q4), Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (BCornwall) Looking into it further, it seems this is a very possible change! nginx mappings/site names support wildcards.  Pulling back a bit, does anyth...
[23:41:14] <wikibugs>	 (PS2) Dzahn: switch planet from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091)
[23:42:03] <wikibugs>	 (PS3) Dzahn: switch planet from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/891369 (https://phabricator.wikimedia.org/T330091)
[23:55:58] <wikibugs>	 (PS1) Zabe: Initial configuration for azwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/891378 (https://phabricator.wikimedia.org/T306015)
[23:59:07] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic: Rename deployment-cache-(text|upload)0x to deployment-cp0x - https://phabricator.wikimedia.org/T280393 (BCornwall) cp hosts have now been updated to bullseye, FYI