[00:00:28] (03PS1) 10Superpes15: [idwikiquote] Change the sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) [00:19:04] (03PS1) 10Superpes15: [idwikiquote] Change the logo and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937553 (https://phabricator.wikimedia.org/T341177) [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936825 [00:38:43] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936825 (owner: 10TrainBranchBot) [00:50:16] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17,22,23,24,29,32].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [00:52:54] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase10[17,22,23,24,29,32].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [00:53:22] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936825 (owner: 10TrainBranchBot) [00:53:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936825 (owner: 10TrainBranchBot) [00:53:36] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17,22,23,24,29,32].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [01:06:51] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:11:19] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 
(blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:31:26] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[17,22,23,24,29,32].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [01:50:19] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[18,25,26,27,30,33].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [01:59:04] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase10[18,25,26,27,30,33].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [02:08:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:35] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:33:09] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[25,26,27,30,33].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [02:36:55] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:37:26] (03CR) 10Subramanya Sastry: [C: 03+1] Set default for 
UseLegacyMediaStyles and disable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: 10Arlolra) [02:41:21] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:05:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[25,26,27,30,33].eqiad.wmnet: Applying JVM update - eevans@cumin1001 [03:36:43] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:41:11] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:48:31] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [04:36:55] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:41:23] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:07:31] (03CR) 10Anzx: "also 
requested project namespace to be Wikikutip" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [05:34:48] (03PS1) 10Ryan Kemper: wdqs: disable alerts for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/937572 [05:36:41] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:37:49] (03PS2) 10Ryan Kemper: wdqs: disable alerts for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/937572 (https://phabricator.wikimedia.org/T332314) [05:40:40] (03PS1) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [05:41:01] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [05:41:07] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:44:04] (03PS2) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [05:44:28] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T0600) [06:00:05] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T0600). nyaa~ [06:06:39] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:11:09] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:16:15] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (10ayounsi) [06:16:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) [06:25:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:28:41] <_joe_> not sure why it's firing [06:28:54] <_joe_> there is an increase in how busy it is, but nothing that could justify an alert [06:28:57] <_joe_> !incidents [06:28:58] 3860 (UNACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad) [06:29:03] <_joe_> !ack 3860 [06:29:03] 3860 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad) [06:30:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:33:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:39:32] (03PS1) 10Stevemunene: Change analytics_test airflow to use an-test-client1002 [puppet] - 10https://gerrit.wikimedia.org/r/937577 (https://phabricator.wikimedia.org/T341700) [06:41:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:46:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:54] (03PS1) 10Santhosh: Update cxserver to 2023-07-13-063245-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937578 (https://phabricator.wikimedia.org/T340953) [06:51:25] (03PS1) 10Superpes15: [idwikiquote] Change the name of the project ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937579 (https://phabricator.wikimedia.org/T341177) [06:52:06] (03CR) 10CI reject: [V: 04-1] [idwikiquote] Change the name of the project ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937579 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [06:52:29] (03PS2) 
10Superpes15: [idwikiquote] Change the name of the project ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937579 (https://phabricator.wikimedia.org/T341177) [06:53:31] (Not accepting/receiving prefixes from anycast BGP peer) resolved: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [07:00:04] Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T0700). [07:00:05] Superpes: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:08] morning! it looks like we have one deployer with 5 config patches to go today. no trainees are signed up for the window, though two are in the queue to get signed up. [07:00:17] Hi! :) [07:00:28] Superpes: if I recall from last time you do not (yet) self-deploy, is that right? [07:01:06] Yep lol I need to find the time to schedule a training :'( [07:01:56] ok! [07:02:39] is the order of the patches on the calendar the order you'd like them deployed?\ [07:03:04] apergos If possible yes! 
[07:03:17] ok, we'll start with the first one then [07:03:26] (03CR) 10ArielGlenn: [C: 03+2] [knwiki] Reverting the temporary logo and updating logo/wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937183 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [07:04:04] (03Merged) 10jenkins-bot: [knwiki] Reverting the temporary logo and updating logo/wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937183 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [07:05:25] !log ariel@deploy1002 Started scap: Backport for [[gerrit:937183|[knwiki] Reverting the temporary logo and updating logo/wordmark/tagline (T338136)]] [07:05:28] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136 [07:07:00] !log ariel@deploy1002 superpes and ariel: Backport for [[gerrit:937183|[knwiki] Reverting the temporary logo and updating logo/wordmark/tagline (T338136)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [07:07:09] Testing [07:07:27] excellent [07:08:10] It's fine! Thanks apergos :) [07:08:34] proceeding with scap to production [07:12:11] 937551 and 937579 should be merged together but, if you prefer, I can also make a single patch for both changes (maybe it's better)! 
[07:12:33] yes please [07:13:17] (03PS2) 10Superpes15: [idwikiquote] Change the sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) [07:13:39] (03PS3) 10Superpes15: [idwikiquote] Change the sitename and the project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) [07:14:00] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:937183|[knwiki] Reverting the temporary logo and updating logo/wordmark/tagline (T338136)]] (duration: 08m 35s) [07:14:04] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136 [07:14:11] please test your patch in production now [07:14:29] (03CR) 10Superpes15: [idwikiquote] Change the sitename and the project namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [07:14:41] Superpes: ^ [07:14:59] (03Abandoned) 10Superpes15: [idwikiquote] Change the name of the project ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937579 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [07:15:45] apergos It works also in production :) [07:15:51] great! [07:16:20] moving on to 937540, ok? ([mywiki] Create 'autopatrolled' and 'patroller' usergroups) [07:16:51] Wonderful! 
Thanks :) [07:16:52] Superpes: [07:16:54] ok [07:16:55] (03CR) 10ArielGlenn: [C: 03+2] [mywiki] Create 'autopatrolled' and 'patroller' usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937540 (https://phabricator.wikimedia.org/T341026) (owner: 10Superpes15) [07:17:33] (03Merged) 10jenkins-bot: [mywiki] Create 'autopatrolled' and 'patroller' usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937540 (https://phabricator.wikimedia.org/T341026) (owner: 10Superpes15) [07:18:25] !log ariel@deploy1002 Started scap: Backport for [[gerrit:937540|[mywiki] Create 'autopatrolled' and 'patroller' usergroups (T341026)]] [07:18:28] T341026: create autopatroller and patroller group on mywiki - https://phabricator.wikimedia.org/T341026 [07:19:46] please update the deployment calendar with the new combined patch, Superpes [07:19:58] !log ariel@deploy1002 ariel and superpes: Backport for [[gerrit:937540|[mywiki] Create 'autopatrolled' and 'patroller' usergroups (T341026)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:20:00] ah you have already done it! excellent [07:20:07] Yep lol :D [07:20:21] please test on mwdebug1002 the current patch [07:20:57] Tested! Everything is fine apergos :) [07:21:05] great! continuing. [07:21:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, and 2 others: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10hashar) //[[ https://fr.wiktionary.org/wiki/p%E2%80%99t%C3%AAt_ben_qu%E2%80%99oui,_p%E2%80%99t%C3%AAt_ben_qu%E2%80%99non | m'ybe yes, m'... 
[07:27:03] (03PS3) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [07:27:04] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:937540|[mywiki] Create 'autopatrolled' and 'patroller' usergroups (T341026)]] (duration: 08m 39s) [07:27:07] T341026: create autopatroller and patroller group on mywiki - https://phabricator.wikimedia.org/T341026 [07:27:12] please test in production Superpes [07:27:30] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [07:27:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:27:39] Good also in production :) Thanks! [07:28:05] great, continuing on [07:28:48] 937551 [idwikiquote] Change the sitename and the project namespace yes? [07:29:01] Superpes: [07:29:33] Yes! 
[07:29:41] (03CR) 10ArielGlenn: [C: 03+2] [idwikiquote] Change the sitename and the project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [07:30:22] (03Merged) 10jenkins-bot: [idwikiquote] Change the sitename and the project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937551 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [07:30:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42445/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [07:31:03] !log ariel@deploy1002 Started scap: Backport for [[gerrit:937551|[idwikiquote] Change the sitename and the project namespace (T341177)]] [07:31:08] T341177: Change the Indonesian Wikiquote's name and project namespace from Wikiquote to Wikikutip - https://phabricator.wikimedia.org/T341177 [07:31:49] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) The last host connected to asw-b1-codfw (the prod switch) is cloudweb2002-dev (https://netbox.wikimedia.org/dci... 
[07:32:31] there is a merge conflict for the last patch, Superpes; please resolve it while we wait [07:32:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:32:43] !log ariel@deploy1002 ariel and superpes: Backport for [[gerrit:937551|[idwikiquote] Change the sitename and the project namespace (T341177)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:33:03] or well, please test the current patch on mwdebug1002 and then do the merge conflict :-D [07:33:25] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42446/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [07:34:43] Yep, it seems to be working (but maybe some script should be run after the deploy?) apergos [07:34:54] Maybe not lol [07:35:19] (03PS2) 10Superpes15: [idwikiquote] Change the logo and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937553 (https://phabricator.wikimedia.org/T341177) [07:35:38] there is probably a script to be run on mwmaint but I don't remember which one. 
go ahead and fix up the merge conflict on the last patch and lemme see which thing needs to be run [07:35:54] (I don't think I should be the person running that though) [07:36:29] maybe namespaceDupes.php [07:36:36] Yep I was thinking about it [07:36:41] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:39:26] Well, I don't know if this is the case, but you could run 'mwscript maintenance/namespaceDupes.php --wiki idwikiquote --fix' anyway [07:40:11] I should not be the person running mwmaint scripts for someone's patches though [07:40:34] as the deployer for the window, my job is to get your patch safely out into production; running followon scripts must be handled by others [07:40:47] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:937551|[idwikiquote] Change the sitename and the project namespace (T341177)]] (duration: 09m 43s) [07:40:50] T341177: Change the Indonesian Wikiquote's name and project namespace from Wikiquote to Wikikutip - https://phabricator.wikimedia.org/T341177 [07:41:02] please test your patch in production now! [07:41:13] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:42:09] Looks fine!! [07:42:20] great! [07:42:41] apergos: namespaceDupes will need running at some point once deployed yes [07:42:51] I think normally it's just been done in the window [07:43:06] (03CR) 10Vgutierrez: [C: 03+1] "looking good, please fix the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [07:43:38] my understanding is that this is not the deployer's job. 
not to make things more difficult for anyone, just to have clear scope [07:43:43] anyways, continuing on to the last patch [07:43:49] (03CR) 10ArielGlenn: [C: 03+2] [idwikiquote] Change the logo and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937553 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [07:44:40] (03Merged) 10jenkins-bot: [idwikiquote] Change the logo and add a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937553 (https://phabricator.wikimedia.org/T341177) (owner: 10Superpes15) [07:45:25] !log ariel@deploy1002 Started scap: Backport for [[gerrit:937553|[idwikiquote] Change the logo and add a wordmark (T341177)]] [07:45:29] (03PS4) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [07:47:06] !log ariel@deploy1002 ariel and superpes: Backport for [[gerrit:937553|[idwikiquote] Change the logo and add a wordmark (T341177)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [07:47:10] T341177: Change the Indonesian Wikiquote's name and project namespace from Wikiquote to Wikikutip - https://phabricator.wikimedia.org/T341177 [07:47:18] Superpes: please test on mwdebug1002 [07:48:07] apergos It works! Thanks :) [07:48:14] continuing! [07:51:29] https://phabricator.wikimedia.org/T334277 sample task for getting the namespace dups script run, if you do not yourself have access to the mwmaint hosts and are not working with someone who does... [07:51:55] (php fpm restart in progress, testing will be soon!) [07:53:46] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:937553|[idwikiquote] Change the logo and add a wordmark (T341177)]] (duration: 08m 20s) [07:53:46] (03CR) 10Jelto: [C: 03+2] "lgtm, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/937514 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn) [07:53:49] T341177: Change the Indonesian Wikiquote's name and project namespace from Wikiquote to Wikikutip - https://phabricator.wikimedia.org/T341177 [07:53:57] please test in production! [07:54:20] Good also in production!! [07:54:39] ok! please follow up with others that work with you, or on phab for the maintenance script run and possible purges [07:54:54] (03CR) 10JMeybohm: [C: 04-1] kask: make TLS configuration a secret (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 (owner: 10Hnowlan) [07:54:56] Many thanks for your help! Yep I'll do! [07:55:24] that concludes our deployment backport window today, see everyone next time! [07:55:46] !log UTC morning backport and config deployment window done [07:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:18] (03PS14) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [07:59:37] I think I broke the CI Jenkins [08:00:05] dduvall and hashar: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T0800) [08:00:28] er just in time for the train? 
:-D [08:13:13] (03PS4) 10JMeybohm: cfssl::cert: Add support for notifying multiple services [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) [08:13:15] (03PS6) 10JMeybohm: kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) [08:13:17] (03PS8) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [08:14:10] !log Restarting CI Jenkins for plugin installation [08:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:45] the jenkins restarts [08:16:27] \o/ [08:19:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [08:26:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, and 2 others: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) @hashar thanks for the write up and the CR's, this all looks quite promising. ill work on getting pcc-worker4 up and running. 
[08:26:10] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42448/console" [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:27:01] (03PS1) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) [08:31:54] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42449/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:36:13] (03PS2) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) [08:37:18] (03CR) 10JMeybohm: [V: 03+1] cfssl::cert: Add support for notifying multiple services (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:38:10] (03PS3) 10Slyngshede: Allow users to update their email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) [08:38:15] (03CR) 10Slyngshede: Allow users to update their email address. (034 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [08:40:34] (03PS2) 10Slyngshede: Credit logo artist. [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) [08:40:49] (03CR) 10Slyngshede: Credit logo artist. 
(034 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [08:41:20] (03CR) 10Slyngshede: [V: 03+2] Forgot username [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 (owner: 10Slyngshede) [08:41:23] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Forgot username [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 (owner: 10Slyngshede) [08:47:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/937588 [08:47:43] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/937588 (owner: 10TrainBranchBot) [08:54:14] (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:56:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:57:47] (03CR) 10JMeybohm: [C: 04-1] "The code assumes that there is always an instance of the confd class. If that's expected we should probably note that in the comments of i" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [09:01:00] (03CR) 10Slyngshede: [C: 03+1] "LGTM, arguments for puppet matches Puppet 7." 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) (owner: 10Jbond) [09:03:25] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: admin.conf on control-plane should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/937434 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:03:30] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] cfssl::cert: Add support for notifying multiple services [puppet] - 10https://gerrit.wikimedia.org/r/937441 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:04:05] !log update NAT on pfw3-eqiad - T340252 [09:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:15] (03PS9) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [09:05:18] (03PS3) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) [09:06:31] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:07:07] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42450/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:09:12] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [09:09:47] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [09:09:57] (03CR) 10Vgutierrez: [C: 03+1] "looking good assuming that the intention is letting caches and UAs to store responses in the cache but not use them without revalidating t" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [09:10:52] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [09:11:01] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:11:37] !log increased kafka partitions for mediawiki.job.cirrusSearchLinksUpdate and mediawiki.job.cirrusSearchLinksUpdate (eqiad/codfw) - T341558 [09:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:40] T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 [09:11:43] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [09:14:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/937588 (owner: 10TrainBranchBot) [09:16:53] 10SRE: Cannot download large files from commons - https://phabricator.wikimedia.org/T341755 (10Peachey88) [09:20:21] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [09:34:22] 10Puppet, 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10fgiunchedi) IMHO between the migration to pki and the fact that we monitor cert expiration directly from probes... 
[09:36:33] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:41:03] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:43:13] (03PS1) 10Effie Mouzeli: thumbor: tweak failure_throttling_memcache variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/937886 [09:45:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "I noticed today we're getting a "lint" error for these alerts, specifically the fact that flink isn't running in k8s codfw, is this expect" [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [09:46:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "The link to the alert: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem" [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [09:47:01] PROBLEM - Disk space on cephosd1001 is CRITICAL: DISK CRITICAL - /var/lib/ceph/osd/ceph-0 is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cephosd1001&var-datasource=eqiad+prometheus/ops [09:48:00] (03PS1) 10Mvolz: Update Zotero in include Polish library [deployment-charts] - 10https://gerrit.wikimedia.org/r/937887 (https://phabricator.wikimedia.org/T340484) [09:51:50] btullis stevemunene see above re: cephosd disk space critical, known? ^ [09:52:38] godog: Many thanks. Yes, pre-prod testing. Apologies for the noise. I will set a silence. 
[09:52:58] sure np btullis, thanks for checking [09:54:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs_lvm: wipe fs signatures when creating logical volume [puppet] - 10https://gerrit.wikimedia.org/r/935418 (https://phabricator.wikimedia.org/T300002) (owner: 10Hashar) [09:54:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs_lvm: add `.sh` extension to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [09:54:28] (03PS4) 10Arturo Borrero Gonzalez: labs_lvm: add `.sh` extension to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/935422 (owner: 10Hashar) [09:54:45] (03PS4) 10Arturo Borrero Gonzalez: labs_lvm: pass shellcheck on scripts [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [09:55:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs_lvm: pass shellcheck on scripts [puppet] - 10https://gerrit.wikimedia.org/r/935423 (owner: 10Hashar) [09:56:05] (03PS2) 10Effie Mouzeli: thumbor: tweak failure_throttling_memcache variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/937886 [09:57:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:59:44] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: tweak failure_throttling_memcache variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/937886 (owner: 10Effie Mouzeli) [10:00:04] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1000). 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1000) [10:00:31] (03Merged) 10jenkins-bot: thumbor: tweak failure_throttling_memcache variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/937886 (owner: 10Effie Mouzeli) [10:02:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:03:30] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:03:52] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:04:26] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:04:53] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:07:44] !log disable puppet on cp3052 and cp5017 to safely monitor https://gerrit.wikimedia.org/r/c/operations/puppet/+/936701 [10:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:03] (03PS2) 10Mvolz: Update Zotero to include Polish library [deployment-charts] - 10https://gerrit.wikimedia.org/r/937887 (https://phabricator.wikimedia.org/T340484) [10:08:10] (03CR) 10Mvolz: [C: 03+2] Update Zotero to include Polish library [deployment-charts] - 10https://gerrit.wikimedia.org/r/937887 (https://phabricator.wikimedia.org/T340484) (owner: 10Mvolz) [10:08:47] (03CR) 10Fabfur: [C: 03+2] hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:08:53] (03Merged) 10jenkins-bot: Update Zotero to include Polish 
library [deployment-charts] - 10https://gerrit.wikimedia.org/r/937887 (https://phabricator.wikimedia.org/T340484) (owner: 10Mvolz) [10:11:31] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [10:12:03] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:12:26] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, and 2 others: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10hashar) For the job being run manually, I have made a copy at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compile... [10:12:52] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [10:13:23] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [10:15:05] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [10:15:36] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [10:19:11] !log puppet enabled on cp3052 and cp5017 and new configuration applied (https://gerrit.wikimedia.org/r/c/operations/puppet/+/936701) [10:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:28] (03PS1) 10Hnowlan: changeprop: bump node-rdkafka, use buster base [deployment-charts] - 10https://gerrit.wikimedia.org/r/937894 (https://phabricator.wikimedia.org/T341140) [10:29:08] (03PS1) 10Effie Mouzeli: thumbor: switch to use port 11213 for mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937896 [10:29:31] Should pybal be restarted on lvs1020? 
[10:29:46] It's been CRITICAL: Services known to PyBal but not to IPVS: set(['208.80.154.242:3316', '208.80.154.242:3317', '208.80.154.242:3314', '208.80.154.242:3315', '208.80.154.242:3312', '208.80.154.242:3313', '208.80.154.242:3311', '208.80.154.243:3315', '208.80.154.243:3314', '208.80.154.243:3317', '208.80.154.243:3316', '208.80.154.243:3311', '208.80.154.242:3318', '208.80.154.243:3312', [10:29:47] '208.80.154.243:3313', '208.80.154.243:3318']) since yesterday [10:30:08] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: switch to use port 11213 for mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937896 (owner: 10Effie Mouzeli) [10:30:50] (03Merged) 10jenkins-bot: thumbor: switch to use port 11213 for mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937896 (owner: 10Effie Mouzeli) [10:33:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:46] claime: hmmm [10:34:04] those are wikireplicas IPs [10:34:21] vgutierrez: I'm finding in this channel's backlog that it was done yesterday for 1018 for btullis [10:34:29] But I think it wasn't done for 1020 [10:34:44] See around 2023-07-12 16:40:07 [10:34:45] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:35:07] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:36:02] * vgutierrez looking [10:36:33] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:36:50] I suspect that it probably should be restarted. We're still looking into the cause of the incident. 
suk.he restarted pybal on lvs1018 for me. [10:37:43] The only thing that I touched was conftool on puppetmaster1001, so I don't yet know what caused pybal and PVS to get out of sync. [10:37:48] LVS [10:41:01] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:42:05] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:42:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05In progress→03Resolved Done, all looks ok. We'll now start preparing for 5% [10:42:49] sorry.. was on a 1:1 [10:42:57] * vgutierrez checking [10:43:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Ladsgroup) >>! In T341463#9012011, @Clement_Goubert wrote: > Done, all looks ok. We'll now start preparing for 5% {meme, src=itshappening} [10:46:23] (03PS1) 10ArielGlenn: add proper partition reuse recipe for dumpsdata1004,5 [puppet] - 10https://gerrit.wikimedia.org/r/937899 (https://phabricator.wikimedia.org/T339929) [10:46:50] jouncebot: !nowandnext [10:47:07] no ! 
[10:47:09] jouncebot: nowandnext [10:47:10] For the next 0 hour(s) and 12 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1000) [10:47:10] For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1000) [10:47:10] In 2 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1300) [10:47:10] In 2 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1300) [10:51:23] taavi: oh, thx [10:51:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:59] 10SRE: Cannot download large files from commons - https://phabricator.wikimedia.org/T341755 (10TheDJ) Eh.. these are 3GB PDFs ??? What in god's name justifies using such large files? That's hours of video! 
[10:53:13] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10ayounsi) [10:53:15] 10SRE, 10Infrastructure-Foundations, 10netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [10:53:21] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (10ayounsi) [10:56:37] claime: it looks like pooling/depooling of dbproxy instances was rather aggressive and IPVS got those VIPs without any backend servers pooled [10:56:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:08] !log restarting pybal on lvs1020 [10:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:37] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Clement_Goubert) [10:57:43] https://www.irccloud.com/pastebin/QUcbBx47/ [10:58:37] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Clement_Goubert) p:05Triage→03High [10:59:03] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:59:17] vgutierrez: <3 [11:04:03] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:04:32] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) 
05In progress→03Resolved After moving a couple group1 wikis, we have decided to go with a global traffic percentage to roll forward. Marking Resolved. [11:05:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:11] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:10:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:25:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1014.eqiad.wmnet [11:29:57] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:30:25] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [11:31:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:31:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1014.eqiad.wmnet [11:32:16] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:32:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:37:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:38:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:38:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:41:12] (03CR) 10Gmodena: [C: 03+2] data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [11:46:55] (03CR) 10Gmodena: [C: 03+2] data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [11:48:11] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05Resolved→03In progress I'm wondering if in the vein of >>! In T290536#8466377, @Ladsgroup wrote: > This is not really user-impacting, spec... [11:48:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:48:33] (03CR) 10Slyngshede: [C: 04-1] pcc: update the parse commit method to support "Change-Private:" footer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [11:52:20] (03CR) 10Slyngshede: [C: 03+1] "LGTM, minor nit on Docstring." 
[puppet] - 10https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [12:00:16] (03PS1) 10Jelto: miscweb: use timestamp in image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/937938 (https://phabricator.wikimedia.org/T300171) [12:05:40] (03PS1) 10Effie Mouzeli: thumbor: switch production to use mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937939 [12:10:48] (03PS8) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [12:11:25] (03CR) 10CI reject: [V: 04-1] nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [12:11:59] (03CR) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [12:13:20] (03PS1) 10David Caro: labs_lvm: fix if condition [puppet] - 10https://gerrit.wikimedia.org/r/937941 [12:13:34] (03PS2) 10Effie Mouzeli: thumbor: switch production to use mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937939 (https://phabricator.wikimedia.org/T318695) [12:16:41] (03CR) 10David Caro: "Tested on toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/937941 (owner: 10David Caro) [12:23:11] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) Depending on how the Gitlab OIDC pulls information we might have to change: userinfo_endpoint in client config opt... 
[12:31:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/937941 (owner: 10David Caro) [12:31:15] (03CR) 10David Caro: [C: 03+2] labs_lvm: fix if condition [puppet] - 10https://gerrit.wikimedia.org/r/937941 (owner: 10David Caro) [12:36:50] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:41:22] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:41:43] (03PS1) 10Jelto: gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) [12:46:02] jouncebot: now [12:46:02] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [12:46:13] I’ll start a maintenance script now, it’ll take probably days anyways [12:47:19] !log Start `mwscript DiscussionTools:persistRevisionThreadItems ruwiki --current --all --start '["10086120"]'; touch ~/T315510-ruwiki-exited-$?` in tmux on mwmaint1002 (T315510) [12:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:23] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [12:48:12] (03CR) 10Slyngshede: gitlab: set userinfo_endpoint in client_options: (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:48:22] (03CR) 10Slyngshede: [C: 04-1] gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:53:46] (03PS2) 10Jelto: 
gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) [12:53:49] (03CR) 10Elukey: "Shall we change staging-only first? To do some tests, and avoid to pick up the new images in case of emergency rollouts etc.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/937894 (https://phabricator.wikimedia.org/T341140) (owner: 10Hnowlan) [12:54:20] (03CR) 10CI reject: [V: 04-1] gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:54:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:56:37] (03PS3) 10Jelto: gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) [12:58:00] (03CR) 10Jelto: gitlab: set userinfo_endpoint in client_options: (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:58:17] (03PS9) 10Jbond: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [12:58:55] (03CR) 10CI reject: [V: 04-1] nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: How many deployers does it take to do UTC afternoon backport window deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1300). [13:00:04] aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] (03CR) 10Slyngshede: "From re-reading the docs I think /oidcProfile should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:00:20] I haven’t had lunch yet, so I hope someone else can deploy, sorry [13:00:31] i can deploy today [13:00:34] enjoy your lunch Lucas_WMDE [13:00:37] thanks! [13:00:56] o/ [13:02:07] (03PS4) 10Jelto: gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) [13:02:34] aanzx: what should be the canonical name of the namespaces? should they be called `Reconstruction` and `Rhymes`, with mn-language aliases? or should mn language be the canonical name, and english as an alias? [13:02:49] Yes [13:03:00] (03CR) 10Jelto: gitlab: set userinfo_endpoint in client_options: (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:03:22] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:04:07] Since previously was done like that for appendix namespace so I did it like that @urbanecm [13:04:41] (03PS1) 10DCausse: Link to new repo to build docker dev image [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/937949 [13:05:01] fair enough. this is the opposite way to the mediawiki core namespaces (like `Wiktionary` or `Help`), but well, it's done for the other extra namespace, so why not here. 
[13:05:06] (03CR) 10Jelto: [C: 03+2] gitlab: set userinfo_endpoint in client_options: [puppet] - 10https://gerrit.wikimedia.org/r/937945 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:05:08] (03PS6) 10Urbanecm: Create Reconstruction and Rhymes namespaces in mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937468 (https://phabricator.wikimedia.org/T330689) (owner: 10Anzx) [13:05:12] (03CR) 10Urbanecm: [C: 03+2] Create Reconstruction and Rhymes namespaces in mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937468 (https://phabricator.wikimedia.org/T330689) (owner: 10Anzx) [13:05:51] (03Merged) 10jenkins-bot: Create Reconstruction and Rhymes namespaces in mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937468 (https://phabricator.wikimedia.org/T330689) (owner: 10Anzx) [13:08:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:937468|Create Reconstruction and Rhymes namespaces in mnwwiktionary (T330689)]] [13:08:09] T330689: Create Reconstruction and Rhymes namespaces in Mon Wiktionary - https://phabricator.wikimedia.org/T330689 [13:09:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:09:49] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/937510 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:09:51] !log urbanecm@deploy1002 anzx and urbanecm: Backport for [[gerrit:937468|Create Reconstruction and Rhymes namespaces in mnwwiktionary (T330689)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:10:00] aanzx: your patch is at mwdebug1001, can you test please? 
[13:10:08] Ok [13:11:36] working can you run namespaceDupes.php [13:11:48] will do, once it is synced [13:11:55] proceeding [13:17:16] (03PS1) 10Hashar: Rakefile: add tasks to run a global shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/937952 [13:17:48] (03CR) 10CI reject: [V: 04-1] Rakefile: add tasks to run a global shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/937952 (owner: 10Hashar) [13:17:52] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:937468|Create Reconstruction and Rhymes namespaces in mnwwiktionary (T330689)]] (duration: 09m 46s) [13:17:55] T330689: Create Reconstruction and Rhymes namespaces in Mon Wiktionary - https://phabricator.wikimedia.org/T330689 [13:19:05] !log Run `mwscript namespaceDupes.php --wiki=mnwwiktionary --fix` (T330689) [13:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:17] (03PS1) 10Filippo Giunchedi: data-engineering: ignore series not found for MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/937953 [13:19:47] (03CR) 10Filippo Giunchedi: "As per discussion in I33bbd94e184c, should be safe to ignore lint/pint checks" [alerts] - 10https://gerrit.wikimedia.org/r/937953 (owner: 10Filippo Giunchedi) [13:19:58] aanzx: there is a couple of conflicting pages, please review and delete/move as needed https://www.irccloud.com/pastebin/pvuJ0A0l/ [13:21:16] ok i will let admin know on phabricator task , thanks urbanecm [13:21:21] no problem [13:21:39] !log UTC afternoon B&C window done [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:01] (03PS2) 10Hashar: Rakefile: add tasks to run a global shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/937952 [13:29:00] (03CR) 10Elukey: "Tried to add some comments :)" [puppet] - 10https://gerrit.wikimedia.org/r/937899 (https://phabricator.wikimedia.org/T339929) (owner: 10ArielGlenn) [13:31:01] (03CR) 10Hashar: "The rationale is `rake shellcheck` 
is only possible when the current commits affects some '**/*.sh' files and it would only run against th" [puppet] - 10https://gerrit.wikimedia.org/r/937952 (owner: 10Hashar) [13:32:00] RECOVERY - Disk space on cephosd1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cephosd1001&var-datasource=eqiad+prometheus/ops [13:34:40] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2022/2023-Q4): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) [13:35:21] (03PS1) 10Func: Avoid calling wfMessage in the hook handler constructor [extensions/Wikistories] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937913 (https://phabricator.wikimedia.org/T339272) [13:36:36] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:41:06] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:41:22] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) Recap after the latest chat with @volans: * log messages follow this path: cookbo... 
[13:41:47] (03PS5) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [13:43:55] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [13:44:10] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42452/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [13:44:11] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) 05Open→03Resolved checked status this morning. no new errors after DIMM swap. [13:48:25] (03PS6) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [13:48:44] (03CR) 10Bking: [C: 03+1] Change analytics_test airflow to use an-test-client1002 [puppet] - 10https://gerrit.wikimedia.org/r/937577 (https://phabricator.wikimedia.org/T341700) (owner: 10Stevemunene) [13:50:35] (03CR) 10CI reject: [V: 04-1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [13:50:54] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42453/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [13:51:07] (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: ignore series not found for MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/937953 (owner: 10Filippo Giunchedi) [13:55:23] (03PS1) 
10Vgutierrez: trafficserver: Drop T255368 workaround [puppet] - 10https://gerrit.wikimedia.org/r/937955 (https://phabricator.wikimedia.org/T255368) [13:56:36] (03CR) 10Hnowlan: [C: 03+1] thumbor: switch production to use mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937939 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [14:01:32] (03CR) 10Btullis: "This change is ready for review." (0318 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [14:03:30] (03CR) 10MVernon: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732) (owner: 10Eevans) [14:05:41] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: switch production to use mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937939 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [14:05:58] (03PS1) 10JMeybohm: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) [14:06:00] (03PS1) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [14:06:02] (03PS1) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) [14:06:04] (03PS1) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [14:06:25] (03Merged) 10jenkins-bot: thumbor: switch production to use mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/937939 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [14:06:29] (03CR) 10CI reject: [V: 04-1] Testing hack: Override envoy entrypoint [deployment-charts] - 
10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:06:31] (03CR) 10CI reject: [V: 04-1] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:06:36] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:06:43] (03CR) 10CI reject: [V: 04-1] Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:06:59] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42455/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [14:08:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:06] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:14:12] (03PS3) 10David Caro: add fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/928477 (owner: 10Majavah) [14:16:22] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:18:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:12] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:26] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:26:14] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:28:31] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:28:46] (03PS1) 10Jelto: Revert "gitlab: set userinfo_endpoint in client_options:" [puppet] - 10https://gerrit.wikimedia.org/r/937917 [14:28:58] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:29:05] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:29:32] (03CR) 10Slyngshede: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/937917 (owner: 10Jelto) [14:30:04] (03CR) 10Jelto: [C: 03+2] Revert "gitlab: set userinfo_endpoint in client_options:" [puppet] - 10https://gerrit.wikimedia.org/r/937917 (owner: 10Jelto) [14:30:04] kindrobot, James_F, and apine: I, the Bot under the Fountain, call upon thee, The Deployer, to do Abstract Wikipedia one-off staging maintenance deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1430). [14:30:14] Heya. [14:30:40] (We're only going to be mucking around in k8s staging, so nothing should go wrong… go wrong… go wrong… ;-)) [14:31:30] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) We didn't find a solution yet, but I'll spend some time looking into the CAS side of things tomorrow. 
[14:32:01] (03CR) 10Hnowlan: api-gateway: emit no-cache unless otherwise asked (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [14:32:05] (03CR) 10Hnowlan: [C: 03+2] api-gateway: emit no-cache unless otherwise asked [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [14:32:15] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:32:20] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:32:52] (03Merged) 10jenkins-bot: api-gateway: emit no-cache unless otherwise asked [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [14:33:57] (03CR) 10Stevemunene: [C: 03+2] Change analytics_test airflow to use an-test-client1002 [puppet] - 10https://gerrit.wikimedia.org/r/937577 (https://phabricator.wikimedia.org/T341700) (owner: 10Stevemunene) [14:35:36] 10ops-codfw, 10Machine-Learning-Team: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10RhinosF1) [14:35:44] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:42] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:37:19] (03PS1) 10Hnowlan: api-gateway: respond for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/937965 [14:41:14] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:42:25] (03PS1) 10Majavah: P:toolforge:prometheus: scrape maintain-kubeusers metrics [puppet] - 10https://gerrit.wikimedia.org/r/937966 [14:42:48] (03CR) 10CI reject: [V: 04-1] P:toolforge:prometheus: scrape maintain-kubeusers metrics [puppet] - 10https://gerrit.wikimedia.org/r/937966 (owner: 10Majavah) [14:43:07] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ores2003.codfw.wmnet with reason: DCops working on it [14:43:08] (03PS2) 10Majavah: P:toolforge:prometheus: scrape maintain-kubeusers metrics [puppet] - 10https://gerrit.wikimedia.org/r/937966 [14:43:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ores2003.codfw.wmnet with reason: DCops working on it [14:43:22] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:55] !log depool ores2003 to allow DCops maintenance work [14:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42457/console" [puppet] - 10https://gerrit.wikimedia.org/r/937966 (owner: 10Majavah) [14:44:01] cc: klausman: --^ [14:44:39] thx! 
[14:44:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:45:01] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:45:24] 10ops-codfw, 10Machine-Learning-Team: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10elukey) [14:48:22] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:18] (03CR) 10Dzahn: [C: 03+1] miscweb: use timestamp in image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/937938 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [14:54:21] (03PS2) 10JMeybohm: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) [14:54:23] (03PS2) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [14:54:25] (03PS2) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) [14:54:28] (03PS2) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [14:55:04] (03CR) 10CI reject: [V: 04-1] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:55:15] (03CR) 10CI reject: [V: 04-1] Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) 
[14:55:36] (03CR) 10CI reject: [V: 04-1] Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:57:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Andrew) Hello folks, sorry for arriving late to this ticket. I'm not sure if this will be useful or not, but here's some context about what we'll b... [15:00:09] (03PS10) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [15:01:37] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10MatthewVernon) It might not be possible, but if we could end up with the `thanos-fe*` nodes running `swift::*`classes that would be nice. It feels like... 
[15:07:02] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:10:04] (03PS3) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [15:10:06] (03PS3) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) [15:10:08] (03PS3) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [15:10:39] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Arnoldokoth) @Jelto @SLyngshede-WMF I believe my login works now. I did sign in to the replica and I can see `Signed in with ope... 
[15:11:13] (03CR) 10CI reject: [V: 04-1] Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:11:15] (03CR) 10CI reject: [V: 04-1] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:11:17] (03CR) 10CI reject: [V: 04-1] Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:11:32] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:13:27] (03PS1) 10Jforrester: wikifunctions: Specify Envoy URL and use image with Head: [deployment-charts] - 10https://gerrit.wikimedia.org/r/937969 (https://phabricator.wikimedia.org/T297314) [15:14:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:14:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:14:34] Hi parsoid [15:15:03] same as this morning? 
[15:15:07] in a meeting right now [15:15:08] Probably [15:15:10] checking [15:15:15] let me know if you need me [15:15:21] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Specify Envoy URL and use image with Head: [deployment-charts] - 10https://gerrit.wikimedia.org/r/937969 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [15:15:22] sure [15:16:24] (03Merged) 10jenkins-bot: wikifunctions: Specify Envoy URL and use image with Head: [deployment-charts] - 10https://gerrit.wikimedia.org/r/937969 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [15:16:29] Looks like an undelete page rise [15:16:53] claime: how are you monitoring? [15:17:11] jhathaway: parsoid idle workers https://grafana.wikimedia.org/goto/E2yoYOCVk?orgId=1 [15:17:18] thanks [15:17:34] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:17:42] jhathaway: https://grafana-rw.wikimedia.org/d/t_x3DEu4k/parsoid-health [15:18:04] There in Parser Cache usage you can see the reasons for parsoid ParserCache writes [15:18:28] coincides perfectly with the rise in active workers [15:18:34] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:18:36] thanks I see the bump [15:19:09] Checking changeprop [15:19:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:20:25] eqiad.resource-purge and eqiad.changeprop.transcludes.resource-change spikes seem to match too [15:21:30] It's probably going to juke around 30% for a bit [15:23:22] (03CR) 10Jbond: "LGTM: some minor nits niline" [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [15:25:01] claime: do we have any actions to take, my understanding from back scroll is that this is due to having fewer parsoid servers than we did yesterday? [15:25:09] possibly yes [15:25:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts sretest2002 [15:25:15] Not yesterday [15:25:17] A while ago [15:25:31] ah [15:28:28] (03PS1) 10Jforrester: [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) [15:28:57] (03CR) 10Jforrester: "Is this going in the right direction for this?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [15:29:07] (03CR) 10CI reject: [V: 04-1] [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [15:29:56] looks like we halved it at one point, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462 [15:29:57] 10ops-codfw, 10Machine-Learning-Team: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10Jhancock.wm) found the actual mgmt port to be loose. had server deposed and inspected. bracket that was holding idrac card has broken. 
tried reseating but port mgmt port goes down shortly after booting up.... [15:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:30:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:32:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2002 decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [15:33:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2002 decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [15:33:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest2002 [15:33:44] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by pt1979@cumin2002 for hosts: `sretest2002` - sretest2002 (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physical host - //Managem... 
[15:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:36:58] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:38:27] (03CR) 10David Caro: [C: 03+2] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/937966 (owner: 10Majavah) [15:41:33] (03CR) 10Hnowlan: [C: 03+2] api-gateway: respond for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/937965 (owner: 10Hnowlan) [15:42:24] (03Merged) 10jenkins-bot: api-gateway: respond for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/937965 (owner: 10Hnowlan) [15:43:16] PROBLEM - Host ores2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:45] (03PS1) 10Jbond: idm::jobs: add spec test [puppet] - 10https://gerrit.wikimedia.org/r/937977 [15:47:47] (03PS1) 10Jbond: monkey_path: fix up monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/937978 [15:47:49] (03PS1) 10Jbond: profile::cassandra: Add spec test [puppet] - 10https://gerrit.wikimedia.org/r/937979 [15:48:22] (03PS11) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [15:49:42] (03CR) 10Jbond: [C: 03+2] idm::jobs: add spec test [puppet] - 10https://gerrit.wikimedia.org/r/937977 (owner: 10Jbond) [15:50:18] (03PS12) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 
(https://phabricator.wikimedia.org/T336497) [15:51:19] (03PS2) 10Jbond: monkey_patch: fix up monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/937978 [15:54:24] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:54:58] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:56:04] (03CR) 10JHathaway: "looks good, though ruby argument handling is almost turing complete," [puppet] - 10https://gerrit.wikimedia.org/r/937978 (owner: 10Jbond) [15:56:20] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:56:45] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:00:05] jbond and rzl: #bothumor I ♥ Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:17] (03PS2) 10Hnowlan: changeprop: bump node-rdkafka, use buster base [deployment-charts] - 10https://gerrit.wikimedia.org/r/937894 (https://phabricator.wikimedia.org/T341140) [16:05:10] (03PS1) 10Ilias Sarantopoulos: ores-legacy: set logging level to DEBUG [deployment-charts] - 10https://gerrit.wikimedia.org/r/937983 [16:06:44] (03PS13) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [16:08:31] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] nftables: spec: introduce service tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [16:09:31] (03CR) 10Elukey: [C: 03+1] ores-legacy: set logging level to DEBUG [deployment-charts] - 10https://gerrit.wikimedia.org/r/937983 (owner: 10Ilias Sarantopoulos) [16:10:35] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ores-legacy: set logging level to DEBUG [deployment-charts] - 
10https://gerrit.wikimedia.org/r/937983 (owner: 10Ilias Sarantopoulos) [16:11:15] (03Merged) 10jenkins-bot: ores-legacy: set logging level to DEBUG [deployment-charts] - 10https://gerrit.wikimedia.org/r/937983 (owner: 10Ilias Sarantopoulos) [16:11:24] (03PS1) 10Hnowlan: images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) [16:13:06] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#9011559, @ayounsi wrote: > The last host connected to asw-b1-codfw (the prod switch) is cloudweb... [16:13:29] (03CR) 10Elukey: [C: 03+1] "<3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937894 (https://phabricator.wikimedia.org/T341140) (owner: 10Hnowlan) [16:13:42] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [16:14:36] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [16:15:05] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [16:18:03] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10KFrancis) HI all, I am confirming the NDA is complete. Please proceed with the access request. 
[16:30:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) Thanks @KFrancis appreciate that :) [16:33:27] (03PS2) 10Hnowlan: images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) [16:33:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) @Ifrahkhanyaree apologies can you confirm your email address? I'll add the account then just missing that detail. [16:34:17] (03CR) 10CI reject: [V: 04-1] images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [16:35:21] (03PS3) 10Hnowlan: images: add debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/937985 (https://phabricator.wikimedia.org/T341805) [16:36:51] (03CR) 10Hnowlan: [C: 03+2] changeprop: bump node-rdkafka, use buster base [deployment-charts] - 10https://gerrit.wikimedia.org/r/937894 (https://phabricator.wikimedia.org/T341140) (owner: 10Hnowlan) [16:37:10] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10cmooney) [16:37:18] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1014.eqiad.wmnet - https://phabricator.wikimedia.org/T341782 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [16:37:35] (03Merged) 10jenkins-bot: changeprop: bump node-rdkafka, use buster base [deployment-charts] - 10https://gerrit.wikimedia.org/r/937894 (https://phabricator.wikimedia.org/T341140) (owner: 10Hnowlan) [16:37:59] 10ops-codfw, 10decommission-hardware: decommission krb2001.codfw.wmnet - https://phabricator.wikimedia.org/T340433 (10wiki_willy) [16:38:32] (03PS1) 10Cathal Mooney: Add 
aklapper to release-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/937988 (https://phabricator.wikimedia.org/T341749) [16:38:37] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frpig1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340128 (10wiki_willy) [16:40:15] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:40:56] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:45:58] (03CR) 10Dzahn: [C: 03+1] Add aklapper to release-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/937988 (https://phabricator.wikimedia.org/T341749) (owner: 10Cathal Mooney) [16:47:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10Dzahn) We talked about this in meetings yesterday and today. Andre should get phabricator root. And this will be resolve... 
[16:48:37] (03CR) 10Dzahn: [C: 03+1] "optional: you can remove from phabricator-admin because that is just a subset of phabricator-roots and with this change becomes kind of me" [puppet] - 10https://gerrit.wikimedia.org/r/937988 (https://phabricator.wikimedia.org/T341749) (owner: 10Cathal Mooney) [16:49:05] (03CR) 10Dzahn: [C: 03+1] "disregard last comment, just +1 as it is now :)" [puppet] - 10https://gerrit.wikimedia.org/r/937988 (https://phabricator.wikimedia.org/T341749) (owner: 10Cathal Mooney) [16:54:14] (03CR) 10Cathal Mooney: [C: 03+2] Add aklapper to release-engineering group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937988 (https://phabricator.wikimedia.org/T341749) (owner: 10Cathal Mooney) [16:58:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10cmooney) Path to add to "release-engineering" group directly is now merged. @dzahn is there anything else I should do h... [17:00:04] bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1700). 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1700) [17:00:30] * bd808 looks at gerrit to see what might be ready to ship [17:03:10] (03PS1) 10Ssingh: dnsrecursor: allow configuring the webserver loglevel [puppet] - 10https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) [17:05:12] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42458/console" [puppet] - 10https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) (owner: 10Ssingh) [17:23:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:28:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:33:58] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10Ifrahkhanyaree) It's ifrah.khanyaree@wikimedia.de [18:00:05] dduvall and hashar: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T1800). [18:06:17] !log milimetric@deploy1002 Started deploy [analytics/aqs/deploy@91f8d92]: Deploying new AQS endpoint [18:06:23] !log milimetric@deploy1002 Finished deploy [analytics/aqs/deploy@91f8d92]: Deploying new AQS endpoint (duration: 00m 05s) [18:07:00] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10Dzahn) @Aklapper Feel free to try ssh to these hosts now. phab1004.eqiad.wmnet is prod phab, phab-test1001.eqiad.wmnet is the test machine, ph... [18:08:14] !log milimetric@deploy1002 Started deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint [18:08:25] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney I am ok moving the server when it is ready. We can move it to B5/U20 ge-5/0/15. Also I see that asw-b1... [18:09:02] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10Dzahn) 05Open→03Resolved a:03cmooney @cmooney Thanks! I confirm access should work now. 
No need to keep it open, looks all resolved to m... [18:11:54] !log milimetric@deploy1002 Finished deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint (duration: 03m 39s) [18:15:35] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10Dzahn) @Aklapper This came with a bunch of new access for you. Have fun with: - deployment-docker - contint-docker - gerrit-root - deployment... [18:22:50] Hello beautiful release engineer people!!! I hope you are all doing well :) We had a patch merged on dev-images earlier today that we're keen to get published to the wmf docker registry and I'm wondering do people usually write a phab task for that or just try to charm designated contint group members on IRC? (hence the attempted charm at the beginning) [18:23:01] oh no. wrong channel [18:23:09] sorry [18:23:22] we're beautiful too tho! [18:23:27] yes you are! [18:24:32] I am not familiar with the specifics of dev-images..but .. in other cases where I needed stuff on the docker-registry it would be neither task nor pinging people [18:24:54] it would be the CI pipeline that builds and publishes the image to the registry [18:25:35] so basically if you add the right Blubber and Kokkuri config files in your repo.. that you can copy from others.. then you automagically get images on the registry [18:26:54] an example would be the ".gitlab-ci.yml" and ".pipeline" dir in this repo: https://gitlab.wikimedia.org/repos/sre/miscweb/statictendril [18:26:59] mutante: so I was hoping it was an automatic post-merge event but after working with hashar last week on getting another image published, it seems like there's a manual step required before it's available via the registry [18:27:47] *nod*, I see.
I can only confirm it works without a manual step in other repos [18:31:37] (03PS1) 10Gmodena: blubber: add buildkit syntax directive [alerts] - 10https://gerrit.wikimedia.org/r/937998 [18:31:43] ah ok. that would be great actually. I just checked the releng log to confirm the manual step and can see "11:59 hashar: Successfully published image docker-registry.discovery.wmnet/dev/fundraising-civiproxy-buster-php73-apache2:0.0.1-1-s4" from the 5th of this month https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:32:14] I'll ponder the question why dev-images is not automatically published [18:33:17] jgleeson: looks like it is lacking the config for a pipeline [18:33:42] but there are probably other things involved [18:33:48] that might actually justify going the ticket route [18:34:04] also simply because of the time in Europe [18:34:40] but I think you should be able to copy the files from the example and in theory this is all self-service now, as opposed to CI before [18:35:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:36:11] ty mutante !
[18:38:27] (03CR) 10Gmodena: "Hey Filippo," [alerts] - 10https://gerrit.wikimedia.org/r/937998 (owner: 10Gmodena) [18:40:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:21] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937999 (https://phabricator.wikimedia.org/T340245) [18:42:23] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937999 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [18:43:03] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937999 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:49:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:27] (03PS1) 10Bking: zookeeper: prepare for new zk cluster [puppet] 
- 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) [18:57:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [18:59:47] (03CR) 10Dzahn: [C: 03+1] "once this is deployed, you can click resolve on https://phabricator.wikimedia.org/T338071" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [19:00:40] (03CR) 10Dzahn: [C: 03+2] miscweb: use timestamp in image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/937938 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [19:05:24] (03PS1) 10Effie Mouzeli: thumbor: Bye bye nutcracker! [deployment-charts] - 10https://gerrit.wikimedia.org/r/938001 (https://phabricator.wikimedia.org/T318695) [19:13:38] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Reedy) [19:14:36] (03PS1) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [19:17:28] (03CR) 10Fabfur: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42459/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [19:20:06] (03CR) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [19:21:28] (03PS2) 10Bking: zookeeper: prepare for new zk cluster [puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) [19:21:47] (03PS8) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [19:22:36] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 
NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42460/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [19:24:22] (03CR) 10CI reject: [V: 04-1] roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [19:25:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [19:25:38] PROBLEM - Host sretest2002 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:39] (03PS2) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [19:26:50] (03CR) 10Ssingh: [C: 03+1] "Will merge tomorrow unless you want to do it before that." [puppet] - 10https://gerrit.wikimedia.org/r/934634 (owner: 10Dzahn) [19:27:07] (03CR) 10Fabfur: "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42461/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [19:28:29] (03CR) 10Fabfur: [V: 03+1 C: 04-1] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [19:32:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) got the firmware for all the servers updated for everything (bios,idrac,network,sas) last night. finding time and connectivity to make sure all the bios se... 
[19:52:23] (03CR) 10Dzahn: "yea, tomorrow please" [puppet] - 10https://gerrit.wikimedia.org/r/934634 (owner: 10Dzahn) [19:52:38] 10SRE, 10ops-codfw, 10decommission-hardware: decommission krb2001.codfw.wmnet - https://phabricator.wikimedia.org/T340433 (10Papaul) @Jhancock.wm can you take over this? thanks [20:00:05] brennen and TheresNoTime: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230713T2000). [20:00:05] arlolra and Func: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] o/ [20:05:28] I can deploy [20:06:19] arlolra: ping [20:06:24] hi [20:06:48] my patch is in preparation for next week's train, there's nothing to test [20:07:07] (03PS2) 10Majavah: Set default for UseLegacyMediaStyles and disable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: 10Arlolra) [20:07:15] ack [20:07:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: 10Arlolra) [20:08:33] (03Merged) 10jenkins-bot: Set default for UseLegacyMediaStyles and disable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: 10Arlolra) [20:08:52] !log taavi@deploy1002 Started scap: Backport for [[gerrit:937544|Set default for UseLegacyMediaStyles and disable on officewiki (T318433)]] [20:08:56] T318433: Templates (and extensions) that mimic parser media output need migration to new structure - https://phabricator.wikimedia.org/T318433 [20:09:33] (03CR) 10Majavah: [C: 03+2] "in preparation for a backport" [extensions/Wikistories] (wmf/1.41.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/937913 (https://phabricator.wikimedia.org/T339272) (owner: 10Func) [20:10:19] !log taavi@deploy1002 taavi and arlolra: Backport for [[gerrit:937544|Set default for UseLegacyMediaStyles and disable on officewiki (T318433)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:10:33] syncing [20:14:41] (03Merged) 10jenkins-bot: Avoid calling wfMessage in the hook handler constructor [extensions/Wikistories] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937913 (https://phabricator.wikimedia.org/T339272) (owner: 10Func) [20:16:39] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:937544|Set default for UseLegacyMediaStyles and disable on officewiki (T318433)]] (duration: 07m 47s) [20:16:44] T318433: Templates (and extensions) that mimic parser media output need migration to new structure - https://phabricator.wikimedia.org/T318433 [20:17:28] !log taavi@deploy1002 Started scap: Backport for [[gerrit:937913|Avoid calling wfMessage in the hook handler constructor (T339272)]] [20:17:31] T339272: The display language in idwiki does not change even if a user has changed it - https://phabricator.wikimedia.org/T339272 [20:17:58] taavi: thank you [20:18:30] arlolra: you're welcome [20:19:03] !log taavi@deploy1002 func and taavi: Backport for [[gerrit:937913|Avoid calling wfMessage in the hook handler constructor (T339272)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:19:15] testing... [20:20:09] taavi: looks good [20:20:14] thx, syncing [20:20:40] thanks [20:22:15] !log taavi@deploy1002 Started scap: Backport for [[gerrit:937913|Avoid calling wfMessage in the hook handler constructor (T339272)]] [20:22:15] I did a ctrl+c in the wrong window. 
I hope I didn't break anything, retrying the deployment :/ [20:23:41] !log taavi@deploy1002 func and taavi: Backport for [[gerrit:937913|Avoid calling wfMessage in the hook handler constructor (T339272)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:23:44] T339272: The display language in idwiki does not change even if a user has changed it - https://phabricator.wikimedia.org/T339272 [20:28:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:29:53] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:937913|Avoid calling wfMessage in the hook handler constructor (T339272)]] (duration: 07m 38s) [20:29:57] T339272: The display language in idwiki does not change even if a user has changed it - https://phabricator.wikimedia.org/T339272 [20:30:01] ok, done [20:33:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:37:25] (03PS3) 10Bking: zookeeper: prepare for new zk cluster [puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) [20:37:26] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@889e13f]: (no justification provided) [20:37:50] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@889e13f]: (no justification provided) (duration: 00m 23s) [20:38:00] (03PS3) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [20:40:09] (03CR) 
10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:40:54] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42462/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [20:46:03] (03PS4) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [20:46:28] (03CR) 10CI reject: [V: 04-1] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [20:46:33] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) Here is the LDAPAuthentication debug log for an account creation where I could reproduce this: P49562 I think the main interesting t... [20:48:05] (03PS5) 10Fabfur: hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) [20:49:27] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42465/console" [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [20:49:58] (03CR) 10Dzahn: [C: 03+1] "looks good to me. moving the templates to profile makes the zookeeper module more universal so it can be used by other new profiles." 
[puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:57:31] (03CR) 10Fabfur: [V: 03+1 C: 04-1] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [20:59:43] !log bking@cumin1001 'disable puppet on hosts using zookeeper class T341792' [20:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:48] T341792: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 [21:00:56] (03CR) 10Bking: [C: 03+2] zookeeper: prepare for new zk cluster [puppet] - 10https://gerrit.wikimedia.org/r/938000 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:03:46] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) That something is `LdapAuthenticationPlugin::updateExternalDB()` most likely, which seems to be called just a few rows below. The onl... [21:22:06] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10bd808) >>! In T339917#9014178, @taavi wrote: > 4. Some other extension runs some other hook which saves user preferences. That causes `LdapA... [21:29:39] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Quiddity) Thanks for the draft, appreciated! I've [[https://meta.wikimedia.org/wiki/Tech/News/2023/29#Tech_News:_2023-29|added this to Tech News]]. (The only major... 
[21:39:58] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) And I've missed something: since these are "auto-creations", the `LdapPrimaryAuthenticationProvider::onLocalUserCreated()` hook actua... [21:42:34] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki) [21:43:10] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) The preferences for the test account are all coming from a [[ https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaMessages/+... [22:19:37] (03PS2) 10Kimberly Sarabia: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) [22:35:00] (03PS9) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [22:36:48] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2014 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:37:20] (03CR) 10CI reject: [V: 04-1] roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [22:37:49] 10SRE, 10Collection, 10Release-Engineering-Team, 10Traffic, and 3 others: Strange error pattern noticed on viwiki during unrelated deploy - https://phabricator.wikimedia.org/T340850 (10Krinkle) 05Open→03Resolved a:03Reedy No entries in WMF Logstash for this after July 1st. 
{F37138383 height=200} [22:38:46] 10SRE, 10Collection, 10Release-Engineering-Team, 10Traffic, and 3 others: Strange error pattern noticed on viwiki during unrelated deploy - https://phabricator.wikimedia.org/T340850 (10Krinkle) [22:53:22] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:17:37] (03PS1) 10Dduvall: buildkitd: Fix gckeepstorage units [puppet] - 10https://gerrit.wikimedia.org/r/938016 (https://phabricator.wikimedia.org/T340887) [23:49:30] (03PS1) 10BryanDavis: striker: Bump container version to 2023-07-13-234503-production [puppet] - 10https://gerrit.wikimedia.org/r/938021 [23:50:40] (03CR) 10BryanDavis: "No rush to deploy. The included fixes are only cosmetic." [puppet] - 10https://gerrit.wikimedia.org/r/938021 (owner: 10BryanDavis)