[00:00:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on contint1002.wikimedia.org with reason: maintenance
[00:00:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on contint1002.wikimedia.org with reason: maintenance
[00:00:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on contint2002.wikimedia.org with reason: maintenance
[00:00:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on contint2002.wikimedia.org with reason: maintenance
[00:04:06] (ProbeDown) resolved: (3) Service releases1002:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:55] (PS11) Ejegg: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308)
[00:16:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on contint2001.wikimedia.org with reason: maintenance
[00:16:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on contint2001.wikimedia.org with reason: maintenance
[00:29:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on contint2001.wikimedia.org with reason: maintenance
[00:29:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on contint2001.wikimedia.org with reason: maintenance
[00:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:44:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
for 0:30:00 on contint2001.wikimedia.org with reason: maintenance
[00:44:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on contint2001.wikimedia.org with reason: maintenance
[00:45:39] !log short maintenance on main contint server (jenkins)
[00:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:10:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on contint2001.wikimedia.org with reason: maintenance
[01:10:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on contint2001.wikimedia.org with reason: maintenance
[01:19:30] !log contint2001 - jenkins started again
[01:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:11] (CR) Dzahn: [C: +2] "[contint1002:/tmp] $ id jenkins" [puppet] - https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[01:30:52] (CR) Dzahn: [C: +2] "finally contint2001: exactly as suggested..expectedly took a long time because /srv/jenkins is huge here but only here."
[puppet] - https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[01:31:43] (CR) Dzahn: [C: +2] jenkins: switch to fixed uid/gid 924 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[01:35:14] SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) after carefully deploying the patch above to change jenkins UID/GID, following the instructions, changing file ownership...
[01:41:42] SRE, Infrastructure-Foundations, vm-requests: eqiad+codfw: 1 VM each site request for releases.wikimedia.org - https://phabricator.wikimedia.org/T337349 (Dzahn) Open→Resolved a:Dzahn Thank you. The VMs are already created. This was just for the paper trail. ` [ganeti2021:~] $ sudo gnt-i...
[01:42:00] SRE, Infrastructure-Foundations, vm-requests: eqiad+codfw: 1 VM each site request for releases.wikimedia.org - https://phabricator.wikimedia.org/T337349 (Dzahn) a:Dzahn→eoghan
[01:42:56] SRE, Infrastructure-Foundations, vm-requests: eqiad+codfw: 1 VM each site request for releases.wikimedia.org - https://phabricator.wikimedia.org/T337349 (Dzahn)
[01:43:32] SRE, Infrastructure-Foundations, vm-requests: eqiad+codfw: 1 VM each site request for releases.wikimedia.org - https://phabricator.wikimedia.org/T337349 (Dzahn) releases1002/releases2002 will be decom'ed, so no change in overall cluster usage, except an overlap period
[02:06:30] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:44] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:05] (CR) Giuseppe Lavagetto: [C: +1] "LGTM :) Let's merge it and build it."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/922142 (https://phabricator.wikimedia.org/T336658) (owner: Kamila Součková)
[05:08:25] (PS2) Giuseppe Lavagetto: Patch helm defaults in helmfile during CI tests [deployment-charts] - https://gerrit.wikimedia.org/r/922563
[05:08:27] (PS1) Giuseppe Lavagetto: Execute all tests if CI changes [deployment-charts] - https://gerrit.wikimedia.org/r/922639
[05:14:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 136106
[05:16:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 136106
[05:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:45:43] SRE, SRE-tools, Infrastructure-Foundations, netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (ayounsi) I worked around the issue by disabling "dhcp-relay" on cr2-eqiad `install1004:~$ sudo tcpdump -i ens13 "host 10.65.0.1"` is the easiest way to dete...
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T0600)
[06:15:40] (PS1) Marostegui: misc_multiinstance.my.cnf.erb: Set gtid_domain_id=0 [puppet] - https://gerrit.wikimedia.org/r/922790 (https://phabricator.wikimedia.org/T336228)
[06:16:10] (CR) Marostegui: "This is a noop in the sense that the host using this file aren't and won't ever be masters."
[puppet] - https://gerrit.wikimedia.org/r/922790 (https://phabricator.wikimedia.org/T336228) (owner: Marostegui)
[06:18:09] (CR) Marostegui: [C: +2] misc_multiinstance.my.cnf.erb: Set gtid_domain_id=0 [puppet] - https://gerrit.wikimedia.org/r/922790 (https://phabricator.wikimedia.org/T336228) (owner: Marostegui)
[06:35:05] (PS1) Jelto: trafficserver: switch annual.wikimedia.org backend [puppet] - https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041)
[06:50:45] (PS6) Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491)
[06:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:04] Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:12] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:02:15] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:09:52] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:09:56] ^ expected
[07:10:50] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:11:14] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:11:17] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:13:56] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:14:32] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:17:32] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[07:20:38] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) is CRITICAL: Test Suggest a target title for the given source title and language pairs
returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[07:24:56] (PS1) Giuseppe Lavagetto: Add the possibility to override CI settings using a .fixturesctl.yaml files [deployment-charts] - https://gerrit.wikimedia.org/r/922793 (https://phabricator.wikimedia.org/T337359)
[07:30:02] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[07:31:16] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:31:18] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:33:02] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:33:05] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:33:40] ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T337364 (phaultfinder)
[07:37:58] SRE, Data-Engineering,
Shared-Data-Infrastructure, serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (elukey) @jbond thanks for checking! I think that the main question mark is what a client cert for kafka mirror maker (and potentially also...
[07:40:58] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[07:41:06] SRE, Data-Engineering, Shared-Data-Infrastructure, serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (JMeybohm) >>! In T337248#8875545, @elukey wrote: > @jbond thanks for checking! I think that the main question mark is what a client cert f...
[07:41:14] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[07:42:20] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[07:42:36] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[07:48:32] (PS2) Daimona Eaytoy: Configure logging for the CampaignEvents channel [mediawiki-config] - https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T320434)
[07:48:51] (PS3) Daimona Eaytoy: Configure logging for the CampaignEvents channel [mediawiki-config] - https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365)
[07:50:26] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[08:05:07] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/922565 (owner: Hashar)
[08:09:06] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:41] (PS1) Elukey: profile::kafka::mirror: add support for PKI certificate [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T333124)
[08:11:04]
(CR) CI reject: [V: -1] profile::kafka::mirror: add support for PKI certificate [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T333124) (owner: Elukey)
[08:12:20] (PS2) Elukey: profile::kafka::mirror: add support for PKI certificate [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T333124)
[08:13:20] RECOVERY - Check systemd state on mw1454 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:39] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41288/console" [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T333124) (owner: Elukey)
[08:16:43] (PS3) Elukey: profile::kafka::mirror: add support for PKI certificate [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248)
[08:24:04] (CR) Hashar: "I have added some very rough QUnit tests ;)" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: Hashar)
[08:26:20] (CR) Volans: [C: +1] "Ship it :)" [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[08:26:32] jouncebot: nowandnext
[08:26:33] No deployments scheduled for the next 1 hour(s) and 33 minute(s)
[08:26:33] In 1 hour(s) and 33 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1000)
[08:26:38] (PS2) Urbanecm: Migrate GrowthExperiments config to its own file [mediawiki-config] - https://gerrit.wikimedia.org/r/921599 (https://phabricator.wikimedia.org/T308932)
[08:26:44] (CR) Urbanecm: [C: +2] Migrate GrowthExperiments config to its own file [mediawiki-config] - https://gerrit.wikimedia.org/r/921599
(https://phabricator.wikimedia.org/T308932) (owner: Urbanecm)
[08:27:31] (Merged) jenkins-bot: Migrate GrowthExperiments config to its own file [mediawiki-config] - https://gerrit.wikimedia.org/r/921599 (https://phabricator.wikimedia.org/T308932) (owner: Urbanecm)
[08:28:14] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/922565 (owner: Hashar)
[08:28:39] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:921599|Migrate GrowthExperiments config to its own file (T308932)]]
[08:28:44] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[08:29:56] PROBLEM - puppet last run on cuminunpriv1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604866 seconds, message: jmm testing, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:29:59] (PuppetDisabled) firing: Puppet disabled on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[08:34:10] (CR) JMeybohm: "nits only" [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[08:35:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:921599|Migrate GrowthExperiments config to its own file (T308932)]] (duration: 07m 20s)
[08:36:04] * urbanecm done
[08:36:04] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[08:38:25] (CR) Jaime Nuche: [C: -1] "Please update step 3 of the instructions for the release Jenkins.
You should run the `deploy.sh` script in the repo, not `scap deploy`" [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:38:49] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Volans) @Jclark-ctr the DHCP traffic is back to the install servers (see the related task for more details). For now with a workaround but netops are looking for...
[08:39:45] (CR) Jaime Nuche: [C: +1] contint: Jenkins slave > agent [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:39:53] (CR) Jaime Nuche: [C: +1] contint: set Jenkins agent username from hiera [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:40:03] (NodeTextfileStale) firing: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:41:06] (PS4) Elukey: profile::kafka::mirror: add support for PKI certificate [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248)
[08:41:34] (CR) Elukey: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[08:42:29] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41289/console" [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[08:45:34] SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) >>!
In T334429#8833688, @Jhancock.wm wrote: > The recommended fix for this one (according to Dell) is a reboot and see if the error comes back....
[08:48:22] (PS1) Marostegui: db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922797
[08:48:39] (CR) Dreamy Jazz: wm-patch-demo: initial implementation (1 comment) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: Hashar)
[08:49:01] (CR) Marostegui: [C: +2] db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922797 (owner: Marostegui)
[08:49:21] !log Stop mariadb on db1154 (sanitarium) there will be lag on clouddb* hosts
[08:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:02] (CR) Hashar: wm-patch-demo: initial implementation (1 comment) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: Hashar)
[08:50:16] (PS2) DCausse: [cirrus] Fix typo in config var [mediawiki-config] - https://gerrit.wikimedia.org/r/801792
[08:50:50] (PS4) Hashar: wm-patch-demo: initial implementation [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474)
[08:50:58] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[08:51:43] !log akosiaris@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-codfw cluster: Reboot kafka nodes
[08:52:08] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:39] !log repooling mw2248.codfw.wmnet - T334429
[08:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:43] T334429: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429
[08:54:18] SRE, SRE-tools, Infrastructure-Foundations, netops: DHCP traffic to install server is
missing - https://phabricator.wikimedia.org/T337345 (cmooney) >>! In T337345#8875421, @ayounsi wrote: > So it's either a Junos bug or the need for another nerd knob. > Edit: [[ https://www.juniper.net/documenta...
[08:55:57] (PS1) Marostegui: Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922403
[08:56:52] (CR) Marostegui: [C: +2] Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922403 (owner: Marostegui)
[08:57:05] (CR) Btullis: profile::kafka::mirror: add support for PKI certificate (2 comments) [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[08:59:32] (PS1) Marostegui: db1155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922798
[08:59:56] (CR) Marostegui: [C: +2] db1155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922798 (owner: Marostegui)
[09:00:17] (PS1) Jbond: admin: Re-enable hghani [puppet] - https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145)
[09:00:38] SRE, SRE-Unowned, User-AKlapper: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada) I think we can close this task for now. In the time since filing it, there hasn't been a real need to understand how we're doing on Yandex or Bing....
[09:00:59] SRE, Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (SCherukuwada)
[09:01:00] (CR) CI reject: [V: -1] admin: Re-enable hghani [puppet] - https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: Jbond)
[09:01:03] SRE, Search-Console-access-request: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (SCherukuwada)
[09:01:16] SRE, SRE-Unowned, User-AKlapper: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada) Open→Resolved a:SCherukuwada
[09:02:32] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (jbond) Thanks i have crated the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/922799 | change ]] just need confirmation o...
[09:03:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:04:18] !log dcausse@deploy1002 Started deploy [airflow-dags/search@c08e884]: search: build and use a smaller cirrus index dataset
[09:04:35] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@c08e884]: search: build and use a smaller cirrus index dataset (duration: 00m 17s)
[09:08:22] jouncebot: nowandnext
[09:08:22] No deployments scheduled for the next 0 hour(s) and 51 minute(s)
[09:08:22] In 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1000)
[09:08:41] (PS1) Marostegui: Revert "db1155: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922404
[09:09:02] Folks, I'd like to create a new table on beta now: https://phabricator.wikimedia.org/T336362 May I go ahead when ready?
[09:09:15] (PS5) Elukey: profile::kafka::mirror: add support for PKI certificate [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248)
[09:09:21] (CR) Elukey: profile::kafka::mirror: add support for PKI certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[09:09:35] (CR) Marostegui: [C: +2] Revert "db1155: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/922404 (owner: Marostegui)
[09:09:54] o/ @Daimona
[09:09:58] (CR) Elukey: profile::kafka::mirror: add support for PKI certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[09:10:10] And also... Would someone be willing to deploy a beta-only change once the table has been created?
[09:10:34] SRE, Data-Engineering, Shared-Data-Infrastructure, serviceops, Patch-For-Review: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (jbond) > We can create multiple certs with the same CN on different machines (or even on the same machine). Thats us...
[09:12:09] Puppet, SRE, Infrastructure-Foundations, Machine-Learning-Team, ORES: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002 (elukey) Open→Declined The ML team is focusing on https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing, the replacement of ORES...
[09:12:45] SRE, Machine-Learning-Team, ORES: Clean up redundant ORES celery_workers defaults - https://phabricator.wikimedia.org/T186734 (elukey) Open→Declined The ML team is focusing on https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing, the replacement of ORES. Please re-open if you feel th...
[09:12:51] (PS1) Daimona Eaytoy: [beta] Set $wgCampaignEventsUseNewTrackingToolsSchema to true [mediawiki-config] - https://gerrit.wikimedia.org/r/922800 (https://phabricator.wikimedia.org/T336362)
[09:18:17] Since it's a beta-only schema change and there doesn't seem to be anything wild going on, I'll go ahead in a few minutes barring objections (CC @HouseOfM)
[09:22:36] Going ahead now, logging in #wikimedia-releng
[09:24:24] (CR) Hnowlan: [C: +1] benthos: create image [docker-images/production-images] - https://gerrit.wikimedia.org/r/922142 (https://phabricator.wikimedia.org/T336658) (owner: Kamila Součková)
[09:25:45] (PS1) Hnowlan: thumbor: remove imagemagick pins [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/922803
[09:25:55] Aaaand done, @HouseOfM can you help me test that nobody broke? Just do some random tests on beta (enable event, register, delete, etc.)
[09:26:16] Sure :)
[09:30:46] All looking good to me @Daimona
[09:31:19] Puppet, Infrastructure-Foundations: role_owner.prom not getting updated on (re)installed hosts? - https://phabricator.wikimedia.org/T337375 (fgiunchedi)
[09:31:33] Yup, same here. Guess we can call this done then, thank you :)
[09:31:45] (CR) Vgutierrez: Create cookbook to upgrade Apache Traffic Server (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[09:31:50] RECOVERY - mediawiki-installation DSH group on mw2448 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:32:02] (CR) Jbond: "lgtm a few minor nits" [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[09:32:09] o7
[09:36:23] (PS1) Jbond: cfssl::cert: allow users to override the mode of the outdir [puppet] - https://gerrit.wikimedia.org/r/922804
[09:36:41] (CR) Jbond: profile::kafka::mirror: add support for PKI certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: Elukey)
[09:38:01] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41290/console" [puppet] - https://gerrit.wikimedia.org/r/922804 (owner: Jbond)
[09:38:17] (CR) Jbond: [V: +1 C: +2] cfssl::cert: allow users to override the mode of the outdir [puppet] - https://gerrit.wikimedia.org/r/922804 (owner: Jbond)
[09:42:29] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2448.codfw.wmnet
[09:42:29] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2448.codfw.wmnet
[09:46:27] Daimona: if you give me the beta-only change, I can merge it
[09:46:48] Ohhh that'd be great, thanks :) Here's the change:
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/922800/
[09:47:16] SRE, SRE-tools, Infrastructure-Foundations, netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (cmooney) The Juniper [[ https://www.juniper.net/documentation/us/en/software/junos/dhcp/topics/topic-map/dhcp-relay-agent-security-devices.html | docs ]] do s...
[09:47:43] (CR) Zabe: [C: +2] [beta] Set $wgCampaignEventsUseNewTrackingToolsSchema to true [mediawiki-config] - https://gerrit.wikimedia.org/r/922800 (https://phabricator.wikimedia.org/T336362) (owner: Daimona Eaytoy)
[09:48:27] yw :)
[09:48:30] (Merged) jenkins-bot: [beta] Set $wgCampaignEventsUseNewTrackingToolsSchema to true [mediawiki-config] - https://gerrit.wikimedia.org/r/922800 (https://phabricator.wikimedia.org/T336362) (owner: Daimona Eaytoy)
[09:49:38] Nice :) BTW, I have another change for beta + prod but I think it'd be better to split that into two changes (one for beta, one for prod). Here it is: https://gerrit.wikimedia.org/r/c/919838 - Would you be willing to merge the beta-only version if I split it now?
[09:50:20] (03PS1) 10Klausman: API GW: Fix RegEx in config for revertrisk models on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) [09:50:57] (03CR) 10Jbond: [C: 03+2] "LGTm will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/922565 (owner: 10Hashar) [09:51:18] (03PS1) 10Daimona Eaytoy: [beta] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922806 (https://phabricator.wikimedia.org/T337365) [09:52:04] (03PS4) 10Daimona Eaytoy: [prod] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) [09:53:02] sure I can merge the beta one [09:53:07] i guess its https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/922806/ [09:53:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41291/console" [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:53:35] Yup, just created it [09:53:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922806 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy) [09:54:14] (03CR) 10Clément Goubert: mediawiki: Change naming scheme for resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [09:54:27] (03PS3) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [09:54:39] (03Merged) 10jenkins-bot: [beta] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922806 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy) [09:54:47] done [09:55:06] 
Amazing, thanks again :) [09:55:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:56:45] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [09:57:31] (03CR) 10Clément Goubert: [C: 03+1] monitoring: introduce exclude list for checking systemd units (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [09:58:25] (03PS1) 10Filippo Giunchedi: sre: update exclusion expression for NodeTextfileStale [alerts] - 10https://gerrit.wikimedia.org/r/922807 (https://phabricator.wikimedia.org/T337375) [09:58:26] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) Hopping on this thread to confirm that we are now able to store sn... [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1000) [10:02:19] (03CR) 10MVernon: cassandra: add support for version 4.1.1 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [10:03:23] (03CR) 10Jbond: "adding an optional nit i missed last time" [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [10:04:53] (03CR) 10Hashar: "Thanks for the PCC and review! Eoghan is switching over the doc hosts currently and I am not there this afternoon. 
But we can do that tom" [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [10:08:13] (03PS1) 10Effie Mouzeli: kubernetes.yaml: add iPoid user/tokens [labs/private] - 10https://gerrit.wikimedia.org/r/922808 (https://phabricator.wikimedia.org/T325147) [10:08:28] iPoid? srsly? :D [10:08:50] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] kubernetes.yaml: add iPoid user/tokens [labs/private] - 10https://gerrit.wikimedia.org/r/922808 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [10:09:34] (03PS6) 10Jelto: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [10:10:50] (03CR) 10Jcrespo: "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [10:12:50] (03PS2) 10Jelto: gitlab: sync all configured providers [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [10:13:41] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922810 (https://phabricator.wikimedia.org/T336834) (owner: 10WMDE-Fisch) [10:14:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10jbond) >>! In T316358#8469318, @cmooney wrote: > @jbond I've uplaoded a separate patch (above) that makes a stab and working this clos... 
[10:16:40] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41292/console" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [10:19:16] (03CR) 10EoghanGaffney: [C: 03+2] Move doc.discovery.wmnet to new bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/922493 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [10:19:44] (03CR) 10Jbond: First stab at possible ferm::qos resource for DSCP marking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:20:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/922807 (https://phabricator.wikimedia.org/T337375) (owner: 10Filippo Giunchedi) [10:20:47] (03CR) 10Cathal Mooney: [C: 03+2] Automate and update DHCP relay configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [10:21:22] (03CR) 10EoghanGaffney: [C: 03+2] Switch doc host from doc1002 to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [10:21:34] (03PS3) 10EoghanGaffney: Switch doc host from doc1002 to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) [10:22:18] (03CR) 10EoghanGaffney: Switch doc host from doc1002 to doc1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922487 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [10:22:39] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10cmooney) >>! In T337345#8874938, @Volans wrote: > I wonder if this has something to do with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/908346... 
[10:24:06] (03CR) 10Jelto: [V: 03+1] "I rebased the change. Can you double check? Also one question in-line." [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [10:25:50] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable Kartographer Nearby on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922810 (https://phabricator.wikimedia.org/T336834) (owner: 10WMDE-Fisch) [10:31:08] (03PS1) 10Jbond: idp cloud: add gitlab instance [puppet] - 10https://gerrit.wikimedia.org/r/922814 [10:31:38] (03PS6) 10Clément Goubert: monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [10:31:40] PROBLEM - Check systemd state on doc1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service,rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:18] RECOVERY - Check systemd state on doc1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:04] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: update exclusion expression for NodeTextfileStale [alerts] - 10https://gerrit.wikimedia.org/r/922807 (https://phabricator.wikimedia.org/T337375) (owner: 10Filippo Giunchedi) [10:48:01] (NodeTextfileStale) resolved: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:50:03] (NodeTextfileStale) resolved: (3) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:56:32] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-codfw cluster: Reboot kafka nodes [10:58:49] (03CR) 10FNegri: [C: 03+2] wmnet: Remove nfs-tools-project.svc.eqiad [dns] - 10https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [10:59:17] (03CR) 10Btullis: profile::kafka::mirror: add support for PKI certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [10:59:29] (03CR) 10Clément Goubert: monitoring: introduce exclude list for checking systemd units (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [10:59:43] (03PS4) 10FNegri: wmnet: Remove nfs-tools-project.svc.eqiad [dns] - 10https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [10:59:53] (03CR) 10FNegri: [V: 03+2] wmnet: Remove nfs-tools-project.svc.eqiad [dns] - 10https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [11:09:45] (03PS11) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [11:09:47] (03PS11) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [11:09:49] (03PS12) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [11:09:51] (03PS1) 10Jbond: proffile::firewall: create new firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) [11:09:53] (03PS1) 10Jbond: base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 
(https://phabricator.wikimedia.org/T279683) [11:11:55] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:12:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41294/console" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:13:57] (03CR) 10Jbond: [C: 03+1] monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [11:14:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41295/console" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:14:40] (03PS13) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [11:17:06] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:20:57] (03CR) 10Jbond: [V: 03+1] "from the pcc report" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:21:23] (03CR) 10Giuseppe Lavagetto: mediawiki: Change naming scheme for resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [11:22:46] (03PS2) 10Jbond: proffile::firewall: create new firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) [11:22:48] (03PS12) 10Jbond: 
profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [11:22:50] (03PS2) 10Jbond: base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) [11:22:52] (03PS12) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [11:22:54] (03PS14) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [11:22:56] (03PS1) 10Jbond: base::firewall: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/922819 [11:23:02] (03CR) 10Jbond: [C: 03+2] base::firewall: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/922819 (owner: 10Jbond) [11:23:29] (03CR) 10Jbond: [C: 03+2] idp cloud: add gitlab instance [puppet] - 10https://gerrit.wikimedia.org/r/922814 (owner: 10Jbond) [11:23:31] (03CR) 10CI reject: [V: 04-1] profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:24:34] (03CR) 10jenkins-bot: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:26:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Manuel) 05Resolved→03Open Hi CDanis, while SSH is working, I get the following message when I try to use kinit in JupyterHub: Client 'manuel-wmde@WIKIMEDIA' not found in Kerberos databa... 
[11:29:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [11:31:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41296/console" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:32:12] (03PS1) 10Aklapper: Phabricator monthly email: Improve Differential user activity stats [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) [11:33:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:02] (03CR) 10Clément Goubert: mediawiki: Change naming scheme for resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [11:38:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:46] (03PS8) 10Clément Goubert: mediawiki: Change naming scheme for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) [11:42:04] (03PS4) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [11:42:24] (03CR) 10Clément Goubert: mediawiki: Change naming scheme for resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 
(https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [11:42:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10jbond) @Manuel Seems like the analytics-privatedata-users access was not configured. @Ottomata / @odimitrijevic are you able to approve Manuel's access to analytics-privatedata-users, thanks [11:44:23] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [11:47:11] (03PS5) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [11:48:25] (03PS3) 10Jbond: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:49:16] (03CR) 10Jbond: [V: 03+1] profile::gerrit: make dependency on gerrit class explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [11:49:26] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [11:49:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41297/console" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:49:32] (03PS6) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [11:50:02] (03CR) 10JMeybohm: [C: 03+2] Execute all tests if CI changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/922639 (owner: 10Giuseppe Lavagetto) [11:50:31] (03CR) 10CI reject: [V: 04-1] gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:50:47] (03CR) 10Jbond: [V: 03+1] gerrit: remove duplicate 
$gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:51:00] (03Abandoned) 10Jbond: profile::gerrit: make dependency on gerrit class explicit [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [11:51:36] (03CR) 10Jbond: [V: 03+1] "will fix tests after lucnh" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:51:47] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [11:52:40] (03PS1) 10Sergio Gimeno: MultiPaneDialog: remove attribute hidden instead of class [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922405 (https://phabricator.wikimedia.org/T337256) [11:52:43] (03CR) 10Awight: [C: 03+1] "Seems right! Confirmed that the flags are still needed, they reverse the defaults provided in Kartographer/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922810 (https://phabricator.wikimedia.org/T336834) (owner: 10WMDE-Fisch) [11:56:33] (03CR) 10Giuseppe Lavagetto: Make kubernetes::clusters the central place for k8s config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:56:59] (03Merged) 10jenkins-bot: Execute all tests if CI changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/922639 (owner: 10Giuseppe Lavagetto) [11:59:35] (03PS7) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:00:26] (03PS1) 10KartikMistry: Update cxserver to 2023-05-24-115506-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290) [12:02:02] (03PS1) 10JMeybohm: Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/922828 [12:02:04] (03PS1) 10JMeybohm: Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 [12:03:01] (03CR) 10CI reject: [V: 04-1] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [12:03:30] (03PS1) 10Jcrespo: Add tmpdir removal, now that upload is stable [software/mediabackups] - 10https://gerrit.wikimedia.org/r/922830 (https://phabricator.wikimedia.org/T327157) [12:03:32] (03CR) 10CI reject: [V: 04-1] Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 (owner: 10JMeybohm) [12:05:07] (03PS1) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [12:06:31] (03PS2) 10JMeybohm: Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 [12:06:33] (03PS2) 10JMeybohm: Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 [12:07:10] (03CR) 10CI reject: [V: 04-1] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [12:12:41] (03PS8) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:15:33] (03PS9) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:15:54] (03PS4) 10Jbond: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [12:16:19] (03CR) 10JMeybohm: "This is expected to fail CI as the change is not compatible with k8s 1.16." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [12:18:35] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove global rules [puppet] - 10https://gerrit.wikimedia.org/r/921248 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [12:18:44] (03PS3) 10Filippo Giunchedi: prometheus: remove global rules [puppet] - 10https://gerrit.wikimedia.org/r/921248 (https://phabricator.wikimedia.org/T288196) [12:18:56] (03PS10) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:19:30] (03PS6) 10Jelto: gitlab: add check for running backups in the background [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) [12:20:05] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41301/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [12:21:32] (03CR) 10JMeybohm: [C: 03+1] Patch helm defaults in helmfile during CI tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/922563 (owner: 10Giuseppe Lavagetto) [12:22:03] (03PS11) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:22:05] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: 10Jelto) [12:22:35] (03CR) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [12:23:18] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41302/console" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [12:23:24] (03CR) 10JMeybohm: [C: 03+1] "We should have this documented somewhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922793 
(https://phabricator.wikimedia.org/T337359) (owner: 10Giuseppe Lavagetto) [12:23:46] (03PS1) 10Jbond: DO NOT MERGE: testing empty require [puppet] - 10https://gerrit.wikimedia.org/r/922832 [12:24:46] (03CR) 10JMeybohm: [V: 03+1] Make kubernetes::clusters the central place for k8s config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:25:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41303/console" [puppet] - 10https://gerrit.wikimedia.org/r/922832 (owner: 10Jbond) [12:25:14] (03PS3) 10Gergő Tisza: [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) [12:25:59] (03PS1) 10Jaime Nuche: doc: allow gitlab runners to publish docs only through `doc-gitlab` [puppet] - 10https://gerrit.wikimedia.org/r/922834 (https://phabricator.wikimedia.org/T336168) [12:26:06] (03CR) 10Jelto: [C: 03+2] gitlab: add check for running backups in the background (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: 10Jelto) [12:27:05] (03PS2) 10Jbond: DO NOT MERGE: testing empty require [puppet] - 10https://gerrit.wikimedia.org/r/922832 [12:28:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41304/console" [puppet] - 10https://gerrit.wikimedia.org/r/922832 (owner: 10Jbond) [12:28:40] (03Merged) 10jenkins-bot: gitlab: add check for running backups in the background [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: 10Jelto) [12:29:59] (PuppetDisabled) firing: Puppet disabled on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [12:30:19] (03PS3) 10Jbond: DO NOT MERGE: testing empty require [puppet] - 10https://gerrit.wikimedia.org/r/922832 [12:31:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41305/console" [puppet] - 10https://gerrit.wikimedia.org/r/922832 (owner: 10Jbond) [12:32:59] > Notice: Undefined variable: wgGERestbaseUrl in /srv/mediawiki/wmf-config/CommonSettings-labs.php on line 374 [12:33:48] (03CR) 10Ottomata: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [12:33:53] (03CR) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [12:35:56] (03PS7) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) [12:36:57] (03CR) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [12:37:28] (03Abandoned) 10Jbond: DO NOT MERGE: testing empty require [puppet] - 10https://gerrit.wikimedia.org/r/922832 (owner: 10Jbond) [12:42:40] (03PS1) 10Aklapper: Automate quarterly Phabricator metrics for Tech Community Newsletter [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) [12:42:49] (03CR) 10Gergő Tisza: [C: 04-1] [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) (owner: 10Gergő Tisza) [12:43:06] (03PS4) 10Gergő Tisza: [beta] GrowthExperiments: Use 
ActionApiImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) [12:44:58] (03CR) 10CI reject: [V: 04-1] Automate quarterly Phabricator metrics for Tech Community Newsletter [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: 10Aklapper) [12:45:41] (03PS1) 10JMeybohm: Add README, enhance changelog and switch to source format 3 [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/922837 [12:46:56] (03PS2) 10JMeybohm: Add README, enhance changelog and switch to source format 3 [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/922837 (https://phabricator.wikimedia.org/T300324) [12:47:22] !log running changeWikiConfig.php on Growth pilot wikis for T337348 [12:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:27] T337348: Section-level images: Set up beta cluster - https://phabricator.wikimedia.org/T337348 [12:50:22] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) (owner: 10Gergő Tisza) [12:50:24] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools newtopictool on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922838 (https://phabricator.wikimedia.org/T317375) [12:51:10] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) (owner: 10Gergő Tisza) [12:55:51] !log `[samtar@mwmaint1002 ~]$ mwscript findBadBlobs --wiki nowiki --revisions 5227369 --mark T337392` T337392 [12:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:56] T337392: MediaWiki\Revision\RevisionAccessException: Failed to load data blob from {address} for revision 
{revision}. If this problem persist, use the findBadBlobs maintenance script to investigate the issue and mark bad blobs. - https://phabricator.wikimedia.org/T337392 [12:57:49] (03PS1) 10Ottomata: mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) [12:58:24] (03CR) 10Jbond: [C: 04-1] "lgtm but some minor issues with the overriding, see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1300). [13:00:04] herron, dcausse, WMDE-Fisch, sergi0, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] (I have a hard stop for a meeting in 1 hour, so if someone else could take this deployment window, that'd be great) [13:00:21] \o [13:00:21] o/ [13:00:48] hi. i have a bit too much of stuff. i'm happy to reschedule if we're out of time [13:01:00] (my fault for not planning things better) [13:01:01] (03CR) 10Herron: [C: 03+1] "🙌" [puppet] - 10https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [13:01:14] hello [13:01:34] (03PS1) 10Stevemunene: Decommission an-worker1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/922841 [13:02:04] I can deploy I guess [13:02:41] (03PS2) 10Aklapper: Automate quarterly Phabricator metrics for Tech Community Newsletter [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) [13:02:50] dcausse: are you sure? 
I don't mind starting, but I can't run over today [13:03:25] TheresNoTime: oh please go ahead if you have time :) [13:03:39] (03PS3) 10Samtar: arclamp: switch redis server to arclamp1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:03:58] (03CR) 10Herron: [V: 03+1 C: 03+2] arclamp: switch redis server to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:04:06] sure - herron, are you available for ^ ? [13:04:13] TheresNoTime: yes ready when you are [13:04:39] herron: I note the patch mentions restarting a service - are you able to do that if needed? [13:04:49] TheresNoTime: yes prepping that now [13:04:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:05:44] (03Merged) 10jenkins-bot: arclamp: switch redis server to arclamp1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:06:18] !log samtar@deploy1002 Started scap: Backport for [[gerrit:920298|arclamp: switch redis server to arclamp1001 (T327277)]] [13:06:23] T327277: Move excimer/arclamp redis from mwlog to arclamp hosts - https://phabricator.wikimedia.org/T327277 [13:06:57] herron: is this testable on mwdebug, or should I sync straight away? [13:07:14] TheresNoTime: straight away please [13:07:25] ack [13:07:31] !log tools.codesearch Deployed https://gerrit.wikimedia.org/r/c/labs/codesearch/+/909258 and also restarted tool instances to core search backend was dead. 
[13:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:49] !log samtar@deploy1002 herron and samtar: Backport for [[gerrit:920298|arclamp: switch redis server to arclamp1001 (T327277)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:08:02] syncing [13:08:05] TheresNoTime: /me waves in case he can help [13:08:23] urbanecm: thank you, I may need to head off a little early, but I'll work through the queue until then [13:08:32] happy to take over then [13:14:03] TheresNoTime: if i may, a pro tip: `scap backport --yes CHANGEID` skips mwdebug right away :) [13:14:11] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:920298|arclamp: switch redis server to arclamp1001 (T327277)]] (duration: 07m 53s) [13:14:14] urbanecm: ah, thank you! [13:14:16] T327277: Move excimer/arclamp redis from mwlog to arclamp hosts - https://phabricator.wikimedia.org/T327277 [13:14:20] herron: live on prod :) [13:14:23] TheresNoTime: thanks! [13:14:27] dcausse: doing 801792 now [13:14:29] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy FORCED [13:14:36] thanks! 
[13:14:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [13:15:21] (03PS1) 10Fabfur: WIP: sre.cdn: First commit for a cookbook to switch Varnish/HAProxy listening on port 80 [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 [13:15:40] (03Merged) 10jenkins-bot: [cirrus] Fix typo in config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [13:16:09] !log samtar@deploy1002 Started scap: Backport for [[gerrit:801792|[cirrus] Fix typo in config var]] [13:17:07] (03PS4) 10Jameel Kaisar: Set NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) [13:17:36] !log samtar@deploy1002 samtar and dcausse: Backport for [[gerrit:801792|[cirrus] Fix typo in config var]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:17:38] dcausse: live on mwdebug, do you need to test? 
[13:18:17] 10SRE, 10MediaWiki-General, 10Platform Engineering, 10Security-Team, and 4 others: CVE-2023-29141: X-Forwarded-For header allows brute-forcing autoblocked IP addresses - https://phabricator.wikimedia.org/T285159 (10sbassett) [13:18:23] TheresNoTime: yes [13:18:44] ack [13:18:47] (03CR) 10Jameel Kaisar: Set NetworkProbeLimit cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:19:09] (03PS2) 10Samtar: Enable Kartographer Nearby on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922810 (https://phabricator.wikimedia.org/T336834) (owner: 10WMDE-Fisch) [13:20:47] TheresNoTime: all good [13:20:48] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T337364 (10Jclark-ctr) a:03Jclark-ctr [13:20:54] syncing [13:21:15] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T337364 (10Jclark-ctr) 05Open→03Resolved cable link is showing 1g now [13:22:02] (03CR) 10Elukey: [C: 03+1] API GW: Fix RegEx in config for revertrisk models on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) (owner: 10Klausman) [13:25:02] (03CR) 10Esanders: [C: 03+1] "Thanks!" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [13:25:40] (03CR) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [13:26:24] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:801792|[cirrus] Fix typo in config var]] (duration: 10m 15s) [13:26:34] dcausse: live on prod [13:26:38] TheresNoTime: thanks! [13:26:40] WMDE-Fisch: ready for 922810 ? [13:26:49] TheresNoTime: yes! 
[13:26:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922810 (https://phabricator.wikimedia.org/T336834) (owner: 10WMDE-Fisch) [13:27:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:27:43] (03Merged) 10jenkins-bot: Enable Kartographer Nearby on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922810 (https://phabricator.wikimedia.org/T336834) (owner: 10WMDE-Fisch) [13:28:04] (03CR) 10Andrew Bogott: "aaaaaand still!" [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott) [13:28:12] !log samtar@deploy1002 Started scap: Backport for [[gerrit:922810|Enable Kartographer Nearby on remaining wikis (T336834)]] [13:28:16] T336834: Deploy Nearby feature to remaining wikivoyages - https://phabricator.wikimedia.org/T336834 [13:28:19] 10SRE, 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10Jclark-ctr) [13:28:26] 10SRE, 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10Jclark-ctr) 05Open→03Resolved [13:29:41] !log samtar@deploy1002 samtar and wmde-fisch: Backport for [[gerrit:922810|Enable Kartographer Nearby on remaining wikis (T336834)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:29:48] WMDE-Fisch: on mwdebug for testing [13:29:58] TheresNoTime: Yes [13:30:05] sergi0: I'm going to set 922405 merging now FYI [13:30:16] TheresNoTime: cool, ty [13:30:24] (03PS6) 10Elukey: 
profile::kafka::mirror: add support for PKI certificate [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) [13:30:32] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922405 (https://phabricator.wikimedia.org/T337256) (owner: 10Sergio Gimeno) [13:30:34] TheresNoTime: Works, thank you. Go on. [13:30:41] syncing! [13:31:03] (03CR) 10Elukey: "Thanks for the review John!" [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [13:31:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41306/console" [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [13:31:56] (03CR) 10Elukey: profile::kafka::mirror: add support for PKI certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [13:32:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:34:46] urbanecm: after the current deploy (922810) is done, can I hand over to you for the remaining ones? 
[13:34:54] absolutely [13:34:59] (PuppetDisabled) resolved: Puppet disabled on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:35:06] thanks, I'll ping you :) [13:35:10] (03CR) 10Urbanecm: [C: 03+2] Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920238 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [13:35:28] I just +2'ed MatmaRex's to save time on CI [13:35:30] waiting for ping [13:36:06] RECOVERY - puppet last run on cuminunpriv1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:36:15] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1122.eqiad.wmnet - https://phabricator.wikimedia.org/T336833 (10Jclark-ctr) [13:36:16] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:922810|Enable Kartographer Nearby on remaining wikis (T336834)]] (duration: 08m 04s) [13:36:19] urbanecm: note, i have 3 wmf.9 patches [13:36:21] T336834: Deploy Nearby feature to remaining wikivoyages - https://phabricator.wikimedia.org/T336834 [13:36:22] WMDE-Fisch: live on prod :) [13:36:28] MatmaRex: i only see one? [13:36:28] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1122.eqiad.wmnet - https://phabricator.wikimedia.org/T336833 (10Jclark-ctr) 05Open→03Resolved [13:36:31] urbanecm: all yours, thank you [13:36:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Hghani) Hi contract end date is November 30 2023. 
Contact: @kzimmerman [13:36:51] TheresNoTime: <3 [13:36:52] https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fextensions%252FDiscussionTools+branch:wmf%252F1.41.0-wmf.9+status:open [13:37:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10ssingh) @Jclark-ctr: Hi John, Traffic has completed its work on the dns hosts in codfw, so whenever you are ready to work on this, please go ahead. All we need from you is to finish the... [13:37:07] * TheresNoTime away [13:37:10] (03PS2) 10Urbanecm: Enable DiscussionTools newtopictool on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922838 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [13:37:12] they can all go together [13:37:16] MatmaRex: ack [13:37:21] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725 (10Jclark-ctr) [13:37:25] (03CR) 10Urbanecm: [C: 03+2] Define $maintClass in maintenance script for compatibility [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920731 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [13:37:27] (03CR) 10Urbanecm: [C: 03+2] NewTopicOptOutActiveUsers: Skip bot users etc. [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920733 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [13:37:53] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725 (10Jclark-ctr) 05Open→03Resolved [13:38:14] MatmaRex: the config patch is blocked by backports, right? 
[13:38:34] yes [13:38:44] the order is: backports → maintenance → config [13:39:18] (03PS1) 10Effie Mouzeli: ipoid: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/922845 (https://phabricator.wikimedia.org/T325147) [13:39:22] ack [13:39:30] (03CR) 10CI reject: [V: 04-1] ipoid: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/922845 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [13:39:32] (03PS2) 10Klausman: helmfile.d: Fix regex in api-gateway's config for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) [13:39:41] (03CR) 10Klausman: helmfile.d: Fix regex in api-gateway's config for revertrisk (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) (owner: 10Klausman) [13:40:03] (03PS3) 10Urbanecm: [Growth] Add mediawiki.mentor_dashboard.interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918500 (https://phabricator.wikimedia.org/T325117) [13:40:07] (03PS2) 10Effie Mouzeli: ipoid: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/922845 (https://phabricator.wikimedia.org/T325147) [13:40:15] waiting on CI [13:40:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:41:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:48] (03CR) 10Ottomata: [C: 03+2] "Was just talking to Steve about this, we can see no reason not to merge it. 
Merging it to get it out of our review queue 😊" [puppet] - 10https://gerrit.wikimedia.org/r/779897 (owner: 10Ottomata) [13:47:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918500 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [13:47:54] (03Merged) 10jenkins-bot: [Growth] Add mediawiki.mentor_dashboard.interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918500 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [13:48:21] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:918500|[Growth] Add mediawiki.mentor_dashboard.interaction (T325117)]] [13:48:26] T325117: Personalized Praise: Instrumentation - https://phabricator.wikimedia.org/T325117 [13:48:41] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I take John's point about following up with the change in where we manage /etc/kafka but in essence I think that this is" [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [13:49:44] (03CR) 10Volans: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/922488 (owner: 10Ayounsi) [13:51:39] (03Merged) 10jenkins-bot: MultiPaneDialog: remove attribute hidden instead of class [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922405 (https://phabricator.wikimedia.org/T337256) (owner: 10Sergio Gimeno) [13:51:40] (03Merged) 10jenkins-bot: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920238 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [13:51:42] (03Merged) 10jenkins-bot: Define $maintClass in maintenance script for compatibility [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920731 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) 
[13:51:45] (03Merged) 10jenkins-bot: NewTopicOptOutActiveUsers: Skip bot users etc. [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920733 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [13:51:53] (03PS1) 10Jbond: udp2log: Add docs, validation and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/922866 (https://phabricator.wikimedia.org/T276623) [13:51:55] (03Merged) 10jenkins-bot: Add Python 3.11 support [cookbooks] - 10https://gerrit.wikimedia.org/r/922488 (owner: 10Ayounsi) [13:51:57] (03PS1) 10Jbond: udp2log: update to take account of systemd updates [puppet] - 10https://gerrit.wikimedia.org/r/922867 (https://phabricator.wikimedia.org/T276623) [13:52:06] urbanecm: do you want to do the script run, or should i reschedule it for the evening? [13:52:21] MatmaRex: i can do it, no worries [13:52:32] ok thanks [13:54:15] (03PS27) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [13:55:27] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:918500|[Growth] Add mediawiki.mentor_dashboard.interaction (T325117)]] (duration: 07m 06s) [13:55:32] T325117: Personalized Praise: Instrumentation - https://phabricator.wikimedia.org/T325117 [13:55:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:42] (03PS5) 10Hashar: wm-patch-demo: initial implementation [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) [13:56:53] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:922405|MultiPaneDialog: remove attribute hidden instead of class (T337256)]], [[gerrit:920238|Add maint script to opt out 
active users from the new topic tool (T317375)]], [[gerrit:920731|Define $maintClass in maintenance script for compatibility (T317375)]], [[gerrit:920733|NewTopicOptOutActiveUsers: Skip bot users etc. (T317375)]] [13:56:58] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [13:56:58] T337256: Only first onboarding step is shown - https://phabricator.wikimedia.org/T337256 [13:57:08] (03CR) 10Hashar: "I forgot to git add `test/wm-patch-demo.js` :/" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [13:57:10] sergi0: MatmaRex: your patches are up to next [13:57:36] ack [13:58:27] !log urbanecm@deploy1002 matmarex and urbanecm and sgimeno: Backport for [[gerrit:922405|MultiPaneDialog: remove attribute hidden instead of class (T337256)]], [[gerrit:920238|Add maint script to opt out active users from the new topic tool (T317375)]], [[gerrit:920731|Define $maintClass in maintenance script for compatibility (T317375)]], [[gerrit:920733|NewTopicOptOutActiveUsers: Skip bot users etc. (T317375)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:58:38] sergi0: MatmaRex: your patches are on mwdebug1002, can you test?
[13:58:51] sure, testing now [13:58:53] (03CR) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [13:59:07] mine are just maintenance script changes, nothing to test [13:59:13] okay [13:59:45] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:59:57] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:00:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) >>! In T322145#8876908, @Hghani wrote: > Hi contract end date is November 30 2023. > > Contact: @kzimmerman Thanks Hgjani,... [14:00:31] urbanecm: test ok from my side, you can sync [14:00:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:38] great, syncing [14:01:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [14:03:09] MatmaRex: script output: https://phabricator.wikimedia.org/P48502 [14:03:34] thanks! [14:03:44] (looks right) [14:03:49] good [14:04:03] MatmaRex: do i go ahead with the config then? 
[14:04:40] urbanecm: yes, please do [14:04:53] (03PS3) 10Urbanecm: Enable DiscussionTools newtopictool on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922838 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [14:04:57] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools newtopictool on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922838 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [14:05:22] (03CR) 10Clément Goubert: [C: 03+1] ipoid: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/922845 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [14:05:29] (03CR) 10Clément Goubert: [C: 03+1] admin_ng: Add iPoid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/921704 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [14:05:47] (03Merged) 10jenkins-bot: Enable DiscussionTools newtopictool on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922838 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [14:06:14] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:922405|MultiPaneDialog: remove attribute hidden instead of class (T337256)]], [[gerrit:920238|Add maint script to opt out active users from the new topic tool (T317375)]], [[gerrit:920731|Define $maintClass in maintenance script for compatibility (T317375)]], [[gerrit:920733|NewTopicOptOutActiveUsers: Skip bot users etc. 
(T317375)]] (duration: 09m 21s) [14:06:21] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [14:06:21] T337256: Only first onboarding step is shown - https://phabricator.wikimedia.org/T337256 [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:922838|Enable DiscussionTools newtopictool on fiwiki (T317375)]] [14:07:27] (03PS1) 10Volans: tests: make it compatible with urllib3 v2.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922869 [14:07:29] (03PS1) 10Volans: tox: make it compatible with tox 4.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922870 [14:08:28] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:922838|Enable DiscussionTools newtopictool on fiwiki (T317375)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:10:04] (03PS2) 10Jbond: udp2log: Add docs, validation and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/922866 (https://phabricator.wikimedia.org/T276623) [14:10:06] (03PS2) 10Jbond: udp2log: update to take account of systemd updates [puppet] - 10https://gerrit.wikimedia.org/r/922867 (https://phabricator.wikimedia.org/T276623) [14:10:45] (03CR) 10Jbond: [C: 03+1] tox: make it compatible with tox 4.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922870 (owner: 10Volans) [14:11:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41308/console" [puppet] - 10https://gerrit.wikimedia.org/r/922867 (https://phabricator.wikimedia.org/T276623) (owner: 10Jbond) [14:11:32] 
(JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:34] (03CR) 10Jbond: [C: 03+1] tests: make it compatible with urllib3 v2.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922869 (owner: 10Volans) [14:11:35] urbanecm: oh, are you waiting for me? looks good on mwdebug! [14:11:58] MatmaRex: oh sorry, i missed the ping from scap and thought it didn't complete yet. thanks for testing & proceeding! [14:12:12] (03CR) 10Clément Goubert: [C: 03+2] Patch helm defaults in helmfile during CI tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/922563 (owner: 10Giuseppe Lavagetto) [14:12:16] (03CR) 10Hashar: [C: 03+2] wm-patch-demo: initial implementation [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [14:12:18] (03CR) 10Kamila Součková: [C: 03+2] benthos: create image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922142 (https://phabricator.wikimedia.org/T336658) (owner: 10Kamila Součková) [14:12:50] (03CR) 10EoghanGaffney: [C: 03+1] doc: allow gitlab runners to publish docs only through `doc-gitlab` [puppet] - 10https://gerrit.wikimedia.org/r/922834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [14:12:56] (03Merged) 10jenkins-bot: wm-patch-demo: initial implementation [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922605 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [14:13:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41309/console" [puppet] - 10https://gerrit.wikimedia.org/r/922866 (https://phabricator.wikimedia.org/T276623) (owner: 10Jbond) [14:13:43] !log 
hashar@deploy1002 Started deploy [gerrit/gerrit@2d719f3]: wm-patch-demo: initial implementation | T332474 [14:13:47] T332474: [wm-checks-api] Create a new gerrit bot for Patch Demo - https://phabricator.wikimedia.org/T332474 [14:13:49] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@2d719f3]: wm-patch-demo: initial implementation | T332474 (duration: 00m 07s) [14:14:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] udp2log: Add docs, validation and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/922866 (https://phabricator.wikimedia.org/T276623) (owner: 10Jbond) [14:15:12] (03PS1) 10Ottomata: Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes" [puppet] - 10https://gerrit.wikimedia.org/r/922850 [14:15:35] (03CR) 10CI reject: [V: 04-1] Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes" [puppet] - 10https://gerrit.wikimedia.org/r/922850 (owner: 10Ottomata) [14:15:42] (03PS2) 10Ottomata: Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes" [puppet] - 10https://gerrit.wikimedia.org/r/922850 [14:16:05] (03CR) 10CI reject: [V: 04-1] Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes" [puppet] - 10https://gerrit.wikimedia.org/r/922850 (owner: 10Ottomata) [14:16:11] (03CR) 10Effie Mouzeli: [C: 03+1] Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes" [puppet] - 10https://gerrit.wikimedia.org/r/922850 (owner: 10Ottomata) [14:16:13] (03PS3) 10Ottomata: Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key..." 
[puppet] - 10https://gerrit.wikimedia.org/r/922850 [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:37] (03CR) 10CI reject: [V: 04-1] Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key..." [puppet] - 10https://gerrit.wikimedia.org/r/922850 (owner: 10Ottomata) [14:17:29] (03PS4) 10Ottomata: Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key..." [puppet] - 10https://gerrit.wikimedia.org/r/922850 [14:17:56] (03CR) 10Ottomata: [C: 03+2] Revert "Bounce keyholder-proxy when keyholder-auth.d group -> key..." [puppet] - 10https://gerrit.wikimedia.org/r/922850 (owner: 10Ottomata) [14:19:00] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:922838|Enable DiscussionTools newtopictool on fiwiki (T317375)]] (duration: 12m 11s) [14:19:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41310/console" [puppet] - 10https://gerrit.wikimedia.org/r/922867 (https://phabricator.wikimedia.org/T276623) (owner: 10Jbond) [14:19:05] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [14:19:13] (03Merged) 10jenkins-bot: Patch helm defaults in helmfile during CI tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/922563 (owner: 10Giuseppe Lavagetto) [14:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:20:01] (03PS2) 10Stevemunene: Decommission an-worker1058 from hadoop cluster [puppet] - 
10https://gerrit.wikimedia.org/r/922841 (https://phabricator.wikimedia.org/T317861) [14:20:22] (03CR) 10CI reject: [V: 04-1] Decommission an-worker1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/922841 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [14:21:14] (03CR) 10Volans: [C: 03+2] sre.SREBatchRunnerBase: simplify overriding action [cookbooks] - 10https://gerrit.wikimedia.org/r/922511 (owner: 10Volans) [14:22:47] (03PS2) 10Ottomata: Undeploy flink-operator and uncreate service namespace in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922138 (https://phabricator.wikimedia.org/T333464) [14:23:28] (03Merged) 10jenkins-bot: sre.SREBatchRunnerBase: simplify overriding action [cookbooks] - 10https://gerrit.wikimedia.org/r/922511 (owner: 10Volans) [14:23:40] thanks urbanecm [14:23:46] no problem [14:24:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:24:44] (03PS2) 10Effie Mouzeli: varnish: fix call to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/921617 (https://phabricator.wikimedia.org/T337142) (owner: 10Volans) [14:24:48] thank you urbanecm! 
[14:24:51] no problem [14:25:05] (03PS3) 10Stevemunene: Decommission an-worker1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/922841 (https://phabricator.wikimedia.org/T317861) [14:25:45] (03CR) 10Effie Mouzeli: [C: 03+2] varnish: fix call to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/921617 (https://phabricator.wikimedia.org/T337142) (owner: 10Volans) [14:26:11] (03PS9) 10Clément Goubert: mediawiki: Change naming scheme for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) [14:26:34] !log volans@cumin2002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:26:37] !log volans@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:28:01] (03PS6) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:28:47] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922479 (owner: 10EoghanGaffney) [14:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:29:37] !log volans@cumin2002 START - Cookbook sre.puppetboard.restart-reboot rolling restart_daemons on P{puppetboard2002.codfw.wmnet} and (A:puppetboard) [14:30:00] !log volans@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors [14:30:03] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. 
on all recursors [14:30:47] !log volans@cumin2002 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling restart_daemons on P{puppetboard2002.codfw.wmnet} and (A:puppetboard) [14:31:31] (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] benthos: create image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922142 (https://phabricator.wikimedia.org/T336658) (owner: 10Kamila Součková) [14:33:49] hashar: FYI I see "Error while fetching results for wm-patch-demo: TypeError: Failed to fetch" on Gerrit patches where the new JS indicator for CI run progress was [14:34:12] (03CR) 10Volans: [C: 03+2] tests: make it compatible with urllib3 v2.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922869 (owner: 10Volans) [14:34:18] (03CR) 10Volans: [C: 03+2] tox: make it compatible with tox 4.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922870 (owner: 10Volans) [14:34:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:34:45] (03CR) 10EoghanGaffney: [C: 03+2] Remove temporary firewall rule for doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/922479 (owner: 10EoghanGaffney) [14:34:53] (03PS7) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:34:56] (03CR) 10Volans: [C: 03+2] dhcp: cleanup the snippet on refresh failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/920224 (https://phabricator.wikimedia.org/T336696) (owner: 10Volans) [14:35:01] (03CR) 10Jelto: [C: 03+1] "lgtm, I'll test this change on the wmcs test instance" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [14:35:30] (03PS1) 10Jbond: ssh: do not try to ca sign host keys if ca is not available [puppet] - 10https://gerrit.wikimedia.org/r/922871 
(https://phabricator.wikimedia.org/T268344) [14:36:01] (03CR) 10Alexandros Kosiaris: "Couple of answers, I can +1 once we have some comments about the 2 issues left." [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:36:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41315/console" [puppet] - 10https://gerrit.wikimedia.org/r/922871 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [14:37:19] (03CR) 10Jelto: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41316/console" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [14:37:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST csidrivers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:19] (03CR) 10Jbond: "-1: instead of creating a new" [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [14:38:24] (03PS1) 10EoghanGaffney: Remove buster hosts from doc rotation [puppet] - 10https://gerrit.wikimedia.org/r/922872 (https://phabricator.wikimedia.org/T319477) [14:38:38] hashar: seconding volans, I thought it was my network crapping out, but it looks like I'm not alone. Console says it's CORS Missing Allow Origin [14:38:47] (03CR) 10CI reject: [V: 04-1] dhcp: cleanup the snippet on refresh failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/920224 (https://phabricator.wikimedia.org/T336696) (owner: 10Volans) [14:40:00] (03CR) 10Majavah: "I think this will still result in the certificate configuration being added to /etc/ssh/sshd_config, is that an issue?" 
[puppet] - 10https://gerrit.wikimedia.org/r/922871 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [14:40:44] (03Merged) 10jenkins-bot: tests: make it compatible with urllib3 v2.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922869 (owner: 10Volans) [14:41:25] (03Merged) 10jenkins-bot: tox: make it compatible with tox 4.0+ [software/cumin] - 10https://gerrit.wikimedia.org/r/922870 (owner: 10Volans) [14:42:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST csidrivers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:44:18] (03CR) 10Fabfur: "This change is ready for review." 
(035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (owner: 10Fabfur) [14:44:42] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922872 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [14:45:19] (03CR) 10Jbond: [V: 03+1] ssh: do not try to ca sign host keys if ca is not available (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922871 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [14:46:40] (03PS8) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:46:56] (03PS1) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [14:48:08] (03PS1) 10Ayounsi: DHCP: allow transit DHCP from 3rd party relays [homer/public] - 10https://gerrit.wikimedia.org/r/922875 (https://phabricator.wikimedia.org/T337345) [14:48:23] (03CR) 10Dzahn: [C: 03+1] "doc.discovery.wmnet is an alias for doc1003.eqiad.wmnet." 
[puppet] - 10https://gerrit.wikimedia.org/r/922872 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [14:48:41] (03CR) 10Elukey: [C: 03+2] profile::kafka::mirror: add support for PKI certificate [puppet] - 10https://gerrit.wikimedia.org/r/922795 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [14:48:43] (03PS41) 10JMeybohm: Make kubernetes::clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [14:48:45] (03PS5) 10JMeybohm: Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) [14:48:47] (03PS15) 10JMeybohm: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [14:48:49] (03PS4) 10JMeybohm: profile::imagecatalog migrate from user token to client cert [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) [14:48:51] (03PS9) 10JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) [14:48:53] (03PS9) 10JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) [14:49:00] (03CR) 10EoghanGaffney: [C: 03+2] Remove buster hosts from doc rotation [puppet] - 10https://gerrit.wikimedia.org/r/922872 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [14:49:03] (03CR) 10Jbond: "lgtm ill merge now to unblock and send a patch for the comment" [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [14:49:05] (03PS3) 10Effie Mouzeli: ipoid: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/922845 
(https://phabricator.wikimedia.org/T325147) [14:49:07] (03CR) 10Jbond: [C: 03+2] ssh: do not try to ca sign host keys if ca is not available [puppet] - 10https://gerrit.wikimedia.org/r/922581 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [14:49:30] taavi: fyi merging [14:49:38] eoghan: feel free to merge mine [14:49:40] jbond: Happy for me to merge your puppet changes? [14:49:41] Snap. [14:49:48] :) [14:49:51] Landing now [14:49:52] (03CR) 10JMeybohm: Make kubernetes::clusters the central place for k8s config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:49:54] (03PS2) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [14:49:56] thanks [14:50:04] Done [14:50:15] (03CR) 10Ssingh: "PCC still failing:" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:50:44] (03PS3) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [14:51:19] (03CR) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:52:13] (03PS2) 10Ottomata: mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) [14:52:22] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Nice work getting to the bottom of it!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/922875 (https://phabricator.wikimedia.org/T337345) (owner: 10Ayounsi) [14:53:06] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/922845 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [14:54:00] (03CR) 10Ayounsi: [C: 03+2] DHCP: allow transit DHCP from 3rd party relays [homer/public] - 10https://gerrit.wikimedia.org/r/922875 (https://phabricator.wikimedia.org/T337345) (owner: 10Ayounsi) [14:54:15] (03CR) 10Ayounsi: [C: 03+2] DHCP: allow transit DHCP from 3rd party relays (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/922875 (https://phabricator.wikimedia.org/T337345) (owner: 10Ayounsi) [14:54:17] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] DHCP: allow transit DHCP from 3rd party relays [homer/public] - 10https://gerrit.wikimedia.org/r/922875 (https://phabricator.wikimedia.org/T337345) (owner: 10Ayounsi) [14:54:19] (03PS3) 10Effie Mouzeli: admin_ng: Add iPoid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/921704 (https://phabricator.wikimedia.org/T336163) [14:54:37] (03CR) 10Dzahn: [C: 03+1] "This looks good to me. Is it possible to run a httpbb test now to confirm it already works on k8s though?" 
[puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [14:54:39] (03Merged) 10jenkins-bot: DHCP: allow transit DHCP from 3rd party relays [homer/public] - 10https://gerrit.wikimedia.org/r/922875 (https://phabricator.wikimedia.org/T337345) (owner: 10Ayounsi) [14:57:26] (03CR) 10Effie Mouzeli: [C: 03+2] admin_ng: Add iPoid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/921704 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [14:57:33] (03PS1) 10Jbond: puppetmaster: add new function to check for local files [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) [15:00:53] (03CR) 10Jelto: trafficserver: switch annual.wikimedia.org backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [15:03:05] (03CR) 10Vgutierrez: WIP: sre.cdn: Minor fixes and lint (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (owner: 10Fabfur) [15:03:22] (03PS9) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:04:55] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10kamila) [15:05:31] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Create Benthos docker image - https://phabricator.wikimedia.org/T336658 (10kamila) 05In progress→03Resolved Image built and published. 
[15:06:49] (03CR) 10Eevans: cassandra: add support for version 4.1.1 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [15:06:58] (03PS16) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [15:07:26] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [15:07:56] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:08:10] (03CR) 10Kamila Součková: [C: 03+1] thumbor: remove imagemagick pins [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/922803 (owner: 10Hnowlan) [15:08:16] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:09:03] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:09:19] (03PS10) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:09:37] (03PS2) 10Volans: dhcp: cleanup the snippet on refresh failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/920224 (https://phabricator.wikimedia.org/T336696) [15:09:39] (03PS2) 10Volans: dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 [15:09:41] (03PS1) 10Volans: setup.py: limit prospector upper version [software/spicerack] - 10https://gerrit.wikimedia.org/r/922878 [15:10:24] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:10:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:28] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) This [[ https://logstash.wikimedia.org/app/dashboards#/view/f3fefa... 
[15:11:31] (03CR) 10CI reject: [V: 04-1] P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:11:44] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:44] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:52] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:55] (03PS11) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:13:53] (03CR) 10Volans: [C: 03+2] "Unblock CI, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/922878 (owner: 10Volans) [15:14:40] (03PS1) 10Volans: setup.py: update upper limit for prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/922879 [15:15:12] (03CR) 10CI reject: [V: 04-1] P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:17:13] (03PS12) 10Jbond: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:17:43] (03Merged) 10jenkins-bot: setup.py: limit prospector upper version [software/spicerack] - 10https://gerrit.wikimedia.org/r/922878 (owner: 10Volans) [15:17:45] (03Merged) 10jenkins-bot: dhcp: cleanup the snippet on refresh failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/920224 
(https://phabricator.wikimedia.org/T336696) (owner: 10Volans) [15:18:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:18:43] !log analytics-refinery, about to deploy [15:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:53] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41321/console" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [15:21:09] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10MatthewVernon) @dcausse no worries; the account looks (`search:backup` in ms-swift) to have been created in 2014, and is using some storage: ` root@ms-fe100... [15:21:30] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:22:17] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:22:43] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:23:35] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:23:37] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:24:38] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:25:33] !log aqu@deploy1002 Started deploy [analytics/refinery@24ff363]: Regular analytics weekly train [analytics/refinery@24ff363] [15:25:46] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:26:08] (03PS1) 10Hashar: wm-patch-demo: link to other patches [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922882 (https://phabricator.wikimedia.org/T332474) [15:26:09] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:26:34] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:27:34] 10SRE-swift-storage, 10Wikimedia-Site-requests, 10serviceops: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10MatthewVernon) The equivalent account in codfw is empty: ` root@ms-fe2009:~# swift stat Account: AUTH_search Containers: 0 Objects: 0... [15:27:50] (03CR) 10Hashar: "Example: https://phabricator.wikimedia.org/F37031225" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922882 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [15:28:09] (03CR) 10Dzahn: [C: 03+1] "Yep, I just wanted to chat about this because it's a shift from how I am used to test on Ganeti VMs and I noticed what you said, I can't j" [puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [15:28:36] (03CR) 10Hashar: [C: 03+1] Remove buster hosts from doc rotation [puppet] - 10https://gerrit.wikimedia.org/r/922872 (https://phabricator.wikimedia.org/T319477) (owner: 10EoghanGaffney) [15:28:43] (03PS1) 10Urbanecm: Personalized praise: Add instrumentation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922851 (https://phabricator.wikimedia.org/T325117) [15:28:54] (03PS1) 10Urbanecm: Personalized praise: Add instrumentation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922852 (https://phabricator.wikimedia.org/T325117) [15:29:39] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment 
checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) `mw_page_content_change_enrich__dse-k8s-eqiad` is not a valid s3 b... [15:30:09] (03CR) 10Jbond: [C: 03+1] setup.py: update upper limit for prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/922879 (owner: 10Volans) [15:30:22] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:31:26] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Add instrumentation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922852 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [15:31:29] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:31:30] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Add instrumentation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922851 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [15:31:47] !log aqu@deploy1002 Finished deploy [analytics/refinery@24ff363]: Regular analytics weekly train [analytics/refinery@24ff363] (duration: 06m 13s) [15:31:56] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:32:13] (03CR) 10Volans: [C: 03+2] setup.py: update upper limit for prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/922879 (owner: 10Volans) [15:32:19] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:32:23] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi) [15:34:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy FORCED [15:35:06] (03Merged) 10jenkins-bot: setup.py: update upper limit for prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/922879 (owner: 10Volans) [15:35:12] (03PS1) 10Matthias Mullie: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922853 (https://phabricator.wikimedia.org/T322872) [15:35:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1023.mgmt.eqiad.wmnet with reboot policy FORCED [15:37:40] !log aqu@deploy1002 Started deploy [analytics/refinery@24ff363] (thin): Regular analytics weekly train THIN [analytics/refinery@24ff363] [15:37:44] !log aqu@deploy1002 Finished deploy [analytics/refinery@24ff363] (thin): Regular analytics weekly train THIN [analytics/refinery@24ff363] (duration: 00m 04s) [15:37:50] !log aqu@deploy1002 Started deploy [analytics/refinery@24ff363] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@24ff363] [15:38:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922852 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [15:38:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922851 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [15:39:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:25] !log aqu@deploy1002 Finished deploy [analytics/refinery@24ff363] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@24ff363] (duration: 01m 35s) [15:41:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman) @jbond Yes, confirmed, Hamid's contract end date is November 30, 2023. Thanks! [15:42:54] (03CR) 10Jbond: "lgtm, some minor nits" [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (owner: 10Fabfur) [15:44:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:27] (03PS2) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) [15:45:14] (03CR) 10CI reject: [V: 04-1] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: 10Jbond) [15:48:09] (03CR) 10Volans: "Thanks for you first cookbook! Some comments inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (owner: 10Fabfur) [15:50:42] (03CR) 10Dzahn: [C: 03+2] "we forgot about jenkins-releases.devtools.eqiad1.wikimedia.cloud being affected by this. 
puppet failed because it can't change the UID whi" [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:52:13] (03Merged) 10jenkins-bot: Personalized praise: Add instrumentation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922852 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [15:52:16] (03Merged) 10jenkins-bot: Personalized praise: Add instrumentation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922851 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [15:52:46] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:922852|Personalized praise: Add instrumentation (T325117)]], [[gerrit:922851|Personalized praise: Add instrumentation (T325117)]] [15:52:53] T325117: Personalized Praise: Instrumentation - https://phabricator.wikimedia.org/T325117 [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:25] (03CR) 10Dzahn: gerrit: remove lfs_dir parameter, use hardcoded new default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [15:53:36] 10SRE, 10serviceops-collab, 10Release-Engineering-Team (Radar): Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (10thcipriani) [15:54:19] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:922852|Personalized praise: Add instrumentation (T325117)]], [[gerrit:922851|Personalized praise: Add instrumentation (T325117)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet 
[15:54:35] 10SRE, 10serviceops-collab, 10Release-Engineering-Team (Seen): URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (10thcipriani) [15:55:17] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/920765/41322/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [15:56:28] !log move kafka mirror on kafka jumbo brokers to PKI - T337248 [15:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:32] T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 [15:57:14] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "/srv/gerrit/data/lfs should be the path before and after, so noop" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:20] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:922852|Personalized praise: Add instrumentation (T325117)]], [[gerrit:922851|Personalized praise: Add instrumentation (T325117)]] (duration: 08m 33s) [16:01:27] T325117: Personalized Praise: Instrumentation - https://phabricator.wikimedia.org/T325117 [16:01:37] (03CR) 10Hnowlan: [C: 03+2] thumbor: remove imagemagick pins [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/922803 (owner: 10Hnowlan) [16:01:54] (03PS1) 10Dzahn: doc: replace doc1002 with doc1003 in test examples [puppet] - 10https://gerrit.wikimedia.org/r/922893 (https://phabricator.wikimedia.org/T319477) [16:04:59] (03PS2) 10Dzahn: doc: update test example command to use 2 new hosts [puppet] - 
10https://gerrit.wikimedia.org/r/922893 (https://phabricator.wikimedia.org/T319477) [16:05:37] !log move kafka mirror on kafka main brokers to PKI - T337248 [16:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:41] T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 [16:05:52] (03Merged) 10jenkins-bot: thumbor: remove imagemagick pins [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/922803 (owner: 10Hnowlan) [16:07:45] (03CR) 10Dzahn: "@Eoghan did this because I know the decom cookbook will be like "omg, this is still in the repos" if it finds a host name string that it i" [puppet] - 10https://gerrit.wikimedia.org/r/922893 (https://phabricator.wikimedia.org/T319477) (owner: 10Dzahn) [16:12:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10ayounsi) Copying the commit message as it have the RFO and fix details: The modern DHCP implementation on Juniper devices forwards ALL DHCP packets to the co... [16:12:40] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) Rolled out the new keystores to all clusters! Next steps: * Clean up kafka mirror's classes as suggested in https://gerrit.wikime... 
[16:13:27] (03CR) 10Dzahn: "hey Andre, so I just realized reviewing/merging/testing this is a good example for "random" things I do that I should hand-over to someone" [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [16:24:09] (03PS1) 10Andrew Bogott: enforce_policy_scope: false in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/922899 (https://phabricator.wikimedia.org/T330759) [16:24:31] (03CR) 10CI reject: [V: 04-1] enforce_policy_scope: false in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/922899 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:26:11] (03PS2) 10Andrew Bogott: enforce_policy_scope: false in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/922899 (https://phabricator.wikimedia.org/T330759) [16:27:17] (03CR) 10Aklapper: Phabricator monthly email: Improve Differential user activity stats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [16:27:53] (03CR) 10Andrew Bogott: [C: 03+2] enforce_policy_scope: false in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/922899 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:28:40] (03CR) 10Dzahn: "ah, nice! yea, down with Differential!" 
[puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [16:30:55] 10SRE, 10Release-Engineering-Team, 10serviceops, 10Continuous-Integration-Config, 10Test-Coverage: Add pcov PHP extension to wikimedia apt (and upgrade from 1.0.6-4+wmf1~buster1 to 1.0.11) so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Jdforrester-WMF) [16:31:33] 10SRE, 10Release-Engineering-Team, 10serviceops, 10Continuous-Integration-Config, 10Test-Coverage: Add pcov PHP extension to wikimedia apt (and upgrade from 1.0.6-4+wmf1~buster1 to 1.0.11) so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Jdforrester-WMF) >>! In T243847#828... [16:36:12] (03PS17) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [16:36:40] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [16:41:27] (03PS3) 10Dzahn: microsites: remove 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761060 (https://phabricator.wikimedia.org/T300171) [16:46:38] 10SRE, 10SRE-OnFire, 10serviceops-collab, 10Release-Engineering-Team (Radar), 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10thcipriani) [16:46:50] (03PS1) 10Andrew Bogott: enforce_new_policy_defaults: false in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/922904 (https://phabricator.wikimedia.org/T330759) [16:47:24] (03CR) 10Andrew Bogott: [C: 03+2] enforce_new_policy_defaults: false in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/922904 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:49:32] 10SRE, 10SRE-OnFire, 10serviceops-collab, 10Release-Engineering-Team (Radar), 10Sustainability: Remove old scap repositories from 
deploy1002 - https://phabricator.wikimedia.org/T309162 (10hashar) This was long forgotten. The problem is when a `Scap::Target` is removed from Puppet, it is not necessarily c... [16:51:19] (03CR) 10Hashar: jenkins: switch to fixed uid/gid 924 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917919 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [16:54:07] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@1603ecf]: Deploying T336800 on platform_eng Airflow instance [16:54:12] T336800: platform_eng Airflow instance Spark jobs failing after Iceberg changes - https://phabricator.wikimedia.org/T336800 [16:54:16] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@1603ecf]: Deploying T336800 on platform_eng Airflow instance (duration: 00m 09s) [16:55:05] (03PS1) 10Stevemunene: Add the refinery-cache/revs directory to git safe list [puppet] - 10https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) [16:58:00] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: 10Stevemunene) [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1700) [17:04:53] (03PS21) 10Hokwelum: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:12:59] (03PS1) 10Jcrespo: Increase unit test coverage for File, MySQLMedia and MySQLMetadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/922907 (https://phabricator.wikimedia.org/T327157) [17:19:14] (03PS9) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [17:19:37] (03CR) 10CI reject: [V: 04-1] gitlab: use sshkey for git-ssh public keys [puppet] - 
10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [17:20:44] (03PS10) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [17:21:38] (03PS22) 10Hokwelum: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:28:26] (03CR) 10Jelto: gitlab: use sshkey for git-ssh public keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [17:33:04] (03CR) 10Hashar: [C: 03+1] "Yes that looks good. On WMCS Gerrit would be set with the same layout on `/srv` partition on the extended disk space :) Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [17:39:00] (03PS23) 10Hokwelum: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:42:47] (03CR) 10Dzahn: [C: 03+2] microsites: remove 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761060 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:43:52] (03CR) 10Dzahn: [C: 03+2] "This has moved to k8s." [puppet] - 10https://gerrit.wikimedia.org/r/761060 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:47:51] (03CR) 10Dzahn: [C: 03+2] "first deployed on miscweb2003. puppet deletes virtual host, and it's actually gone from running apache ( sudo apache2ctl -S | grep 15)." 
[puppet] - 10https://gerrit.wikimedia.org/r/761060 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:49:06] (ProbeDown) firing: Service miscweb2003:443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:32] (03PS1) 10Jbond: puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) [17:49:50] jouncebot: nowandnext [17:49:50] For the next 0 hour(s) and 10 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1700) [17:49:50] In 0 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1800) [17:49:50] In 0 hour(s) and 10 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1800) [17:50:07] (03CR) 10Dzahn: [C: 03+2] "same on miscweb1003 (prod)." [puppet] - 10https://gerrit.wikimedia.org/r/761060 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:50:10] (03CR) 10CI reject: [V: 04-1] puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [17:52:19] (03PS1) 10Samtar: ipInfo.hooks: Use wgRelevantUserName [extensions/WikimediaMessages] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922854 (https://phabricator.wikimedia.org/T337373) [17:54:58] (03CR) 10Hokwelum: [C: 03+1] "Checks out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:56:34] (03PS2) 10Jbond: puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) [17:57:23] (03PS1) 10Dzahn: microsites: remove http blackbox monitor for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) [17:58:12] (03CR) 10Dzahn: "also see the same thing that will alert once we remove annual.wm.org" [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [18:00:05] ^demon and dancy: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1800). [18:00:05] ^demon and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T1800). nyaa~ [18:00:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Ottomata) Approved! [18:03:56] (03PS2) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) [18:18:11] (03PS1) 10Urbanecm: [Growth] Deploy Personalized praise to pilot wikis with notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922921 (https://phabricator.wikimedia.org/T334630) [18:22:13] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10Volans) With T336485 almost completed, we could consider integrating the two things, getting this one off exported in some place and then have the `sre.... 
[18:25:26] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "so this does change the path on gerrit2002 (replica) but we have established the files are irrelevant on the replica ( https://puppet-comp" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:28:08] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gerrit: remove lfs_dir parameter, use hardcoded new default [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:29:49] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "double checked it was complete noop on gerrit1003 (prod)" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:32:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED [18:34:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [18:35:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "on gerrit2002 - the actual gerrit config was changed by this:" [puppet] - 10https://gerrit.wikimedia.org/r/920765 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [18:36:18] (03PS2) 10Hashar: wm-patch-demo: link to other patches [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922882 (https://phabricator.wikimedia.org/T332474) [18:41:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED [18:41:20] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922922 (https://phabricator.wikimedia.org/T330216) [18:41:22] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922922 (https://phabricator.wikimedia.org/T330216) (owner: 
10TrainBranchBot) [18:42:04] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922922 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [18:43:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [18:47:31] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1023.mgmt.eqiad.wmnet with reboot policy FORCED [18:48:09] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1024.mgmt.eqiad.wmnet with reboot policy FORCED [18:48:10] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1025.mgmt.eqiad.wmnet with reboot policy FORCED [18:49:12] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.10 refs T330216 [18:49:16] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [18:50:27] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) [18:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:54:28] (03PS2) 10Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) [18:55:12] !log demon@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.10 refs T330216 (duration: 06m 00s) [18:55:17] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [18:55:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:55:53] (03CR) 10Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:56:23] (03CR) 10Dzahn: "you already confirmed this can be removed, you just wanted the file system" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:58:12] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-main_4492: Servers kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:58:31] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922923 
(https://phabricator.wikimedia.org/T330216) [18:58:33] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922923 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [18:59:17] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922923 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [19:06:19] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.9 refs T330216 [19:06:25] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [19:06:54] (03PS2) 10Dzahn: microsites: remove http blackbox monitor for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) [19:08:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:09:40] (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/T337424" [puppet] - 10https://gerrit.wikimedia.org/r/761060 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [19:10:02] (03CR) 10Gmodena: "This change is ready for review." 
(033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [19:10:09] (03CR) 10Dzahn: "resolves https://phabricator.wikimedia.org/T337424" [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [19:11:40] (03PS4) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:12:20] !log demon@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.9 refs T330216 (duration: 06m 00s) [19:12:25] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [19:15:10] (03PS5) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:22:07] (03PS6) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:24:06] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:24:09] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:26:53] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (Radar), 10Test-Coverage: Add pcov PHP extension to wikimedia apt (and upgrade from 1.0.6-4+wmf1~buster1 to 1.0.11) so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10hashar) [19:29:11] (03PS1) 
10CDanis: add kerberos for manuel-wmde [puppet] - 10https://gerrit.wikimedia.org/r/922926 (https://phabricator.wikimedia.org/T336841) [19:30:49] (03CR) 10CDanis: [C: 03+2] add kerberos for manuel-wmde [puppet] - 10https://gerrit.wikimedia.org/r/922926 (https://phabricator.wikimedia.org/T336841) (owner: 10CDanis) [19:31:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:31:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wcqs2003:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:32:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10CDanis) 05Open→03Resolved Apologies for omitting Kerberos originally! You should now have an email with instructions on how to configure it. [19:32:03] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:33:43] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/921105 (https://phabricator.wikimedia.org/T336701) (owner: 10Dzahn) [19:33:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:34:16] (03PS7) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:35:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1024.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1025.mgmt.eqiad.wmnet with reboot policy FORCED [19:36:07] (03PS8) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:36:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:36:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:37:03] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind 
applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:37:15] (03CR) 10Gmodena: mw-page-content-change-enrich: revert checkpoint dir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [19:38:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10ayounsi) My initial guess was to add them to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common.y... [19:43:33] (03PS9) 10Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:44:08] (03CR) 10Ayounsi: Add class-of-service parent interface shaper for sub-rated services (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney) [19:45:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [19:49:54] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED [19:49:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:18] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Jclark-ctr) 
@ayounsi the provisioning script is still failing in row e/f. dbproxy1026 dbproxy1027 [19:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (3) wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:55:05] (03PS10) 10Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [19:56:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:56:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230524T2000). [20:00:04] kimberly_sarabia and TheresNoTime: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] * TheresNoTime can deploy [20:00:55] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922854 (https://phabricator.wikimedia.org/T337373) (owner: 10Samtar) [20:01:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (5) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:03:10] (03CR) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney) [20:03:23] I am waiting for kimberly_sarabia so https://gerrit.wikimedia.org/r/922564's merge conflict can be resolved :) jan_drewniak - courtesy ping as I know you were involved in the last patch [20:06:18] (03CR) 10CDanis: [C: 03+1] "Implementation looks good to me. 
Probably we should also add a helper command to lock out the system -- possibly a flag or subcommand tha" [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [20:06:56] TheresNoTime: hey, sorry for the wait, but we’re not going to be deploying that one today (forgot to take it off the schedule) [20:07:04] no worries :) [20:08:26] TheresNoTime: quiet night for you :) [20:08:50] *not the Q word..!* [20:08:56] !log ayounsi@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [20:10:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922854 (https://phabricator.wikimedia.org/T337373) (owner: 10Samtar) [20:14:30] (03PS1) 10Samtar: ipInfo.hooks: Use wgRelevantUserName [extensions/WikimediaMessages] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922855 (https://phabricator.wikimedia.org/T337373) [20:15:01] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [20:16:02] (03Merged) 10jenkins-bot: ipInfo.hooks: Use wgRelevantUserName [extensions/WikimediaMessages] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/922854 (https://phabricator.wikimedia.org/T337373) (owner: 10Samtar) [20:16:36] !log samtar@deploy1002 Started scap: Backport for [[gerrit:922854|ipInfo.hooks: Use wgRelevantUserName (T337373)]] [20:16:40] T337373: IP information tool: url's to global contributions and XTools do not work but link to user "null" - https://phabricator.wikimedia.org/T337373 [20:18:05] !log samtar@deploy1002 samtar: Backport for [[gerrit:922854|ipInfo.hooks: Use wgRelevantUserName (T337373)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:18:06] * TheresNoTime 
testing [20:19:23] * TheresNoTime syncing [20:23:28] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:07] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:922854|ipInfo.hooks: Use wgRelevantUserName (T337373)]] (duration: 08m 31s) [20:25:12] T337373: IP information tool: url's to global contributions and XTools do not work but link to user "null" - https://phabricator.wikimedia.org/T337373 [20:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:26:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:27:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:31:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922855 (https://phabricator.wikimedia.org/T337373) (owner: 10Samtar) [20:39:14] TheresNoTime: hi, would you mind pinging me once done? 
[20:39:32] urbanecm: sure, just waiting on 922855
[20:39:38] awesome
[20:46:35] (Merged) jenkins-bot: ipInfo.hooks: Use wgRelevantUserName [extensions/WikimediaMessages] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/922855 (https://phabricator.wikimedia.org/T337373) (owner: Samtar)
[20:47:04] !log samtar@deploy1002 Started scap: Backport for [[gerrit:922855|ipInfo.hooks: Use wgRelevantUserName (T337373)]]
[20:47:10] T337373: IP information tool: url's to global contributions and XTools do not work but link to user "null" - https://phabricator.wikimedia.org/T337373
[20:48:34] !log samtar@deploy1002 samtar: Backport for [[gerrit:922855|ipInfo.hooks: Use wgRelevantUserName (T337373)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:48:44] * TheresNoTime testing
[20:49:33] * TheresNoTime syncing
[20:53:37] (CR) Ottomata: mw-page-content-change-enrich: enable checkpointing (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: Gmodena)
[20:55:20] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:922855|ipInfo.hooks: Use wgRelevantUserName (T337373)]] (duration: 08m 15s)
[20:55:25] T337373: IP information tool: url's to global contributions and XTools do not work but link to user "null" - https://phabricator.wikimedia.org/T337373
[20:56:05] urbanecm: all yours
[20:56:10] thanks
[20:56:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:56:35] (PS2) Urbanecm: [Growth] Deploy Personalized praise to pilot wikis with notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/922921 (https://phabricator.wikimedia.org/T334630)
[20:56:37] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/922921 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm)
[21:01:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:07:07] (CR) JHathaway: puppet-merge: implement Lock out, tag out (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: Jbond)
[21:08:22] okay, gerrit didn't automerge it. submitted, continuing.
[21:08:39] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:922921|[Growth] Deploy Personalized praise to pilot wikis with notifications (T334630)]]
[21:08:44] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[21:10:10] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:922921|[Growth] Deploy Personalized praise to pilot wikis with notifications (T334630)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:16:48] SRE, ops-codfw, Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (Jhancock.wm) Open→Resolved Nothing unusual in the logs. gonna close this one up. thank you too!
[21:18:20] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:922921|[Growth] Deploy Personalized praise to pilot wikis with notifications (T334630)]] (duration: 09m 40s)
[21:18:24] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[21:18:26] * urbanecm done
[21:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:19:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:20:21] (PS1) Cwhite: profile: remove varnish-aggregate-client-status-codes resource [puppet] - https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196)
[21:20:38] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:15] (PS2) Cwhite: profile: ensure varnish-aggregate-client-status-codes absent [puppet] - https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196)
[21:21:35] (CR) CI reject: [V: -1] profile: ensure varnish-aggregate-client-status-codes absent [puppet] - https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196) (owner: Cwhite)
[21:23:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:23:02] (PS3) Cwhite: profile: ensure varnish-aggregate-client-status-codes absent [puppet] - https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196)
[21:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:24:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:26:30] SRE, ops-codfw, decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (Jhancock.wm)
[21:26:32] (CR) Cwhite: [C: +1] prometheus: de-provision 'global' instance [puppet] - https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[21:26:43] SRE, ops-codfw, decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (Jhancock.wm) Open→Resolved
[21:29:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:33:18] (CR) JHathaway: "This pattern seems a bit strange, are there any other options? Would exported resources work? Could the puppet server export the host key " [puppet] - https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: Jbond)
[22:09:49] (CR) EoghanGaffney: [C: +1] "Good catch!" [puppet] - https://gerrit.wikimedia.org/r/922893 (https://phabricator.wikimedia.org/T319477) (owner: Dzahn)
[22:50:30] (PS3) Samtar: going through the tox as stated in the readme T337044 [mediawiki-config] - https://gerrit.wikimedia.org/r/921610 (owner: Robertsky)
[23:44:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:45:52] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:47] (CR) Dzahn: [C: +2] doc: update test example command to use 2 new hosts [puppet] - https://gerrit.wikimedia.org/r/922893 (https://phabricator.wikimedia.org/T319477) (owner: Dzahn)
[23:51:34] (CR) Dzahn: [C: +2] "comments only" [puppet] - https://gerrit.wikimedia.org/r/922893 (https://phabricator.wikimedia.org/T319477) (owner: Dzahn)