[00:08:25] (CR) Pppery: "I finally had a chance to test the testable part of this properly. Steps to test:" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery)
[00:21:54] ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T360862 (ops-monitoring-bot) NEW
[00:32:24] (PS1) Tim Starling: block: Fix exception in ApiQueryBlocks when specified users are not blocked [core] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1013634 (https://phabricator.wikimedia.org/T360088)
[00:38:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1013362
[00:38:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1013362 (owner: TrainBranchBot)
[00:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:56:26] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1002 using scap backport" [core] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1013634 (https://phabricator.wikimedia.org/T360088) (owner: Tim Starling)
[01:03:14] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1013362 (owner: TrainBranchBot)
[01:16:06] (Merged) jenkins-bot: block: Fix exception in ApiQueryBlocks when specified users are not blocked [core] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1013634 (https://phabricator.wikimedia.org/T360088) (owner: Tim Starling)
[01:16:34] !log tstarling@deploy1002 Started scap: Backport for [[gerrit:1013634|block: Fix exception in ApiQueryBlocks when specified users are not blocked (T360088)]]
[01:16:38] T360088: Slow query in ApiQueryBlocks with new schema - https://phabricator.wikimedia.org/T360088
[01:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:33:06] !log tstarling@deploy1002 tstarling: Backport for [[gerrit:1013634|block: Fix exception in ApiQueryBlocks when specified users are not blocked (T360088)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:33:10] T360088: Slow query in ApiQueryBlocks with new schema - https://phabricator.wikimedia.org/T360088
[01:33:57] !log tstarling@deploy1002 tstarling: Continuing with sync
[01:45:25] !log tstarling@deploy1002 Finished scap: Backport for [[gerrit:1013634|block: Fix exception in ApiQueryBlocks when specified users are not blocked (T360088)]] (duration: 28m 51s)
[01:45:29] T360088: Slow query in ApiQueryBlocks with new schema - https://phabricator.wikimedia.org/T360088
[01:48:00] (PS1) Tim Starling: Switch block schema to read-new/write-both mode (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013635 (https://phabricator.wikimedia.org/T355034)
[01:48:43] (PS2) Tim Starling: Switch block schema to read-new/write-both mode (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013635 (https://phabricator.wikimedia.org/T355034)
[01:50:36] (CR) Tim Starling: [C:+2] Switch block schema to read-new/write-both mode (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013635 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[01:51:18] (Merged) jenkins-bot: Switch block schema to read-new/write-both mode (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013635 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[02:06:31] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: Switch block schema to read-new/write-both mode T355034 (duration: 12m 53s)
[02:06:35] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034
[02:37:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:14:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:45] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:20:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:41:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:57:30] (ProbeDown) firing: (4) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:02:30] (ProbeDown) firing: (4) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:39:02] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[07:39:52] (CR) Brouberol: [C:+2] Superset: migrate external services egress to Calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1009290 (https://phabricator.wikimedia.org/T359411) (owner: Brouberol)
[07:42:23] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[07:42:28] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[07:45:32] (PS2) Urbanecm: Add CommunityConfiguration extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766)
[07:45:41] (PS2) Urbanecm: Add wmgUseCommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766)
[08:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T0800).
[08:00:05] Jhs, Ammar, and Ammar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:16:12] Amir1, urbanecm i'm present (a bit late, forgot about this)
[08:17:15] (PS3) Jon Harald Søby: Remove Nearby extension and Minerva donate button for nowikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1013564 (https://phabricator.wikimedia.org/T360782)
[08:17:49] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1013585 (https://phabricator.wikimedia.org/T358559) (owner: EoghanGaffney)
[08:18:30] Jhs: I will deploy :)
[08:18:37] hello
[08:18:42] hi!
[08:20:21] that looks straightforward to me
[08:20:30] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1013564 (https://phabricator.wikimedia.org/T360782) (owner: Jon Harald Søby)
[08:20:46] ops-eqiad, SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T360862#9656692 (fgiunchedi) @Jclark-ctr it looks like one of the new SSDs from {T359452} isn't happy, I've located the drive so it should be blinking; could we replace it ASAP? please ping me on IRC when you can, thank y...
[08:21:04] which also leads me to believe all chapters wikis should have nearby and the donation link disabled
[08:21:05] :)
[08:21:47] (Merged) jenkins-bot: Remove Nearby extension and Minerva donate button for nowikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1013564 (https://phabricator.wikimedia.org/T360782) (owner: Jon Harald Søby)
[08:22:09] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1013564|Remove Nearby extension and Minerva donate button for nowikimedia (T360782 T360783)]]
[08:22:14] T360782: Disable Nearby extension on no.wikimedia.org - https://phabricator.wikimedia.org/T360782
[08:22:14] T360783: Disable MinervaDonateLink on no.wikimedia.org - https://phabricator.wikimedia.org/T360783
[08:25:17] of course it is failing
[08:25:17] pff
[08:25:24] (PS1) Brouberol: superset: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1013947
[08:25:45] timeout parsing Barack Obama on mw2001 and mw2002
[08:26:02] (PS2) Brouberol: superset: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1013947 (https://phabricator.wikimedia.org/T359411)
[08:26:07] ERRORS: 128 requests attempted to each of 4 hosts. Errors connecting to 2 hosts.
[08:29:02] holy hell
[08:29:07] (CR) Ayounsi: [C:+2] Routed Ganeti: use per tap interface dhcrelay [puppet] - https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:29:16] (CR) Filippo Giunchedi: [C:-1] "See inline, LGTM overall though" [puppet] - https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: Dzahn)
[08:29:27] (CR) Ayounsi: [C:+2] Routed Ganeti: use per tap interface dhcrelay (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:29:34] !log hashar@deploy1002 jhsoby and hashar: Backport for [[gerrit:1013564|Remove Nearby extension and Minerva donate button for nowikimedia (T360782 T360783)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:29:39] !log hashar@deploy1002 jhsoby and hashar: Continuing with sync
[08:29:42] T360782: Disable Nearby extension on no.wikimedia.org - https://phabricator.wikimedia.org/T360782
[08:29:44] T360783: Disable MinervaDonateLink on no.wikimedia.org - https://phabricator.wikimedia.org/T360783
[08:29:52] hashar, works as expected on mwdebug1001
[08:30:07] yeah that is quite magic :)
[08:30:29] hehe
[08:31:12] ah I have found the issue
[08:31:22] read timeout=10 versus "Parsing Barack Obama was slow, took 12.71 seconds"
[08:31:36] no idea why one thought it would be a good idea to parse that page :)
[08:31:37] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit update
[08:31:42] anyway, I will file it
[08:31:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit update
[08:32:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit update
[08:32:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit update
[08:36:01] (CR) Brouberol: [C:+1] role::aqs: deploy the PKI-enabled TLS bundle and use it on aqs1010 [puppet] - https://gerrit.wikimedia.org/r/1013566 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[08:36:02] SRE-tools, DC-Ops, Infrastructure-Foundations, observability: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9656714 (jcrespo)
[08:37:16] (CR) JMeybohm: [C:+1] role::docker_registry_ha::registry: increase tmpfs size in eqiad [puppet] - https://gerrit.wikimedia.org/r/1013541 (https://phabricator.wikimedia.org/T360637) (owner: Elukey)
[08:37:24] (CR) Filippo Giunchedi: [C:+2] prometheus: scrape envoy on k8s metrics with 'usedonly' (take #2) [puppet] - https://gerrit.wikimedia.org/r/1013515 (https://phabricator.wikimedia.org/T359633) (owner: Filippo Giunchedi)
[08:38:49] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet
[08:38:50] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[08:39:20] T360867
[08:39:20] T360867: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867
[08:39:37] Jhs: thank you for the patch, I guess you can resolve both tasks as a result :)
[08:39:48] (it is still deploying=)
[08:40:44] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:40:48] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1013564|Remove Nearby extension and Minerva donate button for nowikimedia (T360782 T360783)]] (duration: 18m 38s)
[08:40:54] T360782: Disable Nearby extension on no.wikimedia.org - https://phabricator.wikimedia.org/T360782
[08:40:55] T360783: Disable MinervaDonateLink on no.wikimedia.org - https://phabricator.wikimedia.org/T360783
[08:41:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:41:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:41:34] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors
[08:41:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors
[08:42:04] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:42:38] hashar, thanks!
[08:42:55] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002"
[08:46:13] (PS2) Hashar: Set wgUploadNavigationUrl for is.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1013295 (https://phabricator.wikimedia.org/T360431) (owner: Ammarpad)
[08:48:07] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1013295 (https://phabricator.wikimedia.org/T360431) (owner: Ammarpad)
[08:48:39] (PS1) Giuseppe Lavagetto: Alert separately for api gateway backend errors [alerts] - https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597)
[08:49:30] (Merged) jenkins-bot: Set wgUploadNavigationUrl for is.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1013295 (https://phabricator.wikimedia.org/T360431) (owner: Ammarpad)
[08:49:47] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1013295|Set wgUploadNavigationUrl for is.wikibooks (T360431)]]
[08:49:51] T360431: Set wgUploadNavigationUrl for is.wikibooks - https://phabricator.wikimedia.org/T360431
[08:50:28] (CR) CI reject: [V:-1] Alert separately for api gateway backend errors [alerts] - https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597) (owner: Giuseppe Lavagetto)
[08:52:07] !log hashar@deploy1002 ammarpad and hashar: Backport for [[gerrit:1013295|Set wgUploadNavigationUrl for is.wikibooks (T360431)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:52:13] !log hashar@deploy1002 ammarpad and hashar: Continuing with sync
[08:53:05] (PS3) Urbanecm: Add CommunityConfiguration extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766)
[08:53:08] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm
[08:53:09] (PS2) Urbanecm: [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766)
[08:53:13] (PS3) Urbanecm: Add wmgUseCommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766)
[08:53:16] (PS2) Urbanecm: [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766)
[08:53:19] (PS3) Urbanecm: [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766)
[08:53:23] (PS3) Urbanecm: [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766)
[08:53:41] (CR) Sergio Gimeno: [C:+1] Add CommunityConfiguration extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) (owner: Urbanecm)
[08:53:47] (CR) Sergio Gimeno: [C:+1] Add wmgUseCommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) (owner: Urbanecm)
[08:53:55] (CR) Sergio Gimeno: [C:+1] [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766) (owner: Urbanecm)
[08:54:01] (CR) Sergio Gimeno: [C:+1] [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) (owner: Urbanecm)
[08:54:30] (CR) Btullis: [C:+1] "Looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/1013947 (https://phabricator.wikimedia.org/T359411) (owner: Brouberol)
[08:54:40] (CR) Brouberol: [C:+2] superset: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1013947 (https://phabricator.wikimedia.org/T359411) (owner: Brouberol)
[08:56:54] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[08:56:57] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[08:58:02] (CR) Hashar: [C:+1] throttle: Add throttle rule for editathon at Illinois Tech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: Ammarpad)
[08:58:05] SRE, Epic: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9656748 (jcrespo)
[08:59:45] (CR) Ammarpad: throttle: Add throttle rule for editathon at Illinois Tech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: Ammarpad)
[09:00:04] hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T0900)
[09:00:09] (CR) Hashar: [C:+1] "Note that per the inline comment in `wmf-config/throttle.php` and from https://wikitech.wikimedia.org/wiki/Increasing_account_creation_thr" [mediawiki-config] - https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: Ammarpad)
[09:03:17] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1013295|Set wgUploadNavigationUrl for is.wikibooks (T360431)]] (duration: 13m 29s)
[09:03:21] T360431: Set wgUploadNavigationUrl for is.wikibooks - https://phabricator.wikimedia.org/T360431
[09:03:23] I have moved the deployment of the throttle to this afternoon
[09:04:48] SRE, Data-Platform-SRE (2024.03.25 - 2024.04.14): Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9656751 (jcrespo) @Gehel Any update since Feb? Everything progressing ok? Still no dependency/request? Otherwise I will mark it as acked from our side.
[09:05:01] (CR) Ammarpad: "Ack." [mediawiki-config] - https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: Ammarpad)
[09:05:32] hashar OK, thank you
[09:15:37] (CR) Jelto: [C:+1] "looks good to me beside unhappy jenkins. I like splitting the generic backend error alert by envoy cluster name. Queries look good as far " [alerts] - https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597) (owner: Giuseppe Lavagetto)
[09:16:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9656784 (jcrespo) @MPhamWMF Please read carefully and, if agreed, sign L3 to proceed with the access request. Even if you are not asking for ssh access, there is a "Handling se...
[09:22:59] (PS1) Slyngshede: IDP: Switch to new Bookworm host. [dns] - https://gerrit.wikimedia.org/r/1013949 (https://phabricator.wikimedia.org/T357748)
[09:23:27] (PS1) Brouberol: rbac: allow deploy users to perform actions on Calico NetworkPolicies [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894)
[09:27:15] (CR) JMeybohm: rbac: allow deploy users to perform actions on Calico NetworkPolicies (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:29:26] (PS2) Brouberol: rbac: allow deploy users to perform actions on Calico NetworkPolicies [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894)
[09:30:31] (CR) Brouberol: rbac: allow deploy users to perform actions on Calico NetworkPolicies (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:32:07] !log Cancelling the Gerrit 3.8 upgrade
[09:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:15] (PS3) Brouberol: rbac: allow deploy users to perform actions on Calico NetworkPolicies [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894)
[09:34:36] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[09:39:13] (CR) JMeybohm: [C:+1] rbac: allow deploy users to perform actions on Calico NetworkPolicies [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:39:22] (CR) Brouberol: [C:+2] rbac: allow deploy users to perform actions on Calico NetworkPolicies [deployment-charts] - https://gerrit.wikimedia.org/r/1013950 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:41:15] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:41:21] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:42:37] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[09:42:50] (PS1) Hashar: Merge tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886)
[09:43:05] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[09:43:31] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:43:49] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:43:59] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:44:21] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage
[09:44:27] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:44:38] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:46:26] (CR) EoghanGaffney: [C:+2] [gitlab] Move backup script locking out of main script root [puppet] - https://gerrit.wikimedia.org/r/1013585 (https://phabricator.wikimedia.org/T358559) (owner: EoghanGaffney)
[09:47:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage
[09:48:44] (CR) CI reject: [V:-1] Merge tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: Hashar)
[09:49:39] (PS1) Brouberol: external_services: assume the feature is disabled by default [deployment-charts] - https://gerrit.wikimedia.org/r/1013954 (https://phabricator.wikimedia.org/T331894)
[09:54:28] (CR) Filippo Giunchedi: [C:+1] "What Jelto said 😊" [alerts] - https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597) (owner: Giuseppe Lavagetto)
[09:55:48] (CR) JMeybohm: [C:+2] external_services: assume the feature is disabled by default [deployment-charts] - https://gerrit.wikimedia.org/r/1013954 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:56:15] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9656852 (jcrespo) Also, and as a recommendation, but not a hard requirement, please consider linking your LDAP account to your Phabricator account for faster admin request mana...
[09:56:56] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9656853 (jcrespo)
[09:57:11] (PS1) Btullis: Update the ssl_provider for the YARN ui to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013955 (https://phabricator.wikimedia.org/T360412)
[09:57:13] (PS1) Btullis: Update the ssl_provider for the 2cwanalytics webserver to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013956 (https://phabricator.wikimedia.org/T360412)
[09:57:14] (PS1) Btullis: Update the ssl_provider for matomo to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013957 (https://phabricator.wikimedia.org/T360412)
[09:57:16] (PS1) Btullis: Update the ssl_provider for turnilo to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412)
[09:57:18] (PS1) Btullis: Update the ssl_provider for hue to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412)
[09:57:22] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:57:26] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:57:33] (PS1) JMeybohm: Update admin_ng/_example_/helmfile.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1013961
[09:57:34] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:58:00] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9656860 (MPhamWMF) Thanks @jcrespo . Andre helped me reset my MFA and I was able to sign the L3, and link my LDAP account.
[09:58:26] (PS2) Btullis: Update the ssl_provider for the analytics webserver to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013956 (https://phabricator.wikimedia.org/T360412)
[09:58:26] (PS2) Btullis: Update the ssl_provider for matomo to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013957 (https://phabricator.wikimedia.org/T360412)
[09:58:26] (PS2) Btullis: Update the ssl_provider for turnilo to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412)
[09:58:26] (PS2) Btullis: Update the ssl_provider for hue to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412)
[09:58:49] (CR) Giuseppe Lavagetto: [C:+1] restbase: Start moving mwapi calls to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213) (owner: Clément Goubert)
[09:58:50] (Merged) jenkins-bot: external_services: assume the feature is disabled by default [deployment-charts] - https://gerrit.wikimedia.org/r/1013954 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[10:00:02] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:00:27] !log brouberol@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:00:39] !log brouberol@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:00:41] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:01:17] !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:01:29] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:01:40] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:01:52] !log brouberol@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:01:52] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:01:59] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:02:00] !log brouberol@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:02:07] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:02:55] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9656867 (10jcrespo) [10:07:00] (03CR) 10Paladox: Merge tag 'v3.8.4' into wmf/stable-3.8 (031 comment) [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [10:09:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9656894 (10jcrespo) p:05Triage→03High I can see it, thank you! So now only pending approval from #data-engineering 's list of people that can approve that access: @odimitrije... [10:10:09] (03PS2) 10Hashar: Merge tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) [10:11:13] (03PS2) 10Jcrespo: Add fabfur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1013529 (https://phabricator.wikimedia.org/T359561) (owner: 10Btullis) [10:12:07] (03CR) 10Hashar: Merge tag 'v3.8.4' into wmf/stable-3.8 (032 comments) [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [10:14:31] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9656907 (10jcrespo) I can take over unless @BTullis or @Fabfur wants to deploy (?). 
[10:16:48] (03CR) 10CI reject: [V:04-1] Merge tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [10:20:01] (03CR) 10Majavah: [C:03+2] dynamicproxy: use http 1.1 for backend connections [puppet] - 10https://gerrit.wikimedia.org/r/1012728 (https://phabricator.wikimedia.org/T354116) (owner: 10Majavah) [10:21:12] (03CR) 10Jcrespo: [C:03+1] Add fabfur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1013529 (https://phabricator.wikimedia.org/T359561) (owner: 10Btullis) [10:23:29] (03CR) 10Kamila Součková: [C:03+1] kubernetes: move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1013536 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:23:57] (03PS2) 10Giuseppe Lavagetto: Alert separately for api gateway backend errors [alerts] - 10https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597) [10:24:26] (03PS3) 10Hashar: Merge tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) [10:24:30] (03CR) 10Hashar: "I have forgot to `git add plugins/rename-project` ..." 
[software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [10:25:15] !log Depooling mw2336.codfw.wmnet,mw2337.codfw.wmnet,mw2386.codfw.wmnet,mw2387.codfw.wmnet,mw2388.codfw.wmnet,mw2389.codfw.wmnet - T351074 [10:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:19] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:27:53] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9656943 (10jcrespo) Reminder to @Jrbranaa that this is blocked on getting an answer to the question in the previous comment. As an addendum please note that modifying th... [10:28:01] (03CR) 10Giuseppe Lavagetto: [C:03+2] Alert separately for api gateway backend errors [alerts] - 10https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597) (owner: 10Giuseppe Lavagetto) [10:29:06] (03Merged) 10jenkins-bot: Alert separately for api gateway backend errors [alerts] - 10https://gerrit.wikimedia.org/r/1013948 (https://phabricator.wikimedia.org/T360597) (owner: 10Giuseppe Lavagetto) [10:36:25] (03CR) 10Clément Goubert: [C:03+2] kubernetes: move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1013536 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:37:21] (03CR) 10Btullis: [C:03+2] Add fabfur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1013529 (https://phabricator.wikimedia.org/T359561) (owner: 10Btullis) [10:37:52] (03PS1) 10Majavah: P:wmcs::metricsinfra: reduce repeat interval [puppet] - 10https://gerrit.wikimedia.org/r/1013963 [10:38:10] (03PS2) 10Majavah: P:wmcs::metricsinfra: increase repeat interval [puppet] - 10https://gerrit.wikimedia.org/r/1013963 [10:38:21] (03PS1) 10Brouberol: global_config: fix druid and presto 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/1013964 (https://phabricator.wikimedia.org/T331894) [10:41:45] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2336.codfw.wmnet with OS bullseye [10:42:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2337.codfw.wmnet with OS bullseye [10:42:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2386.codfw.wmnet with OS bullseye [10:42:51] (03CR) 10FNegri: "I thought about this a few times, but I'm not sure what's the rationale behind the default value. I mean: what's the downside? is there an" [puppet] - 10https://gerrit.wikimedia.org/r/1013963 (owner: 10Majavah) [10:42:59] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968#9656993 (10jcrespo) 05Open→03Stalled Hola, @darthmon_wmde ! A ver si conseguimos cerrar esto de una vez por todas :-D. Si podrías actualizar la clave con tu cuenta con una edi... [10:43:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2387.codfw.wmnet with OS bullseye [10:43:18] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Add user fabfur to analytics-privatedata-users - https://phabricator.wikimedia.org/T359561#9656989 (10BTullis) Thanks for the kind offer @jcrespo - I've picked this up again now. The puppet change is now depl... 
[10:43:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2388.codfw.wmnet with OS bullseye [10:43:55] (03CR) 10Majavah: "In my mind someone either spots the initial alert on irc/email when it first comes, or someone sees it on the alertmanager dashboard, so I" [puppet] - 10https://gerrit.wikimedia.org/r/1013963 (owner: 10Majavah) [10:44:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2389.codfw.wmnet with OS bullseye [10:44:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1694/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013956 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [10:44:30] (03PS1) 10Brouberol: Fix pod label selector for external-services network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013965 (https://phabricator.wikimedia.org/T331894) [10:44:52] (03PS2) 10Fabfur: benthos: enable benthos instance on upload host (cp4045) [puppet] - 10https://gerrit.wikimedia.org/r/1013526 (https://phabricator.wikimedia.org/T358109) [10:45:45] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [10:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1695/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013955 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [10:46:36] (03CR) 10Brouberol: [C:03+2] Update admin_ng/_example_/helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013961 (owner: 10JMeybohm) [10:46:43] (03CR) 10Brouberol: [C:03+1] Update admin_ng/_example_/helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013961 (owner: 10JMeybohm) [10:46:53] (03CR) 10Btullis: 
[V:03+1 C:03+2] Use a routable sender address for email from Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1011342 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [10:47:02] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1013526 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:47:32] jouncebot: nowandnext [10:47:33] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [10:47:33] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1100) [10:47:44] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657017 (10Clement_Goubert) 05Open→03In progress [10:48:54] (03PS3) 10Ladsgroup: Set four more wikis to read new in pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013010 (https://phabricator.wikimedia.org/T351237) [10:49:06] (03CR) 10Ladsgroup: [C:03+2] Set four more wikis to read new in pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013010 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [10:49:15] (03CR) 10Btullis: [C:03+1] "Got it, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1013964 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [10:49:21] (03PS1) 10Brouberol: Superset: Fix pod label selector for external-services networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013966 [10:49:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013010 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [10:49:47] (03Merged) 10jenkins-bot: Set four more wikis to read new in pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013010 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [10:50:06] (03CR) 10Btullis: [C:03+1] "Good catch." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013965 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [10:50:06] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1013010|Set four more wikis to read new in pagelinks migration (T351237)]] [10:50:10] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [10:50:53] (03CR) 10FNegri: "The default is 4h according to the prometheus docs [1]" [puppet] - 10https://gerrit.wikimedia.org/r/1013963 (owner: 10Majavah) [10:51:15] (03PS2) 10Brouberol: global_config: fix druid and presto configuration [puppet] - 10https://gerrit.wikimedia.org/r/1013964 (https://phabricator.wikimedia.org/T331894) [10:51:21] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013964 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [10:52:32] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1013010|Set four more wikis to read new in pagelinks migration (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:52:44] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [10:53:28] (03PS3) 10Majavah: 
P:wmcs::metricsinfra: increase repeat interval [puppet] - 10https://gerrit.wikimedia.org/r/1013963 [10:54:23] (03CR) 10Majavah: "24h seems ok to me as a start. I don't think we need to split it for different values for IRC/email." [puppet] - 10https://gerrit.wikimedia.org/r/1013963 (owner: 10Majavah) [10:55:35] (03CR) 10Brouberol: [C:03+2] global_config: fix druid and presto configuration [puppet] - 10https://gerrit.wikimedia.org/r/1013964 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [10:56:21] (03PS1) 10Fabfur: haproxy: fixed typo in log-format [puppet] - 10https://gerrit.wikimedia.org/r/1013967 (https://phabricator.wikimedia.org/T358109) [10:56:35] (03PS10) 10TheDJ: Remove X-Webkit-CSP-Report-Only response header from foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) [10:57:25] (SystemdUnitFailed) firing: ferm.service on mw2357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:57:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2386.codfw.wmnet with reason: host reimage [10:57:52] (03CR) 10Brouberol: [C:03+2] Fix pod label selector for external-services network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013965 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [10:57:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2336.codfw.wmnet with reason: host reimage [10:57:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2337.codfw.wmnet with reason: host reimage [10:58:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2387.codfw.wmnet with reason: host reimage [10:58:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2388.codfw.wmnet with reason: 
host reimage [10:58:54] (03CR) 10FNegri: [C:03+1] P:wmcs::metricsinfra: increase repeat interval [puppet] - 10https://gerrit.wikimedia.org/r/1013963 (owner: 10Majavah) [10:59:05] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013966 (owner: 10Brouberol) [10:59:25] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2389.codfw.wmnet with reason: host reimage [10:59:32] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra: increase repeat interval [puppet] - 10https://gerrit.wikimedia.org/r/1013963 (owner: 10Majavah) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1100) [11:00:05] claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2386.codfw.wmnet with reason: host reimage [11:00:54] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:01:07] on hold until I find out why staging isn't connecting to kafka [11:01:13] (03CR) 10Fabfur: [C:03+2] haproxy: fixed typo in log-format [puppet] - 10https://gerrit.wikimedia.org/r/1013967 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:01:18] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:01:43] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:01:55] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:02:00] (03CR) 10Brouberol: [C:03+2] Superset: Fix pod label selector for external-services networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013966 (owner: 10Brouberol) [11:02:25] (SystemdUnitFailed) firing: (2) ferm.service on kubernetes2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:02:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2388.codfw.wmnet with reason: host reimage [11:02:45] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:01] (03CR) 10TheDJ: "I've scheduled this for todays' UTC Late back port window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [11:03:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1697/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013957 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [11:03:19] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1013010|Set four more wikis to read new in pagelinks migration (T351237)]] (duration: 13m 13s) [11:04:02] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [11:04:17] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:04:45] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:04:51] 
!log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2336.codfw.wmnet with reason: host reimage [11:05:07] claime: I am seeing a latency spike in the ORES FetchScoreJob (talking to kafka), starting about 15 minutes ago, I suspect it might be related [11:05:33] klausman: No, I was an idiot :) [11:05:38] They're old messages (4 days old) [11:05:44] Ah. Well. [11:07:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2387.codfw.wmnet with reason: host reimage [11:07:17] (03PS1) 10Majavah: P:microsites: fix SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/1013968 [11:07:25] (SystemdUnitFailed) firing: (3) ferm.service on kubernetes2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:08] !log depooling cp4045 to install && test benthos (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013526) (T358109) [11:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:11] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [11:09:10] (03CR) 10Clément Goubert: [C:03+2] changeprop: Move staging to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013532 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [11:09:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2389.codfw.wmnet with reason: host reimage [11:09:16] (03PS4) 10Cparle: MachineVision extension is being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) [11:10:09] (03Merged) 10jenkins-bot: changeprop: Move staging to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013532 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément 
Goubert) [11:10:29] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4045.ulsfo.wmnet [11:11:21] !log Migrating changeprop staging to mw-api-int - T360767 [11:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:24] T360767: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767 [11:11:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2337.codfw.wmnet with reason: host reimage [11:11:30] (03PS5) 10Cparle: MachineVision extension is being sunsetted, so stop doing dumps [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) [11:11:32] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [11:12:07] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [11:12:14] (03PS1) 10Brouberol: fix template/include in networkpolicy template scaffoloding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013969 (https://phabricator.wikimedia.org/T331894) [11:12:20] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos: enable benthos instance on upload host (cp4045) [puppet] - 10https://gerrit.wikimedia.org/r/1013526 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [11:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:25] (SystemdUnitFailed) resolved: (3) ferm.service on kubernetes2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:49] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: 14Add user fabfur to 
analytics-privatedata-users - 14https://phabricator.wikimedia.org/T359561#9657149 (10BTullis) 05Open→03Resolved [11:18:00] claime: can you confirm that you're working on non-staging changeprop? [11:18:05] not yet [11:18:11] I'm working on staging right now [11:18:26] weird. Latency shot up from seconds to almost an hour for the ORES jobs, and I have no clue why [11:18:45] s/latency/backlog/ [11:18:59] !log *repooling* cp4045 with Benthos (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013526) (T358109) [11:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:03] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [11:19:15] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet [11:19:19] (03PS1) 10Majavah: P:wmcs::metricsinfra: alertmanager: fix ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/1013970 (https://phabricator.wikimedia.org/T360630) [11:19:48] klausman: do you mean in jobqueue? 
[11:19:54] Yep [11:20:00] yeah I'm on changeprop proper [11:20:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2386.codfw.wmnet with OS bullseye [11:20:06] ack [11:20:15] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1698/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [11:20:33] Given my puzzlement, I am clearly missing something, so I was guessing :) [11:20:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:21:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2388.codfw.wmnet with OS bullseye [11:22:17] (03PS1) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [11:22:31] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:22:54] (03CR) 10JMeybohm: [C:03+2] Update admin_ng/_example_/helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013961 (owner: 10JMeybohm) [11:23:06] !log bumping concurrency of ORESFetchScoreJob up to help with removing backlog [11:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:14] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:23:22] (03CR) 10JMeybohm: [C:03+2] fix template/include in networkpolicy template scaffoloding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013969 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:24:06] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host mw2336.codfw.wmnet with OS bullseye [11:25:03] (03PS2) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [11:25:07] (03CR) 10Clément Goubert: [C:03+2] changeprop: Move production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [11:25:17] (03Merged) 10jenkins-bot: Update admin_ng/_example_/helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013961 (owner: 10JMeybohm) [11:25:30] (03PS2) 10Clément Goubert: changeprop: Move production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) [11:25:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2387.codfw.wmnet with OS bullseye [11:26:38] (03Merged) 10jenkins-bot: fix template/include in networkpolicy template scaffoloding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013969 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:27:36] (03CR) 10Clément Goubert: changeprop: Move production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [11:27:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2389.codfw.wmnet with OS bullseye [11:27:57] (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [11:28:54] (03CR) 10Clément Goubert: [C:03+2] changeprop: Move production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [11:29:48] (03Merged) 10jenkins-bot: changeprop: Move 
production to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013533 (https://phabricator.wikimedia.org/T360767) (owner: 10Clément Goubert) [11:30:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2337.codfw.wmnet with OS bullseye [11:30:31] !log Scaling mw-api-int up for changeprop migration - T360767 [11:30:37] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:30:56] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:31:04] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:31:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:31:49] !log Migrating codfw changeprop to mw-api-int - T360767 [11:32:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:33:06] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:43:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [11:43:47] Hello wikifeeds o. 
[11:43:54] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra: alertmanager: fix ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/1013970 (https://phabricator.wikimedia.org/T360630) (owner: 10Majavah) [11:44:19] at least we have a nice, non-generic alert now :) [11:45:21] (03PS3) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [11:45:21] (03PS1) 10JMeybohm: Move redis instances hiera key to common/profile [puppet] - 10https://gerrit.wikimedia.org/r/1013974 (https://phabricator.wikimedia.org/T360612) [11:46:17] (03CR) 10Matthias Mullie: [C:03+1] Sunsetting MachineVision extension, so remove config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013284 (https://phabricator.wikimedia.org/T352884) (owner: 10Cparle) [11:46:18] (03PS4) 10Jforrester: Be able to disable MobileFrontend and drop the secondary domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010268 (https://phabricator.wikimedia.org/T349408) [11:46:19] (03PS3) 10Jforrester: [BETA CLUSTER] Disable MobileFrontend for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010269 (https://phabricator.wikimedia.org/T358329) [11:46:19] (03PS3) 10Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) [11:46:35] (03PS2) 10JMeybohm: Move redis instances hiera key to common/profile [puppet] - 10https://gerrit.wikimedia.org/r/1013974 (https://phabricator.wikimedia.org/T360612) [11:46:35] (03PS4) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [11:46:48] (03CR) 10CI reject: [V:04-1] Move redis instances hiera key to common/profile [puppet] - 10https://gerrit.wikimedia.org/r/1013974 (https://phabricator.wikimedia.org/T360612) 
(owner: 10JMeybohm) [11:48:01] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1703/console" [puppet] - 10https://gerrit.wikimedia.org/r/1013974 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [11:49:46] (03PS2) 10Jforrester: Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) [11:51:46] (03PS5) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [11:51:48] ok I'm not seeing obvious errors in changeprop codfw after moving to mw-api-int, migrating eqiad [11:52:18] !log Migrating eqiad changeprop to mw-api-int - T360767 [11:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:22] T360767: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767 [11:52:32] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:52:52] (03CR) 10Matthias Mullie: [C:04-1] "Also remove the invocation to this class from modules/profile/manifests/mediawiki/maintenance.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1013329 (https://phabricator.wikimedia.org/T352884) (owner: 10Cparle) [11:53:05] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:53:32] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1706/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [11:53:41] (03CR) 10Btullis: [V:03+1] "stsve" [puppet] - 10https://gerrit.wikimedia.org/r/1013955 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [11:57:07] (03CR) 
10Matthias Mullie: [C:03+1] MachineVision extension is being sunsetted, so stop doing dumps [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [11:57:46] (03PS3) 10Jforrester: Clean up wiks' permissions for 'changetags' to align with new defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) [11:57:46] (03CR) 10Jforrester: Clean up wiks' permissions for 'changetags' to align with new defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) (owner: 10Jforrester) [11:57:46] (03PS1) 10Jforrester: Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013975 (https://phabricator.wikimedia.org/T355639) [11:58:45] (03PS6) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [11:59:46] (03PS2) 10Cparle: MachineVision is being sunsetted, so remove job [puppet] - 10https://gerrit.wikimedia.org/r/1013329 (https://phabricator.wikimedia.org/T352884) [12:00:30] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2006.codfw.wmnet with OS bookworm [12:00:30] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host testvm2006.codfw.wmnet [12:00:44] (03CR) 10Matthias Mullie: [C:03+1] MachineVision is being sunsetted, so remove job [puppet] - 10https://gerrit.wikimedia.org/r/1013329 (https://phabricator.wikimedia.org/T352884) (owner: 10Cparle) [12:01:24] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm [12:01:59] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1707/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [12:04:40] (03PS1) 10Majavah: Disallow changing email on Wikitech directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013976 (https://phabricator.wikimedia.org/T360883) [12:05:27] (03CR) 10CI reject: [V:04-1] Disallow changing email on Wikitech directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013976 (https://phabricator.wikimedia.org/T360883) (owner: 10Majavah) [12:06:02] (03PS2) 10Majavah: Disallow changing email on Wikitech directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013976 (https://phabricator.wikimedia.org/T360883) [12:06:19] 06SRE, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9657355 (10BTullis) a:03BTullis Thanks again @jcrespo. I'll be working on this ticket today and let you know if there is anything that doesn't seem straight... 
[12:06:46] (03PS1) 10Btullis: Update the ssl_provider for the eventschema service to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013977 (https://phabricator.wikimedia.org/T360412) [12:08:08] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1708/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013977 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:10:21] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host testvm2006.codfw.wmnet with OS bookworm [12:10:39] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [12:12:41] (03PS4) 10Urbanecm: Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) [12:12:45] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9657381 (10BTullis) [12:12:55] (03PS4) 10Urbanecm: Add CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) [12:13:00] (03PS5) 10Urbanecm: Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) [12:13:04] (03PS4) 10Urbanecm: [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766) [12:13:08] (03PS4) 10Urbanecm: [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) [12:13:14] (03CR) 10Ayounsi: [C:03+1] IDP: Switch to new Bookworm host. 
[dns] - 10https://gerrit.wikimedia.org/r/1013949 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [12:13:21] (03CR) 10Urbanecm: [C:04-2] "after the branch cut" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [12:14:32] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:14:55] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657385 (10Clement_Goubert) `mw-api-int` is now receiving all calls to `mwapi_uri` from changeprop {F43323601} There are still calls coming from the `ChangePropagation/WM... [12:15:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:15:57] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2006.codfw.wmnet [12:16:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9657390 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**FAIL**) - Do... [12:17:21] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:18:08] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9657392 (10BTullis) I have created separate CRs for each of the services that used these cergen certificates and th... 
[12:18:13] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: 14Migrate changeprop to mw-api-int - 14https://phabricator.wikimedia.org/T360767#9657393 (10Clement_Goubert) 05In progress→03Resolved [12:18:44] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9657395 (10Clement_Goubert) [12:18:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:34] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [12:19:35] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:22:05] !log Switch IDP/SSO-servers to Bookworm [12:22:07] (03CR) 10Sergio Gimeno: [C:03+1] Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [12:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:17] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:22:25] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2006.codfw.wmnet [12:23:07] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [12:23:08] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:23:14] (03CR) 10Slyngshede: [C:03+2] IDP: Switch to new Bookworm host. 
[dns] - 10https://gerrit.wikimedia.org/r/1013949 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [12:24:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:31] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [12:24:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [12:25:01] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [12:25:06] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9657411 (10Clement_Goubert) [12:25:35] !log Running homer 'cr*codfw*' commit 'T351074' [12:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:39] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:25:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [12:26:09] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm [12:27:28] (03CR) 10Brouberol: [C:03+1] Update the ssl_provider for the eventschema service to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013977 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:27:50] (03CR) 10Brouberol: [C:03+1] Update the ssl_provider for the YARN ui to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013955 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:32:10] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [12:32:37] !log 
brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [12:34:41] !log Pooling and uncordoning mw2336.codfw.wmnet,mw2337.codfw.wmnet,mw2386.codfw.wmnet,mw2387.codfw.wmnet,mw2388.codfw.wmnet,mw2389.codfw.wmnet - T351074 [12:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:45] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:34:51] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2336.codfw.wmnet|mw2337.codfw.wmnet|mw2386.codfw.wmnet|mw2387.codfw.wmnet|mw2388.codfw.wmnet|mw2389.codfw.wmnet),cluster=kubernetes,service=kubesvc [12:34:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:43:41] (03CR) 10Brouberol: [C:03+1] Update the ssl_provider for matomo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013957 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:43:42] (03CR) 10Brouberol: [C:03+1] Update the ssl_provider for turnilo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:43:48] (03CR) 10Brouberol: [C:03+1] Update the ssl_provider for hue to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:43:52] (03PS7) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [12:43:56] (03Abandoned) 10JMeybohm: Move redis instances hiera key to common/profile [puppet] - 10https://gerrit.wikimedia.org/r/1013974 
(https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [12:44:04] (03PS8) 10JMeybohm: deployment_server: Add rdb instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [12:44:20] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ssl_provider for hue to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:44:34] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1709/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [12:44:42] (03PS3) 10Btullis: Update the ssl_provider for hue to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412) [12:46:00] (03PS3) 10Majavah: Disallow changing email on Wikitech directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013976 (https://phabricator.wikimedia.org/T360883) [12:47:01] (03PS1) 10Brouberol: spark-history: add external-services egress network policy template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013989 (https://phabricator.wikimedia.org/T359423) [12:47:05] (03PS1) 10Brouberol: spark-history: replace hardcoded CIDRs by service names to generate egress policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013990 (https://phabricator.wikimedia.org/T359423) [12:47:21] (03PS2) 10Brouberol: spark-history: add external-services egress network policy template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013989 (https://phabricator.wikimedia.org/T359423) [12:47:25] (03PS2) 10Brouberol: spark-history: replace hardcoded CIDRs by service names to generate egress policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013990 (https://phabricator.wikimedia.org/T359423) [12:47:31] (03CR) 10Btullis: "recheck" 
[puppet] - 10https://gerrit.wikimedia.org/r/1013959 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [12:47:41] (03CR) 10CI reject: [V:04-1] spark-history: add external-services egress network policy template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013989 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [12:47:45] (03CR) 10CI reject: [V:04-1] spark-history: replace hardcoded CIDRs by service names to generate egress policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013990 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [12:48:47] Big increase in proton a4 requests [12:48:52] queue is full [12:49:26] I'll manually and temporarily increase replicas [12:49:41] !log doubling replicas for proton in eqiad [12:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:30] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [12:50:31] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [12:50:33] (03PS3) 10Brouberol: spark-history: add external-services egress network policy template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013989 (https://phabricator.wikimedia.org/T359423) [12:50:33] (03PS3) 10Brouberol: spark-history: replace hardcoded CIDRs by service names to generate egress policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013990 (https://phabricator.wikimedia.org/T359423) [12:50:42] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [12:51:28] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [12:51:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from proton_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:51:45] !incidents [12:51:46] 4538 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [12:52:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [12:53:06] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9657484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [12:56:21] (03CR) 10Brouberol: [C:03+2] Superset: remove all resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1013048 (https://phabricator.wikimedia.org/T358570) (owner: 10Brouberol) [12:56:36] (GatewayBackendErrorsHigh) firing: (2) rest-gateway: elevated 5xx errors from proton_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:56:36] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1013989 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [12:56:50] Doubling resources was not enough apparently [12:57:00] lots of pdf requests from zhwiki seems to be the cause [12:57:03] (03CR) 10Btullis: [C:03+1] spark-history: replace hardcoded CIDRs by service names to generate egress policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013990 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [12:57:27] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:57:51] (03CR) 10Brouberol: [C:03+2] spark-history: add external-services egress network policy template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013989 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [12:58:11] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:59:42] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1300). [13:00:04] tgr, Ammar, hashar, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] !log doubling replicas for proton in eqiad again [13:00:12] o/ [13:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:14] Heya. 
[13:00:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:00:43] where I have discovered IPUtils supports ranges such as '127.0.0.1 - 127.0.0.255' [13:01:28] I am doing Ammar's patch which I have postponed from this morning [13:01:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: 10Ammarpad) [13:02:01] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [13:02:38] (03Merged) 10jenkins-bot: throttle: Add throttle rule for editathon at Illinois Tech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013319 (https://phabricator.wikimedia.org/T358494) (owner: 10Ammarpad) [13:02:54] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1013319|throttle: Add throttle rule for editathon at Illinois Tech (T358494)]] [13:03:04] T358494: Request temporary lift of IP cap for edit-a-thon (Illinois Tech) on 2024-03-27 - https://phabricator.wikimedia.org/T358494 [13:03:50] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:04:00] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:04:18] pff [13:04:22] syntax change [13:04:52] (03PS1) 10Clément Goubert: proton: double replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013995 [13:05:02] !incidents [13:05:03] 4538 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [13:05:12] (03CR) 10Hnowlan: [C:03+1] proton: double replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013995 (owner: 10Clément Goubert) [13:05:19] (03PS2) 10Clément Goubert: proton: triple replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013995 [13:06:13] (03CR) 10Brouberol: [C:03+2] spark-history: replace hardcoded CIDRs by service names to generate 
egress policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013990 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [13:06:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657533 (10Papaul) @klausman hello please see @Jhancock.wm comment above. Thank you. [13:06:54] !log hashar@deploy1002 hashar and ammarpad: Backport for [[gerrit:1013319|throttle: Add throttle rule for editathon at Illinois Tech (T358494)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:12] !log hashar@deploy1002 hashar and ammarpad: Continuing with sync [13:07:50] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [13:08:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657536 (10klausman) >>! In T360446#9649946, @Jhancock.wm wrote: > Found the drive as absent in iDRAC. Physically, the drive is there... [13:08:30] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [13:10:42] 06SRE, 06Infrastructure-Foundations, 10netops: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9657554 (10ayounsi) > So we need to decide if this imbalance for local queries is going to be an issue. I think load is the main thing to loo... 
[13:11:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye [13:11:53] (03PS1) 10Brouberol: spark-history: fix egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013997 (https://phabricator.wikimedia.org/T359423) [13:12:13] (03PS9) 10JMeybohm: deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [13:16:46] I wonder about optimizing versus human clarity [13:16:51] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1710/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [13:17:01] $foo = $bar === 'something' ? 'yes' : null; [13:17:03] (03PS1) 10Arnaudb: auto_schema: add a test on Db to check column types [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) [13:17:12] that takes some brain cycles to parse :D [13:17:15] !log bounce prometheus@k8s on prometheus2005 to diagnose OOM - T354399 [13:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:19] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [13:18:22] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1013319|throttle: Add throttle rule for editathon at Illinois Tech (T358494)]] (duration: 15m 28s) [13:18:26] T358494: Request temporary lift of IP cap for edit-a-thon (Illinois Tech) on 2024-03-27 - https://phabricator.wikimedia.org/T358494 [13:18:37] (03PS2) 10Brouberol: spark-history: fix egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013997 (https://phabricator.wikimedia.org/T359423) [13:19:13] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ssl_provider for the analytics 
webserver to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013956 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:19:20] (03PS3) 10Btullis: Update the ssl_provider for the analytics webserver to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013956 (https://phabricator.wikimedia.org/T360412) [13:20:00] (03CR) 10Btullis: [C:03+1] spark-history: fix egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013997 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [13:20:23] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1013956 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:21:30] (03CR) 10Hashar: Use more compact PHP7 syntax where possible (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [13:21:35] next [13:22:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [13:22:43] (03CR) 10Brouberol: [C:03+2] spark-history: fix egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013997 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [13:22:46] (03Merged) 10jenkins-bot: Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [13:23:03] !log hashar@deploy1002 Started scap: Backport for [[gerrit:737859|Use more compact PHP7 syntax where possible]] [13:23:48] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [13:25:43] !log hashar@deploy1002 thiemowmde and hashar: Backport for [[gerrit:737859|Use more compact PHP7 syntax where possible]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:27:43] how to fully test that ... [13:27:56] Is the site up? Then it works. 
;-) [13:28:19] !log bounce prometheus@k8s on prometheus2006 to diagnose OOM - T354399 [13:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:29] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [13:28:42] !log hashar@deploy1002 thiemowmde and hashar: Continuing with sync [13:29:17] hashar: tested other than the maintenance and throttle parts [13:29:32] (I think; it's not very obvious when exactly they are called) [13:29:38] (03PS1) 10Btullis: Update the from address of all email from refinery jobs. [puppet] - 10https://gerrit.wikimedia.org/r/1014001 (https://phabricator.wikimedia.org/T358675) [13:29:40] magically is my guess [13:29:41] :) [13:29:46] those two are relatively low risk [13:29:51] I gave it a "careful" review as well [13:29:57] so I guess that is enough pairs of eyes [13:30:02] and good morning! [13:30:16] isn't it like 6am for you, tgr? [13:30:31] no, I am in Austria [13:31:05] you so need timezone-aware nicknames [13:31:23] or IRC needs an extension to let one set their time zone and expose it [13:31:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1711/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014001 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [13:32:03] James_F: I have just found out the next two patches are from you [13:32:10] Yes. Can you deploy? [13:32:14] and apparently we can only do 4 patches per window now :( [13:32:17] Should be no-ops except for Beta. [13:32:18] yes I will deploy them [13:32:21] Joy. 
[13:32:22] ah excellent [13:32:30] the delay is well [13:32:35] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@897fc7e]: (no justification provided) [13:32:37] or maybe we need to stop using IRC and switch to a modern chat system, pretty much any one of which has such a feature :) [13:32:47] (03CR) 10Ladsgroup: "I might be wrong but I think this is not needed. Let me grab the examples" [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) (owner: 10Arnaudb) [13:32:52] the delay is well, driven by php restarts (~ 3 minutes) and kubernetes deploy (~ 7 minutes) [13:33:00] tgr: +1 [13:33:15] (03CR) 10Stevemunene: [C:03+1] Update the ssl_provider for the eventschema service to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013977 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:33:16] good luck on us A) picking a new system B) actually committing to migrate to it :-\\\ [13:33:35] (03CR) 10Stevemunene: [C:03+1] Update the ssl_provider for the YARN ui to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013955 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:33:46] (03CR) 10Stevemunene: [C:03+1] Update the ssl_provider for turnilo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:33:51] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@897fc7e]: (no justification provided) (duration: 01m 16s) [13:33:57] (03CR) 10Stevemunene: [C:03+1] Update the ssl_provider for matomo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013957 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:34:33] (03CR) 10Ladsgroup: "yup, I think this is not needed? 
https://wikitech.wikimedia.org/wiki/Auto_schema/examples" [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) (owner: 10Arnaudb) [13:37:18] (03CR) 10Majavah: [C:03+2] P:toolforge::legacy_redirector: Drop configuration [puppet] - 10https://gerrit.wikimedia.org/r/1013522 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [13:39:10] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ssl_provider for turnilo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:39:15] (03PS3) 10Btullis: Update the ssl_provider for turnilo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412) [13:39:18] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:737859|Use more compact PHP7 syntax where possible]] (duration: 16m 15s) [13:39:23] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1013958 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [13:39:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:40:40] well apparently it is still up [13:41:36] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from proton_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:41:38] (03CR) 10Hashar: [C:03+2] Be able to disable MobileFrontend and drop the secondary domain [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1010268 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester) [13:42:07] (03CR) 10Hashar: [C:03+2] [BETA CLUSTER] Disable MobileFrontend for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010269 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [13:42:30] James_F: so MobileFrontend will eventually be phased out? [13:42:38] hashar: Certainly for WF. [13:42:40] (03Merged) 10jenkins-bot: Be able to disable MobileFrontend and drop the secondary domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010268 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester) [13:42:42] Maybe everywhere. [13:42:51] WF == Wikimedia Foundation ? [13:43:13] Wikifunctions. [13:43:19] (03Merged) 10jenkins-bot: [BETA CLUSTER] Disable MobileFrontend for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010269 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [13:43:50] !log hashar@deploy1002 Started scap: (no justification provided) [13:44:00] I forgot the message [13:44:36] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [13:44:59] (03PS2) 10Arnaudb: auto_schema: add a test on Db to check column types [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) [13:45:03] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [13:45:57] (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runner2004 for dockerfile conversion [puppet] - 10https://gerrit.wikimedia.org/r/1014005 (https://phabricator.wikimedia.org/T357612) [13:46:20] (03PS3) 10Arnaudb: auto_schema: add a test on Db to check column types [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) [13:46:51] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:46:59] Ha [13:47:08] Hello again wikifeeds :p [13:48:06] (GatewayBackendErrorsHigh) resolved: (2) rest-gateway: elevated 5xx errors from proton_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:48:20] James_F: https://phabricator.wikimedia.org/T360597 [13:48:29] (the wikifeeds issue) [13:48:49] claime: Oh dear. [13:49:47] Pile-up of already-bad requests to one DB triggering the error threshold? Fun. [13:49:57] (03CR) 10Arnaudb: "ah!" [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) (owner: 10Arnaudb) [13:50:01] (03Abandoned) 10Arnaudb: auto_schema: add a test on Db to check column types [software] - 10https://gerrit.wikimedia.org/r/1013364 (https://phabricator.wikimedia.org/T360332) (owner: 10Arnaudb) [13:50:57] Dropping RB can't come quickly enough. 
[13:51:43] (03PS1) 10Brouberol: spark-history: bypass Kerberos principal hostname reverse DNS check for namenode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014010 (https://phabricator.wikimedia.org/T359423) [13:52:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 16 hosts with reason: Maint T352010 [13:52:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:52:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Maint T352010 [13:53:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 16 hosts with reason: Maint T352010 [13:53:13] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9657652 (10BTullis) [13:53:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 16 hosts with reason: Maint T352010 [13:55:29] (03CR) 10Brouberol: [C:03+1] Update the from address of all email from refinery jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/1014001 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [13:55:47] !log finish rolling out rsyslog-exporter to remaining hosts in codfw and eqiad - T357616 [13:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:51] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [13:56:37] !log hashar@deploy1002 Finished scap: (no justification provided) (duration: 12m 46s) [13:58:08] !log UTC afternoon backport window completed [13:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657681 (10Jhancock.wm) reseating the drive did not fix the issue. server is in warranty. created a ticket with Dell to get it replace... [14:05:43] (03PS4) 10Majavah: P:toolforge::legacy_redirector: Use Apache on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) [14:08:49] (03CR) 10Andrew Bogott: [C:03+1] P:toolforge::legacy_redirector: Use Apache on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [14:10:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS bookworm [14:10:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9657687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw.... 
[14:10:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS bookworm [14:10:40] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9657688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudbackup2003.codfw.wmne... [14:11:29] (03CR) 10Majavah: [C:03+2] P:toolforge::legacy_redirector: Use Apache on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1013523 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [14:13:28] (03CR) 10Dreamy Jazz: [C:03+1] Schedule weekly purge of global_block_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: 10Tchanders) [14:17:09] (03PS1) 10EoghanGaffney: [gitlab] Switch gitlab-replica and gitlab-replica-old [puppet] - 10https://gerrit.wikimedia.org/r/1014016 [14:20:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2006.codfw.wmnet with OS bookworm [14:20:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2006.codfw.wmnet [14:21:01] (03PS1) 10Majavah: P:toolforge: fix duplicate statement [puppet] - 10https://gerrit.wikimedia.org/r/1014017 [14:21:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [14:21:59] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [14:22:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [14:22:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: provisionning 
db2214.codfw.wmnet - T355422 [14:22:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS bookworm [14:22:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: provisionning db2214.codfw.wmnet - T355422 [14:22:32] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9657711 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw.... [14:22:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:23:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2114 in db2214 for T355422', diff saved to https://phabricator.wikimedia.org/P58909 and previous config saved to /var/cache/conftool/dbconfig/20240325-142344-arnaudb.json [14:25:44] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2114.codfw.wmnet onto db2214.codfw.wmnet [14:25:57] (03CR) 10Majavah: [C:03+2] P:toolforge: fix duplicate statement [puppet] - 10https://gerrit.wikimedia.org/r/1014017 (owner: 10Majavah) [14:26:27] (03CR) 10Elukey: "Janis/Alex - when you have a moment lemme know if the current patch looks ok to merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [14:27:23] 10ops-eqiad, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9657730 (10herron) [14:28:07] 
(03CR) 10JMeybohm: [C:03+1] "Looks good from my POV" [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [14:28:17] 10ops-codfw, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895 (10herron) 03NEW [14:32:42] (03CR) 10Ssingh: [C:03+1] Decommission aqs records [dns] - 10https://gerrit.wikimedia.org/r/1013500 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:33:03] (03PS2) 10Brouberol: Decommission aqs records [dns] - 10https://gerrit.wikimedia.org/r/1013500 (https://phabricator.wikimedia.org/T358793) [14:33:06] (03PS1) 10Majavah: P:toolforge::legacy_redirector: fix cert path [puppet] - 10https://gerrit.wikimedia.org/r/1014020 [14:33:06] (03PS1) 10Majavah: P:toolforge::legacy_redirector: drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1014021 (https://phabricator.wikimedia.org/T311909) [14:33:54] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ssl_provider for the YARN ui to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013955 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [14:34:53] (03CR) 10Brouberol: [C:03+2] Decommission aqs records [dns] - 10https://gerrit.wikimedia.org/r/1013500 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:35:24] (03CR) 10Ssingh: Decommission aqs realserver pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013501 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:37:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:00] (03CR) 10Majavah: [C:03+2] P:toolforge::legacy_redirector: fix cert path [puppet] - 10https://gerrit.wikimedia.org/r/1014020 (owner: 10Majavah) [14:39:03] (03PS10) 
10JMeybohm: deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [14:40:40] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:41:37] (03PS2) 10Brouberol: Set state of aqs service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1013501 (https://phabricator.wikimedia.org/T358793) [14:41:37] (03PS1) 10Brouberol: aqs: remove the relserver pool from host [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) [14:41:51] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [14:42:19] (03CR) 10Ssingh: [C:03+1] "LGTM! We will run `sudo cumin 'A:dnsbox' run-puppet-agent` after this merged." [puppet] - 10https://gerrit.wikimedia.org/r/1013501 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:42:36] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1712/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [14:43:09] (03CR) 10Ssingh: "Here, the LVS state should also be changed to state: service_setup in addition to the change already made (as in the previous commit)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:43:35] (03PS1) 10JMeybohm: Enable external-services on all wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014024 (https://phabricator.wikimedia.org/T331894) [14:45:10] (03PS2) 10Brouberol: aqs: remove the relserver pool from host [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) [14:45:43] (03CR) 10Brouberol: "This change will be released once https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013501/2? is released, so at that point, the servi" [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:45:51] (03CR) 10Brouberol: [C:03+2] Set state of aqs service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1013501 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:47:06] (03CR) 10Giuseppe Lavagetto: [C:03+1] role::docker_registry_ha::registry: increase tmpfs size in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1013541 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [14:47:32] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [14:48:59] jouncebot: next [14:48:59] In 0 hour(s) and 41 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1530) [14:49:19] (03CR) 10Ssingh: "LGTM, one thing I forgot mentioned inline:" [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:50:34] (03PS3) 10Brouberol: aqs: remove the relserver pool from host [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) [14:50:46] (03CR) 10Brouberol: aqs: remove the relserver pool from host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:51:31] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on registry1003.eqiad.wmnet with reason: Increase tmpfs for nginx [14:51:47] (03CR) 10Ssingh: [C:03+1] aqs: remove the relserver pool from host [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:51:56] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry1003.eqiad.wmnet with reason: Increase tmpfs for nginx [14:52:05] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on registry1004.eqiad.wmnet with reason: Increase tmpfs for nginx [14:52:10] (03CR) 10Brouberol: [C:03+2] aqs: remove the relserver pool from host [puppet] - 10https://gerrit.wikimedia.org/r/1014023 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [14:52:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry1004.eqiad.wmnet with reason: Increase tmpfs for nginx [14:52:54] (03CR) 10Elukey: [V:03+1 C:03+2] role::docker_registry_ha::registry: increase tmpfs size in eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/1013541 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [14:54:44] (03CR) 10Xcollazo: MachineVision extension is being sunsetted, so stop doing dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [14:56:53] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ssl_provider for matomo to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013957 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [14:57:12] (03PS1) 10Majavah: P:toolforge::legacy_redirector: add www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1014025 (https://phabricator.wikimedia.org/T311909) [14:57:13] (03PS1) 10Majavah: P:toolforge::legacy_redirector: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014026 (https://phabricator.wikimedia.org/T311909) [14:57:15] (03PS1) 10Majavah: Remove old toolserver_legacy code [puppet] - 10https://gerrit.wikimedia.org/r/1014027 [14:57:54] !log increase tmpfs for /var/lib/nginx on registry100[3,4] and restart nginx - T360637 [14:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:58] T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637 [14:58:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2006.codfw.wmnet with OS bullseye [14:59:07] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9657852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2006.codfw.wmnet with OS bullseye [14:59:33] (03PS6) 10Ladsgroup: Make af_actor and afh_actor accessible in Wiki Replicas [puppet] - 10https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921) (owner: 10Zabe) [14:59:36] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Make af_actor and afh_actor accessible 
in Wiki Replicas [puppet] - 10https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921) (owner: 10Zabe) [15:00:06] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [15:00:19] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [15:00:31] !log restarting pybal on lvs2014.codfw.wmnet - T358793 [15:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:39] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793 [15:01:11] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9657870 (10BTullis) [15:01:30] 06SRE, 10MediaWiki-Email, 10Observability-Alerting: Mail sent out by MediaWiki should have the Auto-Submitted header set to 'auto-generated' (RFC 3834) - https://phabricator.wikimedia.org/T18799#9657871 (10andrea.denisse) [15:01:45] (03CR) 10CI reject: [V:04-1] P:toolforge::legacy_redirector: add www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1014025 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [15:02:17] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ssl_provider for the eventschema service to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1013977 (https://phabricator.wikimedia.org/T360412) (owner: 10Btullis) [15:02:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:45] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:56] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [15:03:33] (03CR) 10Elukey: profile::prometheus::k8s: move istio metrics to a separate job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:05:10] (03PS1) 10Ladsgroup: mediawiki: Absent FR purge systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1014028 (https://phabricator.wikimedia.org/T359529) [15:06:08] (03CR) 10Elukey: [C:03+2] profile::prometheus::k8s: move istio metrics to a separate job [puppet] - 10https://gerrit.wikimedia.org/r/1012404 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:06:51] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9657884 (10Papaul) [15:12:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:20] (03PS1) 10Jdrewniak: Guard against undefined $container element in initMobile.js [skins/MinervaNeue] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013645 (https://phabricator.wikimedia.org/T360781) [15:12:33] (03PS2) 10Ladsgroup: mediawiki: Absent FR purge systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1014028 (https://phabricator.wikimedia.org/T359529) [15:12:40] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1014028 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) 
[15:13:06] !log restarting pybal on lvs2013.codfw.wmnet - T358793 [15:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:10] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793 [15:14:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: Maint T343718 [15:14:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:14:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: Maint T343718 [15:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:06] anyone working on elastic2037.codfw.wmnet? [15:20:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:20:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:20:32] PYBAL CRITICAL - CRITICAL - search_9200: Servers elastic2037.codfw.wmnet are marked down but pooled [15:20:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:20:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:20:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:21:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:21:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:21:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:21:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:21:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:21:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:21:35] sukhe: perhaps related to T358882 (Brian is out today so can't confirm) [15:21:35] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882 [15:21:40] 06SRE, 10MediaWiki-Email, 10Observability-Alerting: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9657947 (10andrea.denisse) [15:21:43] oh thanks [15:21:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:21:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:22:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:22:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:22:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:22:21] !log ladsgroup@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:22:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [15:22:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [15:22:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [15:22:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [15:23:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:24:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2038.codfw.wmnet [15:24:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:24:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:24:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:24:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:24:30] !log depool elastic2037: host is in insetup and in process of being decomissioned [15:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:24:42] !log ladsgroup@cumin1002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:24:43] 06SRE, 10MediaWiki-Email, 10Observability-Alerting: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9657949 (10andrea.denisse) [15:24:45] 06SRE, 10MediaWiki-Email, 10Observability-Alerting: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9657962 (10andrea.denisse) [15:24:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:24:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [15:25:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [15:25:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [15:25:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [15:25:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:25:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:25:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [15:26:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [15:26:01] (03PS6) 10Dzahn: prometheus/apache_exporter: drop argument parameter [puppet] - 10https://gerrit.wikimedia.org/r/1009775 
(https://phabricator.wikimedia.org/T359556) [15:26:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [15:26:15] (03CR) 10Dzahn: prometheus/apache_exporter: drop argument parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [15:26:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [15:26:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db2214.codfw.wmnet [15:26:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:26:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:27:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [15:27:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [15:27:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: provisionning db2216.codfw.wmnet - T355422 [15:27:54] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [15:28:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: provisionning db2216.codfw.wmnet - T355422 [15:28:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: provisionning db2216.codfw.wmnet - T355422 [15:28:16] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts 
improvements across services - https://phabricator.wikimedia.org/T360902#9657970 (10andrea.denisse) [15:28:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: provisionning db2216.codfw.wmnet - T355422 [15:28:32] 06SRE, 10MediaWiki-Email: Mail sent out by MediaWiki should have the Auto-Submitted header set to 'auto-generated' (RFC 3834) - https://phabricator.wikimedia.org/T18799#9657989 (10andrea.denisse) [15:29:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [15:29:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [15:29:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2116 in db2216 for T355422', diff saved to https://phabricator.wikimedia.org/P58910 and previous config saved to /var/cache/conftool/dbconfig/20240325-152958-arnaudb.json [15:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1530). [15:31:58] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2116.codfw.wmnet onto db2216.codfw.wmnet [15:33:06] 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907 (10DBu-WMF) 03NEW [15:33:37] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus/apache_exporter: drop argument parameter [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [15:34:45] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. 
- https://phabricator.wikimedia.org/T360029#9658030 (10CDanis) [15:34:48] (03PS3) 10Ladsgroup: mediawiki: Absent FR purge systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1014028 (https://phabricator.wikimedia.org/T359529) [15:34:54] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Absent FR purge systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1014028 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [15:36:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS bookworm [15:37:13] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9658042 (10CDanis) Just to make sure I understand, the request here is an easy-to-automate way of dbctl to change the instance IP address? It... [15:40:29] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014033 (https://phabricator.wikimedia.org/T128546) [15:41:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58911 and previous config saved to /var/cache/conftool/dbconfig/20240325-154107-arnaudb.json [15:43:51] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014033 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:44:33] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014033 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:53:00] (03PS1) 10Elukey: profile::prometheus::k8s: restrict targets for k8s-pods-istio [puppet] - 10https://gerrit.wikimedia.org/r/1014035 (https://phabricator.wikimedia.org/T351390) [15:53:37] 10SRE-swift-storage: Swift server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 (10MatthewVernon) 03NEW [15:54:17] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#9658155 (10Volans) Yes, correct. [15:55:45] (03PS2) 10Elukey: profile::prometheus::k8s: restrict targets for k8s-pods-istio [puppet] - 10https://gerrit.wikimedia.org/r/1014035 (https://phabricator.wikimedia.org/T351390) [15:56:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58912 and previous config saved to /var/cache/conftool/dbconfig/20240325-155613-arnaudb.json [15:57:17] /11 [15:57:20] err [15:57:21] (03CR) 10Filippo Giunchedi: [C:03+1] profile::prometheus::k8s: restrict targets for k8s-pods-istio [puppet] - 10https://gerrit.wikimedia.org/r/1014035 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:59:10] (03CR) 10Elukey: [C:03+2] profile::prometheus::k8s: restrict targets for k8s-pods-istio [puppet] - 10https://gerrit.wikimedia.org/r/1014035 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:59:28] 10SRE-swift-storage: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913#9658160 (10MatthewVernon) p:05Triage→03Medium [16:01:17] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1014033| Bumping portals to master (T128546)]] (duration: 13m 00s) [16:01:24] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:05:54] !log restarting pybal on lvs1020.eqiad.wmnet - T358793 [16:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:09] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793 [16:08:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 35.74% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:09:46] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2006.codfw.wmnet with OS bullseye [16:11:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58913 and previous config saved to /var/cache/conftool/dbconfig/20240325-161121-arnaudb.json [16:11:30] !log depooling restbase10[34-42] — T360597 [16:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:33] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [16:12:27] !log restarting pybal on lvs1019.eqiad.wmnet - T358793 [16:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:31] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793 [16:13:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 35.74% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:13:32] !log jdrewniak@deploy1002 Synchronized 
portals: Wikimedia Portals Update: [[gerrit:1014033| Bumping portals to master (T128546)]] (duration: 12m 15s) [16:13:36] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:15:14] 10SRE-swift-storage: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#9658289 (10MatthewVernon) Reported upstream as [[ https://bugs.launchpad.net/swift/+bug/2058945 | Bug #2058945 ]]. [16:18:55] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412#9658308 (10BTullis) [16:24:57] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9658326 (10Jelto) [16:25:54] 06SRE, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: 14Phase out cergen for Data Platform services - 14https://phabricator.wikimedia.org/T360412#9658322 (10BTullis) 05Open→03Resolved 14I have now removed the obsolete cergen material for all of the... [16:26:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58914 and previous config saved to /var/cache/conftool/dbconfig/20240325-162627-arnaudb.json [16:27:01] (03PS3) 10Ladsgroup: mediawiki: Get rid of purge flaggedrevs [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) [16:29:39] (03PS1) 10Brouberol: aqs: Remove conftool data and service entry [puppet] - 10https://gerrit.wikimedia.org/r/1014042 (https://phabricator.wikimedia.org/T358793) [16:30:46] (03CR) 10Ssingh: [C:03+1] "Looks good but if desired, there are also references in site.pp that can be removed/updated." 
[puppet] - 10https://gerrit.wikimedia.org/r/1014042 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [16:33:36] (03CR) 10Btullis: spark-history: bypass Kerberos principal hostname reverse DNS check for namenode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014010 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [16:34:53] Hey all, something went wrong with the portal deploy earlier. Is now a good time to do a revert? [16:35:45] (03PS1) 10Jdrewniak: Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014066 [16:36:13] !log pooling restbase10[19-21] — T360597 [16:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:17] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [16:37:18] jouncebot: now [16:37:18] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [16:37:25] jan_drewniak: yeah looks fine I guess ? :) [16:37:25] hashar: [16:37:39] ok thanks, this won't take long [16:37:49] (03CR) 10Hashar: [C:03+1] Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014066 (owner: 10Jdrewniak) [16:38:04] you can self serve deploy can't you? [16:38:21] yes! I will do that now. 
meanwhile don't look at www.wikipedia.org for the next 15 minutes 🙈 [16:38:30] * hashar opens link [16:38:37] * hashar freaks out [16:38:47] user has disconnected [16:39:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2006.codfw.wmnet with OS bullseye [16:39:05] seriously [16:39:14] cd /srv/mediawiki-staging && git status [16:39:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9658412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2006.codfw.wmnet with OS bullseye [16:39:24] it could have been worse, I see title, globe search box, a looking glass blue button [16:39:27] other crap at bottom [16:39:33] so that looks very MVP :) [16:39:46] monday mornings... I've created the revert [16:39:58] (03CR) 10Jdrewniak: [C:03+2] Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014066 (owner: 10Jdrewniak) [16:40:39] (03Merged) 10jenkins-bot: Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014066 (owner: 10Jdrewniak) [16:43:18] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9658445 (10andrea.denisse) [16:45:49] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202#9658447 (10Jelto) p:05Triage→03Medium a:03Dzahn [16:47:23] !log pooling restbase10[31-33] — T360597 [16:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [16:47:32] !log correction: depooling restbase10[31-33] — T360597 [16:47:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:14] (03PS57) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [16:48:14] (03PS1) 10AOkoth: trafficserver: miscweb(security) failover to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) [16:49:07] (03PS2) 10AOkoth: trafficserver: miscweb(security) failover to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) [16:53:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2006.codfw.wmnet with reason: host reimage [16:55:20] (03CR) 10Jelto: trafficserver: miscweb(security) failover to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [16:55:56] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1014066| Bumping portals to master (T128546)]] (duration: 13m 42s) [16:56:02] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:56:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2006.codfw.wmnet with reason: host reimage [16:57:02] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9658513 (10Papaul) [16:59:30] (03PS3) 10AOkoth: trafficserver: miscweb(security) failover to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1700) [17:00:05] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T1700). [17:01:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS bookworm [17:01:29] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9658550 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw.... [17:02:45] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9658554 (10Jhancock.wm) sretest2003 and 2004 have been renamed to their original server names and been offlined (including ssd removal). [17:04:15] (03CR) 10Elukey: "One thing to check - in our blubber images we copy site-packages to /opt/lib/python/site-packages, meanwhile in this case pip installs und" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [17:06:17] !log restarting restbase service, restbase1024 — T360597 [17:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:35] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [17:08:44] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1014066| Bumping portals to master (T128546)]] (duration: 12m 47s) [17:08:48] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:09:05] !log restarting restbase service, restbase1031 — T360597 [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:42] !log pt1979@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:13:53] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@897fc7e]: Deploy latest restbase commit to restbase1024 [17:15:03] 06SRE, 06Data Products, 06Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9658600 (10Milimetric) @VirginiaPoundstone: Looks like Giuseppe patched varnish to send more requestctls, so maybe that completely or partially solves the problem. I'd have to look throug... [17:15:15] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@897fc7e]: Deploy latest restbase commit to restbase1024 (duration: 01m 22s) [17:16:09] * jan_drewniak hashar: you can rest easy, the portal fix has been deployed ;P (we had some GSoC contributions recently, so I'll have to dig a little deeper into what caused the issue). [17:17:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:17:58] (03CR) 10David Martin: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [17:19:55] (03CR) 10David Martin: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [17:21:21] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@897fc7e]: Deploy latest restbase commit to restbase1031 [17:22:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for 
Mediawiki mw-api-ext at eqiad: 38.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:22:48] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@897fc7e]: Deploy latest restbase commit to restbase1031 (duration: 01m 26s) [17:24:18] (03CR) 10Iniquity: "There's some error, I can't update the patch :( 503... Need to add for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) (owner: 10Jforrester) [17:31:27] 06SRE, 06Traffic, 10Data Products (Data Products Sprint 13): Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9658684 (10VirginiaPoundstone) [17:32:20] (03PS3) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) [17:32:45] (03CR) 10Dzahn: [C:03+1] [gitlab] Switch gitlab-replica and gitlab-replica-old [puppet] - 10https://gerrit.wikimedia.org/r/1014016 (owner: 10EoghanGaffney) [17:33:05] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Switch gitlab-replica and gitlab-replica-old [puppet] - 10https://gerrit.wikimedia.org/r/1014016 (owner: 10EoghanGaffney) [17:43:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2116.codfw.wmnet onto db2216.codfw.wmnet [17:49:13] (03PS4) 10Jforrester: Clean up wiks' permissions for 'changetags' to align with new defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) [17:50:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS bookworm [17:51:25] (SystemdUnitFailed) firing: cassandra-a.service on restbase1031:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:51:28] (03CR) 10Andrew Bogott: "If you want to do the merge, then you can manage the dns/puppet test :)" [puppet] - 10https://gerrit.wikimedia.org/r/1013382 (owner: 10Andrew Bogott) [17:54:19] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9658794 (10Jrbranaa) Sorry for the delay. The renewal/expiry date is June 30, 2025. [17:54:39] (03CR) 10Iniquity: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992763 (https://phabricator.wikimedia.org/T355639) (owner: 10Jforrester) [17:58:55] (03CR) 10Jforrester: "Probably should deploy this before the branch cut so the job doesn't try to run a script that doesn't exist any more?" [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [18:00:07] (03CR) 10Btullis: [V:03+1] "I'm cautionus about deploying this, because I don't know whether refinery-source will strictly enforce the RFC822 email address." [puppet] - 10https://gerrit.wikimedia.org/r/1014001 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [18:00:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:00:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2006.codfw.wmnet with OS bullseye [18:00:50] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9658799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2006.codfw.wmnet with OS bullseye completed:... 
[18:01:20] (03CR) 10Ladsgroup: "it's already absented and I think it's monthly I get it deployed soon anyway." [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [18:01:25] (SystemdUnitFailed) resolved: cassandra-a.service on restbase1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:02:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9658803 (10jcrespo) a:03jcrespo [18:04:16] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9658807 (10Volans) Will you take care also of debian packaging it and any... 
[18:21:21] (03PS1) 10Andrew Bogott: cloud-vps mail exchange: change to service names [puppet] - 10https://gerrit.wikimedia.org/r/1014094 [18:22:31] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9658856 (10bd808) [18:25:25] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps mail exchange: change to service names [puppet] - 10https://gerrit.wikimedia.org/r/1014094 (owner: 10Andrew Bogott) [18:26:14] (03CR) 10AOkoth: trafficserver: miscweb(security) failover to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [18:26:21] (03CR) 10Majavah: [C:03+2] P:toolforge::legacy_redirector: drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1014021 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [18:26:51] !log bearloga@deploy1002 Started deploy [airflow-dags/analytics_product@5e40c6f]: (no justification provided) [18:27:00] !log bearloga@deploy1002 Finished deploy [airflow-dags/analytics_product@5e40c6f]: (no justification provided) (duration: 00m 08s) [18:27:36] (03PS3) 10Ebernhardson: cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) [18:27:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [18:27:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2005.codfw.wmnet with OS bullseye [18:28:03] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9658875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [18:28:05] 10ops-codfw, 
06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9658878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye executed w... [18:29:57] (03CR) 10Dzahn: "this is called a failure but for no real reason: https://puppet-compiler.wmflabs.org/output/1009775/1713/" [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [18:31:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9658907 (10Papaul) [18:32:20] (03CR) 10Ssingh: [C:03+1] "Thanks I can do that now." [puppet] - 10https://gerrit.wikimedia.org/r/1013382 (owner: 10Andrew Bogott) [18:34:12] !log sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1013382'" [18:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts dbprov2005.codfw.wmnet [18:37:01] (03CR) 10Ssingh: [C:03+2] base: remove profile::base::manage_timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/1013382 (owner: 10Andrew Bogott) [18:38:24] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [18:40:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:42:44] !log eoghan@cumin1002 START - Cookbook sre.dns.wipe-cache 'https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/' on all recursors [18:42:47] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/' on all recursors [18:44:07] !log sudo cumin -b1 -s60 "A:dns-rec and not P{dns6001*}" 
"run-puppet-agent --enable 'merging CR 1013382'" [18:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:03] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [18:45:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [18:46:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [18:46:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbprov2005.codfw.wmnet [18:47:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9658952 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by pt1979@cumin2002 for hosts: `dbprov2005.codfw.wmnet` - dbprov2005.codfw.wmnet (... [18:48:32] 10SRE-tools, 10Cloud-VPS, 10Spicerack: Support downtiming metricsinfra alerts in wmcs-cookbooks - https://phabricator.wikimedia.org/T360932 (10taavi) 03NEW [18:48:50] (03CR) 10Brouberol: spark-history: bypass Kerberos principal hostname reverse DNS check for namenode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014010 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [18:50:14] (03CR) 10Brouberol: "The issue here is that the aqs hosts were running the aqs service and a cassandra server. 
The service is being deprecated but the Cassandr" [puppet] - 10https://gerrit.wikimedia.org/r/1014042 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [18:53:25] (03PS1) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [18:53:46] 10SRE-tools, 10Cloud-VPS, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support downtiming metricsinfra alerts in wmcs-cookbooks - https://phabricator.wikimedia.org/T360932#9659005 (10taavi) a:03taavi [18:56:06] (03PS1) 10Brouberol: aqs: fix puppet compilation error [puppet] - 10https://gerrit.wikimedia.org/r/1014100 (https://phabricator.wikimedia.org/T358793) [18:56:22] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1014100 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [18:56:34] (03CR) 10CI reject: [V:04-1] aqs: fix puppet compilation error [puppet] - 10https://gerrit.wikimedia.org/r/1014100 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [18:58:16] (03PS2) 10Majavah: P:toolforge::legacy_redirector: add www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1014025 (https://phabricator.wikimedia.org/T311909) [18:58:16] (03PS2) 10Majavah: P:toolforge::legacy_redirector: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014026 (https://phabricator.wikimedia.org/T311909) [18:58:16] (03PS2) 10Majavah: Remove old toolserver_legacy code [puppet] - 10https://gerrit.wikimedia.org/r/1014027 [18:58:45] (03CR) 10Brouberol: [C:03+2] aqs: Remove conftool data and service entry [puppet] - 10https://gerrit.wikimedia.org/r/1014042 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [19:00:42] (03CR) 10CI reject: [V:04-1] alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) 
(owner: 10Majavah) [19:01:42] (03CR) 10CI reject: [V:04-1] P:toolforge::legacy_redirector: add www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1014025 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [19:02:23] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1025.eqiad.wmnet with reason: Decommissioning — T354561 [19:02:27] T354561: Decommission restbase10[19-27] - https://phabricator.wikimedia.org/T354561 [19:02:37] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1025.eqiad.wmnet with reason: Decommissioning — T354561 [19:02:46] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:03:24] (03PS1) 10EoghanGaffney: [gitlab] Restart wmf_auto_restart_ssh on rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/1014102 [19:03:24] (03PS1) 10EoghanGaffney: [gitlab] Lock backups/restores on switch_from host after backup creation [cookbooks] - 10https://gerrit.wikimedia.org/r/1014103 [19:07:41] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_aqs.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:08:19] (03CR) 10CI reject: [V:04-1] [gitlab] Lock backups/restores on switch_from host after backup creation [cookbooks] - 10https://gerrit.wikimedia.org/r/1014103 (owner: 10EoghanGaffney) [19:08:35] brouberol: ^ probably stale files [19:08:58] sudo rm /var/run/confd-template/.cloudceph*.err [19:09:02] https://wikitech.wikimedia.org/wiki/Confd#Stale_template_error_files_present [19:09:21] in this case, on 
puppetmaster1001 and 2001: sudo rm /var/run/confd-template/.aqs*.err [19:09:59] back after baby duty, I was trying to wrap up the removal of the aqs service. Thanks, on it [19:10:34] done on puppetmaster1001, and no such file on puppetmaster1002 [19:10:34] np, let me know if you want me to take care of it too but since you were doing it [19:12:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:12:41] what I'm trying to figure out atm is how to resolve the puppet compilation errors we're seeing on the aqs hosts (ex: https://puppetboard.wikimedia.org/report/aqs2008.codfw.wmnet/7cd6b89e098a1f33778dd68d20cfdab1a5e4ab18). I've opened this WIP patch (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014100) to play with PCC, and I can get it back [19:12:41] to compiling, but I think I need to get the VIP removed from those hosts before [19:13:15] let's see [19:15:07] this definitely seems right [19:15:37] (03PS3) 10Ssingh: aqs: fix puppet compilation error [puppet] - 10https://gerrit.wikimedia.org/r/1014100 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [19:15:37] (03CR) 10Ssingh: [C:03+1] "Sorry for missing this but yes it does look right. +1" [puppet] - 10https://gerrit.wikimedia.org/r/1014100 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [19:16:13] this puppet patch won't remove the VIP from the hosts though, so I guess we'll need to do that manually via cumin [19:16:29] yes, that should be manual [19:16:52] but since these are not behind LVS anymore but the host is still there for Cassandra, your patch is correct [19:16:54] for context, we won't decommission those hosts. Each had 2 services running on them: aqs and cassandra. 
While aqs is now retired, cassandra stays [19:17:00] yeah, makes sense [19:17:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:17] (03CR) 10Brouberol: [V:03+1 C:03+2] aqs: fix puppet compilation error [puppet] - 10https://gerrit.wikimedia.org/r/1014100 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [19:17:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:17:41] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_aqs.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:18:46] (03PS3) 10Majavah: P:toolforge::legacy_redirector: add www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1014025 (https://phabricator.wikimedia.org/T311909) [19:18:46] (03PS3) 10Majavah: P:toolforge::legacy_redirector: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014026 (https://phabricator.wikimedia.org/T311909) [19:18:47] (03PS3) 10Majavah: Remove old toolserver_legacy code [puppet] - 10https://gerrit.wikimedia.org/r/1014027 [19:19:34] alright, puppet is back 
to a working state on aqs hosts [19:19:41] nice! that should wrap it up [19:20:17] so I think we only have to remove the VIP from the hosts now, and I guess (#2) I should probably remove the VIP from netbox as well (but if so, that'll be tomorrow) [19:20:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:22:04] is there anything else than `ip addr del 10.2.2.12/32 dev lo` on aqs hosts via cumin? [19:22:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:22:24] s/else/more to it/ [19:23:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 35.35% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:24:45] brouberol: the VIP will vary by site; 10.2.1.12 for codfw and 10.2.2.12 for eqiad [19:26:39] (03PS2) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [19:26:39] (03PS1) 10Majavah: k8s: Remove use of @staticmethod in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014106 [19:27:30] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.74% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:28:32] !log removing VIP from AQS hosts - T358793 [19:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:37] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793 [19:28:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2037.codfw.wmnet [19:29:29] !depool elastic2037: host is pooled but decommed [19:29:29] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [19:29:35] ha [19:29:42] !log depool elastic2037: host is pooled but decommed [19:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:21] (03PS10) 10Gmodena: Add webrequest.frontend.rc0 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [19:31:50] 06SRE, 10procurement: 14add contract end dates to the ops maint & contract gcal - 14https://phabricator.wikimedia.org/T84585#9659175 (10RobH) 05Open→03Invalid 14I no longer track contracts, those are handled via Coupa which has end date tracking. This old request is now invalid. SLA note: Please pl... 
[19:32:41] (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_codfw_aqs.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:37:32] (03PS2) 10Zoranzoki21: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014073 (https://phabricator.wikimedia.org/T360533) [19:39:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 35.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:42:11] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9659204 (10DBu-WMF) p:05Medium→03High Can I please have access to DMARC Digests as soon as possible. We are starting to see deliverability issues at Google Postmaster. Whoever has the DMARC Diges... [19:44:24] (03CR) 10Hashar: [C:03+1] Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014073 (https://phabricator.wikimedia.org/T360533) (owner: 10Zoranzoki21) [19:47:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9659214 (10DBu-WMF) looks like I do not have access to ticket T330944. Can someone please grant me access. 
[19:49:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:54:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 34.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:54:29] !log Remove wikibase-otherprojects from user preferences (user_properties) # T342264 [19:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:34] T342264: Remove wikibase-otherprojects from user preferences (user_properties) - https://phabricator.wikimedia.org/T342264 [19:55:15] (03PS3) 10Zoranzoki21: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014073 (https://phabricator.wikimedia.org/T360533) [19:59:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:52] (03CR) 10Dzahn: [C:03+1] "manually tested the change on phab2002 as well.. restarts fine." [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T2000). [20:00:05] thedj, kimberly_sarabia , and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] (03CR) 10Dzahn: [C:03+2] prometheus/apache_exporter: drop argument parameter [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [20:00:52] o/ [20:00:56] i can deploy [20:01:26] if anyone is around lol [20:01:39] * urbanecm waves too [20:01:54] * cjming waves to urbanecm [20:02:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:02:21] Hello [20:02:45] !log deploying change to prometheus-apache-exporter that will make it work on all distro versions incl bookworm, due to changed argument syntax [20:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:12] thedj: are you around? 
otherwise I'll start with Kim's patch [20:03:24] (03PS11) 10Clare Ming: Remove X-Webkit-CSP-Report-Only response header from foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [20:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1013645 (https://phabricator.wikimedia.org/T360781) (owner: 10Jdrewniak) [20:04:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:05:59] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2024.codfw.wmnet on all recursors [20:06:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2024.codfw.wmnet on all recursors [20:06:40] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove restbase node IPv6 dns records - cmooney@cumin1002" [20:07:11] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2024.codfw.wmnet on all recursors [20:07:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2024.codfw.wmnet on all recursors [20:07:16] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2025.codfw.wmnet on all recursors [20:07:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2025.codfw.wmnet on all recursors [20:07:20] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2026.codfw.wmnet on all recursors [20:07:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2026.codfw.wmnet on all recursors [20:07:24] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2027.codfw.wmnet on all recursors [20:07:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2027.codfw.wmnet on all recursors 
[20:07:29] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2028.codfw.wmnet on all recursors [20:07:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove restbase node IPv6 dns records - cmooney@cumin1002" [20:07:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:07:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2028.codfw.wmnet on all recursors [20:07:33] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2029.codfw.wmnet on all recursors [20:07:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2029.codfw.wmnet on all recursors [20:07:38] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2030.codfw.wmnet on all recursors [20:07:39] \o [20:07:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2030.codfw.wmnet on all recursors [20:07:42] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2031.codfw.wmnet on all recursors [20:07:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2031.codfw.wmnet on all recursors [20:07:47] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2032.codfw.wmnet on all recursors [20:07:50] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2032.codfw.wmnet on all recursors [20:07:51] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2033.codfw.wmnet on all recursors [20:07:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2033.codfw.wmnet on all recursors [20:07:56] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2034.codfw.wmnet on all recursors [20:07:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
restbase2034.codfw.wmnet on all recursors [20:08:01] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase2035.codfw.wmnet on all recursors [20:08:04] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase2035.codfw.wmnet on all recursors [20:08:05] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1031.eqiad.wmnet on all recursors [20:08:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1031.eqiad.wmnet on all recursors [20:08:11] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1032.eqiad.wmnet on all recursors [20:08:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1032.eqiad.wmnet on all recursors [20:08:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:08:15] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1033.eqiad.wmnet on all recursors [20:08:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1033.eqiad.wmnet on all recursors [20:08:20] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1034.eqiad.wmnet on all recursors [20:08:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1034.eqiad.wmnet on all recursors [20:08:24] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1035.eqiad.wmnet on all recursors [20:08:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1035.eqiad.wmnet on all recursors [20:08:29] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1036.eqiad.wmnet on all recursors [20:08:32] 
(03PS1) 10Dzahn: Revert "Revert "planet: add prometheus apache exporter to role"" [puppet] - 10https://gerrit.wikimedia.org/r/1014074 [20:08:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1036.eqiad.wmnet on all recursors [20:08:33] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1037.eqiad.wmnet on all recursors [20:08:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1037.eqiad.wmnet on all recursors [20:08:38] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1038.eqiad.wmnet on all recursors [20:08:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1038.eqiad.wmnet on all recursors [20:08:42] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1039.eqiad.wmnet on all recursors [20:08:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1039.eqiad.wmnet on all recursors [20:08:47] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1040.eqiad.wmnet on all recursors [20:08:50] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1040.eqiad.wmnet on all recursors [20:08:51] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1041.eqiad.wmnet on all recursors [20:08:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1041.eqiad.wmnet on all recursors [20:08:56] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache restbase1042.eqiad.wmnet on all recursors [20:08:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1042.eqiad.wmnet on all recursors [20:11:35] !log pool restbase10[31-33] — T360597 [20:11:36] (03CR) 10Dzahn: [C:03+2] Revert "Revert "planet: add prometheus apache exporter to role"" [puppet] - 10https://gerrit.wikimedia.org/r/1014074 (owner: 10Dzahn) [20:11:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:42] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [20:15:27] (03PS1) 10MusikAnimal: [officewiki, testwiki]: enable CodeMirrorV6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014113 (https://phabricator.wikimedia.org/T357795) [20:16:09] !log zabe@mwmaint1002:~$ mwscript namespaceDupes.php --wiki thwikibooks --move-talk --fix # T360715 [20:16:11] (03CR) 10CI reject: [V:04-1] [officewiki, testwiki]: enable CodeMirrorV6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014113 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [20:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:14] T360715: Run namespaceDupes.php on thwikibooks (talk pages with Wikijunior namespace prefix) - https://phabricator.wikimedia.org/T360715 [20:16:49] (03PS2) 10MusikAnimal: [officewiki, testwiki]: enable CodeMirrorV6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014113 (https://phabricator.wikimedia.org/T357795) [20:23:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 32.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:23:45] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 34.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:24:47] (03Merged) 10jenkins-bot: Guard against undefined $container element in initMobile.js [skins/MinervaNeue] (wmf/1.42.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1013645 (https://phabricator.wikimedia.org/T360781) (owner: 10Jdrewniak) [20:25:03] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1013645|Guard against undefined $container element in initMobile.js (T360781)]] [20:25:08] T360781: [MinervaNeue] Guard against undefined in initMediaViewer() - https://phabricator.wikimedia.org/T360781 [20:27:31] !log cjming@deploy1002 cjming and jdrewniak: Backport for [[gerrit:1013645|Guard against undefined $container element in initMobile.js (T360781)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:27:37] kimberly_sarabia: can you test? [20:27:49] Yep, one moment [20:28:30] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:31:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.15% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:31:36] cjming: LGTM [20:31:41] cool - syncing [20:31:44] !log cjming@deploy1002 cjming and jdrewniak: Continuing with sync [20:33:49] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9659379 (10Tgr) >>! In T360596#9652082, @Krinkle wrote: > In MediaWiki (as deployed at WMF), there exists 1 use of Redis,... 
[20:33:54] !log pool restbase10[34-42] — T360597 [20:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:58] thedj: lmk if you are around, happy to do your patch but will continue with queue [20:34:00] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [20:34:57] (03PS2) 10Clare Ming: Cirrus: testcommonswiki only needs 1 shard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013069 (owner: 10Ebernhardson) [20:35:00] hi ebernhardson: i'll do yours next [20:35:40] (03PS1) 10Zabe: CONTRIBUTORS: Add Framawiki [puppet] - 10https://gerrit.wikimedia.org/r/1014115 [20:36:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.27% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:39:02] (03PS4) 10Ladsgroup: mediawiki: Get rid of purge flaggedrevs [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) [20:39:18] (03PS5) 10Ladsgroup: mediawiki: Get rid of purge flaggedrevs [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) [20:39:28] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [20:39:54] hello. 
I have a registered nick again ;) [20:42:48] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1013645|Guard against undefined $container element in initMobile.js (T360781)]] (duration: 17m 45s) [20:42:53] T360781: [MinervaNeue] Guard against undefined in initMediaViewer() - https://phabricator.wikimedia.org/T360781 [20:43:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013069 (owner: 10Ebernhardson) [20:43:41] (03CR) 10Ladsgroup: [C:03+2] mediawiki: Get rid of purge flaggedrevs [puppet] - 10https://gerrit.wikimedia.org/r/1013022 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [20:44:08] (03Merged) 10jenkins-bot: Cirrus: testcommonswiki only needs 1 shard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013069 (owner: 10Ebernhardson) [20:44:22] thedj: hi ! i will do your patch in a bit - just finishing up the rest of the queue [20:44:25] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1013069|Cirrus: testcommonswiki only needs 1 shard]] [20:45:21] ebernhardson: are either of your patches testable? [20:45:52] !log depooling restbase10[19-21].eqiad.wmnet — T360597 [20:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:56] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 [20:46:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 30.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:46:57] !log cjming@deploy1002 cjming and ebernhardson: Backport for [[gerrit:1013069|Cirrus: testcommonswiki only needs 1 shard]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:24] ebernhardson: i'm inclined to just sync your patches - are you still around? 
[20:48:31] (03PS1) 10Dzahn: prometheus/ops: add config for scraping apache metrics on planet servers [puppet] - 10https://gerrit.wikimedia.org/r/1014116 (https://phabricator.wikimedia.org/T359556) [20:48:32] !log cjming@deploy1002 cjming and ebernhardson: Continuing with sync [20:48:58] (03PS4) 10Clare Ming: cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [20:49:02] (03CR) 10Majavah: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1014115 (owner: 10Zabe) [20:49:04] (03CR) 10Majavah: [C:03+2] CONTRIBUTORS: Add Framawiki [puppet] - 10https://gerrit.wikimedia.org/r/1014115 (owner: 10Zabe) [20:51:35] (03CR) 10CI reject: [V:04-1] prometheus/ops: add config for scraping apache metrics on planet servers [puppet] - 10https://gerrit.wikimedia.org/r/1014116 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [20:52:10] (03PS2) 10Dzahn: prometheus/ops: add config for scraping apache metrics on planet servers [puppet] - 10https://gerrit.wikimedia.org/r/1014116 (https://phabricator.wikimedia.org/T359556) [20:53:36] kimberly_sarabia: your patch should be live! [20:58:34] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9659518 (10andrea.denisse) [20:59:20] (03CR) 10Majavah: [C:03+2] P:toolforge::legacy_redirector: add www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1014025 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [20:59:31] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1013069|Cirrus: testcommonswiki only needs 1 shard]] (duration: 15m 05s) [20:59:35] ebernhardson: your 1st patch is live, i can go ahead and do your 2nd one if you're still around? [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240325T2100). [21:00:42] thedj: i'll do yours real quick if you are still around? [21:01:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.46% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:01:28] cjming: sorry! looking now [21:01:28] cjming i'm around [21:02:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [21:02:07] cjming: all looks reasonable [21:02:18] thedj: lmk if yours can be tested [21:02:51] ebernhardson: i can do your 2nd patch here shortly - can that one be tested? or should i just sync it [21:03:07] (03Merged) 10jenkins-bot: Remove X-Webkit-CSP-Report-Only response header from foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [21:03:21] cjming: it only changes the job queue, but it should be fine we've deployed this same config a few weeks ago and reverted it for unrelated reasons [21:03:23] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1003108|Remove X-Webkit-CSP-Report-Only response header from foundationwiki (T357479)]] [21:03:25] so it's untestable [21:03:28] T357479: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479 [21:03:53] ebernhardson: sounds good [21:05:20] cjming we should see that header disappear from simple page requests on foundation wiki [21:05:27] easy to test [21:05:37] (03PS4) 10Majavah: P:toolforge::legacy_redirector: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014026 (https://phabricator.wikimedia.org/T311909) [21:05:37] (03PS4) 10Majavah: Remove old toolserver_legacy code [puppet] - 
10https://gerrit.wikimedia.org/r/1014027 [21:05:53] !log cjming@deploy1002 hartman and cjming: Backport for [[gerrit:1003108|Remove X-Webkit-CSP-Report-Only response header from foundationwiki (T357479)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:00] thedj: shall i sync? [21:06:06] yep [21:06:10] !log cjming@deploy1002 hartman and cjming: Continuing with sync [21:06:20] !log Phabricator - added @Arian_Bozorg and @Fring to WMF-NDA group after confirming they have an NDA on file but had to be added to the legal spreadsheet (T358578) [21:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:27] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [21:08:14] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1014116 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [21:09:24] security deployers: hope it's ok the backport window is going a little over -- just one more config change to do if that's ok [21:13:24] (03PS5) 10Majavah: P:toolforge::legacy_redirector: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014026 (https://phabricator.wikimedia.org/T311909) [21:13:24] (03PS5) 10Majavah: Remove old toolserver_legacy code [puppet] - 10https://gerrit.wikimedia.org/r/1014027 [21:14:16] (03CR) 10Dzahn: [C:03+1] [gitlab] Restart wmf_auto_restart_ssh on rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/1014102 (owner: 10EoghanGaffney) [21:14:46] (03CR) 10Dzahn: [C:03+2] prometheus/ops: add config for scraping apache metrics on planet servers [puppet] - 10https://gerrit.wikimedia.org/r/1014116 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [21:15:57] @cjming confirmed [21:16:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.3% idle - https://bit.ly/wmf-fpmsat - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:17:34] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1003108|Remove X-Webkit-CSP-Report-Only response header from foundationwiki (T357479)]] (duration: 14m 10s) [21:17:38] T357479: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479 [21:18:17] thedj: cool - should be live [21:18:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [21:19:38] (03PS1) 10Majavah: P:toolforge: drop grid shutdown from MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1014118 [21:19:40] (03Merged) 10jenkins-bot: cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [21:19:56] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1006570|cirrus: Transition remaining cloudelastic wikis to streaming updater (T358518)]] [21:20:00] T358518: Deploy streaming updater for 100% of writes to cloudelastic - https://phabricator.wikimedia.org/T358518 [21:20:29] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9659584 (10andrea.denisse) [21:21:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.3% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:22:15] !log cjming@deploy1002 ebernhardson and cjming: Backport for [[gerrit:1006570|cirrus: Transition remaining cloudelastic wikis to streaming updater (T358518)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:22:28] !log cjming@deploy1002 ebernhardson and cjming: Continuing with sync 
[21:24:56] (03CR) 10EoghanGaffney: [C:03+1] gitlab_runner: unregister gitlab-runner2004 for dockerfile conversion [puppet] - 10https://gerrit.wikimedia.org/r/1014005 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [21:30:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:32:06] hmm, checking [21:32:25] it could just be that it's clearing out jobs in the queue that are no-longer valid because the cluster is removed [21:33:27] yes indeed, the spike in messages was 'Received cirrusSearchElasticaWrite job with pages updates for an unwritable cluster'. They ran momentarily, and have subsided as no new jobs are being enqueuede [21:33:41] * ebernhardson fails spelling [21:33:44] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1006570|cirrus: Transition remaining cloudelastic wikis to streaming updater (T358518)]] (duration: 13m 48s) [21:33:48] ebernhardson: 2nd patch should be live now [21:33:49] T358518: Deploy streaming updater for 100% of writes to cloudelastic - https://phabricator.wikimedia.org/T358518 [21:33:57] (03PS1) 10Cwhite: opensearch: define elasticsearch curator version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1014047 (https://phabricator.wikimedia.org/T352517) [21:34:01] cjming: it looks all good, thanks! [21:34:06] np! 
[21:34:15] !log end of UTC late backport window [21:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:52] (03CR) 10RLazarus: [C:03+2] mediawiki: Add a comment annotation for mwscript jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012802 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:35:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:36:38] (03Merged) 10jenkins-bot: mediawiki: Add a comment annotation for mwscript jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012802 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:39:21] (03CR) 10RLazarus: [C:03+2] deployment_server: Label and annotation improvements for mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1012803 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:46:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:49:18] (03PS1) 10Cwhite: logstash: remove configuration for logstash101[012] [puppet] - 10https://gerrit.wikimedia.org/r/1014048 (https://phabricator.wikimedia.org/T360950) [21:54:22] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1011.eqiad.wmnet [21:54:46] (03PS1) 10Tim Starling: Special:BlockList: apply simpler conditions when listing user blocks [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1014075 (https://phabricator.wikimedia.org/T360864) [21:59:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1002 using scap 
backport" [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1014075 (https://phabricator.wikimedia.org/T360864) (owner: 10Tim Starling) [22:00:39] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [22:02:52] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [22:03:16] !log phabricator - added DBu-WMF (Danny Bu) to WMF-NDA - T356920 [22:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:19] T356920: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 [22:04:12] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9659677 (10Dzahn) >>! In T356920#9659214, @DBu-WMF wrote: > looks like I do not have access to ticket T330944. Can someone please grant me access. I tried this on February 8 by asking on this ticket... 
[22:04:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [22:04:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:49] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts logstash1011.eqiad.wmnet [22:10:52] (03PS1) 10Cwhite: beta-logs: configure logging-opensearch-hdd-01 [puppet] - 10https://gerrit.wikimedia.org/r/1014049 (https://phabricator.wikimedia.org/T352517) [22:12:16] (03CR) 10Cwhite: [C:03+2] beta-logs: configure logging-opensearch-hdd-01 [puppet] - 10https://gerrit.wikimedia.org/r/1014049 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [22:18:36] (03CR) 10Cwhite: [C:03+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1014047/1717/" [puppet] - 10https://gerrit.wikimedia.org/r/1014047 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [22:20:52] (03CR) 10Dzahn: [C:03+2] "I see _some_ data trickling in now here: https://grafana.wikimedia.org/d/xcoGtTASz/planet?orgId=1&var-node=planet1003" [puppet] - 10https://gerrit.wikimedia.org/r/1014116 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [22:21:59] (03Merged) 10jenkins-bot: Special:BlockList: apply simpler conditions when listing user blocks [core] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1014075 (https://phabricator.wikimedia.org/T360864) (owner: 10Tim Starling) [22:22:15] !log tstarling@deploy1002 Started scap: Backport for [[gerrit:1014075|Special:BlockList: apply simpler conditions when listing user blocks (T360864)]] [22:22:19] T360864: Slow query in Special:BlockList with new block schema - https://phabricator.wikimedia.org/T360864 [22:24:41] !log tstarling@deploy1002 tstarling: Backport for [[gerrit:1014075|Special:BlockList: apply simpler conditions when listing user 
blocks (T360864)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:25:28] !log tstarling@deploy1002 tstarling: Continuing with sync [22:27:58] (03PS2) 10Kimberly Sarabia: Update mediawiki.web_ui_actions stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013234 (https://phabricator.wikimedia.org/T353029) (owner: 10Phuedx) [22:30:32] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9659745 (10andrea.denisse) [22:34:40] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@530e786]: Refine through Airflow POC [airflow-dags/analytics@530e7863] [22:35:08] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@530e786]: Refine through Airflow POC [airflow-dags/analytics@530e7863] (duration: 00m 28s) [22:36:34] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@530e786]: Keep analytics_test up to date [airflow-dags/analytics_test@530e7863] [22:36:44] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@530e786]: Keep analytics_test up to date [airflow-dags/analytics_test@530e7863] (duration: 00m 10s) [22:36:53] !log tstarling@deploy1002 Finished scap: Backport for [[gerrit:1014075|Special:BlockList: apply simpler conditions when listing user blocks (T360864)]] (duration: 14m 38s) [22:36:57] T360864: Slow query in Special:BlockList with new block schema - https://phabricator.wikimedia.org/T360864 [22:43:43] (03PS2) 10EoghanGaffney: [gitlab] Lock backups/restores on switch_from host after backup creation [cookbooks] - 10https://gerrit.wikimedia.org/r/1014103 [22:49:43] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Restart wmf_auto_restart_ssh on rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/1014102 (owner: 10EoghanGaffney) [22:49:57] (03PS3) 10Kimberly Sarabia: Update mediawiki.web_ui_actions stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013234 
(https://phabricator.wikimedia.org/T360955) (owner: 10Phuedx) [22:55:37] (03Merged) 10jenkins-bot: [gitlab] Restart wmf_auto_restart_ssh on rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/1014102 (owner: 10EoghanGaffney) [23:01:34] (03PS1) 10Dzahn: ssl: delete ticket-test.discovery.wmnet cert, not used [puppet] - 10https://gerrit.wikimedia.org/r/1014130 (https://phabricator.wikimedia.org/T360413) [23:02:46] (03CR) 10Dzahn: [C:03+2] ssl: delete ticket-test.discovery.wmnet cert, not used [puppet] - 10https://gerrit.wikimedia.org/r/1014130 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [23:02:46] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:02:52] (03PS1) 10Dzahn: delete ticket-test.discovery.wmnet dummy key, not used [labs/private] - 10https://gerrit.wikimedia.org/r/1014131 (https://phabricator.wikimedia.org/T360413) [23:04:22] (03CR) 10Dzahn: [V:03+2 C:03+2] delete ticket-test.discovery.wmnet dummy key, not used [labs/private] - 10https://gerrit.wikimedia.org/r/1014131 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [23:04:38] (03PS2) 10Dzahn: delete ticket-test.discovery.wmnet dummy key, not used [labs/private] - 10https://gerrit.wikimedia.org/r/1014131 (https://phabricator.wikimedia.org/T360413) [23:05:00] (03CR) 10Dzahn: [V:03+2 C:03+2] delete ticket-test.discovery.wmnet dummy key, not used [labs/private] - 10https://gerrit.wikimedia.org/r/1014131 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [23:06:37] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9659893 (10Dzahn) [23:16:16] (03PS1) 10Dzahn: ci: switch envoy 
SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) [23:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:17:55] (03PS2) 10Dzahn: ci: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) [23:18:18] (03CR) 10Dzahn: "@hashar just fyi this is happening, no worries, we have already done this for a bunch of other services without issues" [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [23:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:29:21] (03PS1) 10Dzahn: delete apt-staging.discovery dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1014133 (https://phabricator.wikimedia.org/T360413) [23:32:16] (03CR) 10Dzahn: "Just deleted the files from the private repo that were only used until this change:" [puppet] - 10https://gerrit.wikimedia.org/r/973323 (owner: 10EoghanGaffney) [23:33:46] (03PS2) 10Dzahn: delete apt-staging.discovery dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1014133 (https://phabricator.wikimedia.org/T360413) [23:41:43] (03CR) 10Dzahn: [V:03+2 C:03+2] delete apt-staging.discovery dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/1014133 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [23:47:31] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9660002 
(10Dzahn) [23:47:39] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9660003 (10Dzahn)