[00:24:15] !log sukhe@cumin2002 dbctl commit (dc=all): 'depool db1246', diff saved to https://phabricator.wikimedia.org/P61073 and previous config saved to /var/cache/conftool/dbconfig/20240423-002413-sukhe.json [00:24:26] ^ got a page for this so depooled [00:24:43] it recovered but I don't think all is well [00:26:46] 10ops-eqiad, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119 (10ssingh) 03NEW [01:03:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T363120 (10phaultfinder) 03NEW [01:07:49] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.2 [core] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1022534 (https://phabricator.wikimedia.org/T361396) [01:07:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.2 [core] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1022534 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [01:27:45] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.2 [core] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1022534 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [02:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240422T0700) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T0200) [02:30:49] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [02:38:51] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240422T0700) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T0300) [03:01:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T352010)', diff saved to https://phabricator.wikimedia.org/P61074 and previous config saved to /var/cache/conftool/dbconfig/20240423-030106-ladsgroup.json [03:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:01:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:03:51] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:43] !log mwpresync@deploy1002 Pruned MediaWiki: 1.42.0-wmf.25 (duration: 05m 37s) [03:16:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P61075 and previous config saved to /var/cache/conftool/dbconfig/20240423-031613-ladsgroup.json [03:31:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P61076 and 
previous config saved to /var/cache/conftool/dbconfig/20240423-033120-ladsgroup.json [03:46:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T352010)', diff saved to https://phabricator.wikimedia.org/P61077 and previous config saved to /var/cache/conftool/dbconfig/20240423-034628-ladsgroup.json [03:46:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [03:46:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:46:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [03:46:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T352010)', diff saved to https://phabricator.wikimedia.org/P61078 and previous config saved to /var/cache/conftool/dbconfig/20240423-034652-ladsgroup.json [05:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:42] (03PS1) 10KartikMistry: CX: Initialize publishNamespace for CXTarget [extensions/ContentTranslation] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023148 (https://phabricator.wikimedia.org/T349959) [06:05:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - 
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:13:01] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018409 (owner: 10PipelineBot) [06:13:04] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021400 (owner: 10PipelineBot) [06:13:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021403 (owner: 10PipelineBot) [06:14:16] (03CR) 10Slyngshede: [C:03+2] Fix issue where navigation bar jumps around in height. [software/bitu] - 10https://gerrit.wikimedia.org/r/1022501 (https://phabricator.wikimedia.org/T360520) (owner: 10Slyngshede) [06:15:54] (03Merged) 10jenkins-bot: Fix issue where navigation bar jumps around in height. [software/bitu] - 10https://gerrit.wikimedia.org/r/1022501 (https://phabricator.wikimedia.org/T360520) (owner: 10Slyngshede) [06:16:17] (03CR) 10Jgiannelos: "Is this patch still valid or have we enabled read views already on labs in a different way?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999060 (https://phabricator.wikimedia.org/T357054) (owner: 10C. 
Scott Ananian) [06:16:45] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017464 (owner: 10PipelineBot) [06:17:48] (03PS2) 10Sbailey: wikifeeds: upgrade to node18 from node16 deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) [06:30:49] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:37:50] (03CR) 10Muehlenhoff: [C:03+2] Switch matomo role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021899 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [06:40:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:42:09] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9734108 (10MoritzMuehlenhoff) [06:43:41] (03PS1) 10Muehlenhoff: Remove access for dbad2021 [puppet] - 10https://gerrit.wikimedia.org/r/1023317 [06:44:29] (03CR) 10CI reject: [V:04-1] Remove access for dbad2021 [puppet] - 10https://gerrit.wikimedia.org/r/1023317 (owner: 10Muehlenhoff) [06:50:38] (03PS2) 10Muehlenhoff: Remove access for dbad2021 [puppet] - 10https://gerrit.wikimedia.org/r/1023317 [06:51:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T352010)', diff saved to https://phabricator.wikimedia.org/P61079 and previous config saved to /var/cache/conftool/dbconfig/20240423-065136-ladsgroup.json [06:51:57] 
T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:52:18] (03CR) 10Muehlenhoff: [C:03+2] Remove access for dbad2021 [puppet] - 10https://gerrit.wikimedia.org/r/1023317 (owner: 10Muehlenhoff) [06:53:28] (03PS3) 10Jgiannelos: wikifeeds: upgrade to node18 from node16 deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [07:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T0700). [07:00:05] cscott and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:03:52] (JobUnavailable) firing: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P61080 and previous config saved to /var/cache/conftool/dbconfig/20240423-070643-ladsgroup.json [07:08:07] cscott seems not around? [07:08:32] I can +2 my patch meanwhile or start deployment.. 
[07:09:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023148 (https://phabricator.wikimedia.org/T349959) (owner: 10KartikMistry) [07:14:08] (03CR) 10Jgiannelos: "I changed the order of the arguments because the dns arg should be passed to the node CLI binary not the server.js" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [07:15:00] jouncebot: now [07:15:00] For the next 0 hour(s) and 44 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T0700) [07:16:35] kart_: i will do c scott patch [07:16:51] and I am +2ing it right now to gain some time due to CI :) [07:17:40] (03CR) 10Jgiannelos: "@hnowlan Since the problem looks like DNS related I would prefer if someone from serviceops took a look first before merging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [07:17:40] (03CR) 10Hashar: [C:03+2] ParserOutput: don't complain if TOCHTML is unset from ParserCache [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023100 (https://phabricator.wikimedia.org/T363107) (owner: 10C. Scott Ananian) [07:19:38] hashar: sure. 
[07:19:40] (03CR) 10Fabfur: [C:03+2] benthos/haproxy: using hiera aliases for benthos socket address [puppet] - 10https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [07:21:31] hashar: kart_: hi, please ping me when you're done, I have a patch of my own to deploy [07:21:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P61081 and previous config saved to /var/cache/conftool/dbconfig/20240423-072151-ladsgroup.json [07:22:49] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2115.codfw.wmnet [07:24:42] (03PS1) 10Muehlenhoff: Switch db2115 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023323 (https://phabricator.wikimedia.org/T349619) [07:26:06] (03PS1) 10Fabfur: hiera: applied benthos socket address alias to upload cluster too [puppet] - 10https://gerrit.wikimedia.org/r/1023324 (https://phabricator.wikimedia.org/T358109) [07:26:48] (03Merged) 10jenkins-bot: CX: Initialize publishNamespace for CXTarget [extensions/ContentTranslation] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023148 (https://phabricator.wikimedia.org/T349959) (owner: 10KartikMistry) [07:26:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db2115 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023323 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:27:39] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1023148|CX: Initialize publishNamespace for CXTarget (T349959)]] [07:27:44] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [07:29:48] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2070/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023324 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) 
[07:30:02] (03CR) 10Fabfur: [V:03+1 C:03+2] hiera: applied benthos socket address alias to upload cluster too [puppet] - 10https://gerrit.wikimedia.org/r/1023324 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [07:30:19] taavi: sure [07:31:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2115.codfw.wmnet [07:34:30] kart_: your patch has merged :) [07:36:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T352010)', diff saved to https://phabricator.wikimedia.org/P61082 and previous config saved to /var/cache/conftool/dbconfig/20240423-073658-ladsgroup.json [07:37:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [07:37:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [07:37:15] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:38:02] (03Merged) 10jenkins-bot: ParserOutput: don't complain if TOCHTML is unset from ParserCache [core] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023100 (https://phabricator.wikimedia.org/T363107) (owner: 10C. Scott Ananian) [07:38:23] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2131.codfw.wmnet [07:38:45] (03CR) 10DCausse: "lgtm!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [07:41:49] (03CR) 10DCausse: [C:03+2] search: Wait for young pool alert to fail for 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/1013575 (owner: 10Ebernhardson) [07:42:09] !log kartik@deploy1002 kartik: Backport for [[gerrit:1023148|CX: Initialize publishNamespace for CXTarget (T349959)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:42:26] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [07:43:17] (03PS1) 10Muehlenhoff: Switch db2131 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023371 (https://phabricator.wikimedia.org/T349619) [07:43:21] (03Merged) 10jenkins-bot: search: Wait for young pool alert to fail for 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/1013575 (owner: 10Ebernhardson) [07:44:20] (03PS1) 10Fabfur: benthos: add termination_state metric like the one provided by mtail [puppet] - 10https://gerrit.wikimedia.org/r/1023372 (https://phabricator.wikimedia.org/T361845) [07:48:44] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2071/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023372 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [07:49:14] (03CR) 10Muehlenhoff: [C:03+2] Switch db2131 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023371 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:51:33] kart_: have you finished the sync or are you still testing? 
:) [07:58:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2131.codfw.wmnet [07:59:13] !log kartik@deploy1002 kartik: Continuing with sync [07:59:37] :) [08:03:34] hashar: big testing done :) [08:03:56] one day, we will have to work on speeding up that process [08:04:08] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos: add termination_state metric like the one provided by mtail [puppet] - 10https://gerrit.wikimedia.org/r/1023372 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [08:04:09] +1 [08:04:17] CI and scap deployment take wayyy tooooo loooong [08:04:47] !log restore sre business hour escalation policy - T350192 [08:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:55] T350192: On-call batphone escalation configuration holidays FY2023-24 - https://phabricator.wikimedia.org/T350192 [08:05:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2191.codfw.wmnet [08:06:35] (03PS1) 10Muehlenhoff: Switch db2191 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023374 (https://phabricator.wikimedia.org/T349619) [08:09:13] (03CR) 10Muehlenhoff: [C:03+2] Switch db2191 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023374 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:09:22] 10ops-eqiad, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9734301 (10ABran-WMF) 05Open→03In progress [08:12:32] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1023148|CX: Initialize publishNamespace for CXTarget (T349959)]] (duration: 44m 53s) [08:12:39] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [08:12:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org [08:13:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2191.codfw.wmnet 
[08:15:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:32] I am doing cscott one [08:15:37] and after that I am fixing the train [08:15:53] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1023100|ParserOutput: don't complain if TOCHTML is unset from ParserCache (T363107)]] [08:15:59] T363107: Potential logspam on rollback of 1.43-wmf.2 - https://phabricator.wikimedia.org/T363107 [08:16:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org [08:18:00] (03CR) 10Muehlenhoff: [C:04-1] "This patch isn't the correct fix. thirdparty/kubeadm-k8s-1-15 isn't affected by the buster-backports archival" [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (https://phabricator.wikimedia.org/T362518) (owner: 10Dzahn) [08:19:15] 08:16:52 docker_pull_k8s: 99% (in-flight: 1; ok: 350; fail: 0; left: 0) \ [08:19:15] ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out [08:19:22] hashar: just saw this: `08:12:32 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kartik', 'Backport for [[gerrit:1023148|CX: Initialize publishNamespace for CXTarget (T349959)]]']' returned non-zero exit status 1.` [08:19:22] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [08:19:26] I am filing that one since that broke the mediawiki train presync [08:19:29] yeah [08:19:39] I think the exit 1 you got is due to the above [08:19:50] and that is what made the MediaWiki train error out [08:19:57] my guess is that specific host is faulty [08:20:23] !log hashar@deploy1002 hashar and cscott: Backport for [[gerrit:1023100|ParserOutput: don't complain if 
TOCHTML is unset from ParserCache (T363107)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:23] !log hashar@deploy1002 Sync cancelled. [08:21:04] that is https://phabricator.wikimedia.org/T363086 I guess [08:21:25] parse1002's management interface is unreachable too :/ [08:22:07] (03CR) 10Muehlenhoff: [C:04-1] "codesearch is broken because it pulls in iptables from buster-backports here: https://github.com/wikimedia/operations-puppet/blob/producti" [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (https://phabricator.wikimedia.org/T362518) (owner: 10Dzahn) [08:23:08] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9734315 (10hashar) [08:23:14] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9734314 (10hashar) `parse1002.eqiad.wmnet` is down / unreachable but is still in the pool of hosts to deploy tool. That has caused the MediaWiki train to fail over night and is causing every MediaWiki deploy... [08:23:42] taavi: yup the host is dead but still in the list of hosts to deploy to [08:23:45] which is a common problem [08:24:11] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1023100|ParserOutput: don't complain if TOCHTML is unset from ParserCache (T363107)]] [08:24:16] T363107: Potential logspam on rollback of 1.43-wmf.2 - https://phabricator.wikimedia.org/T363107 [08:24:32] hmm, seems like we pull the host list directly from puppetdb? is there an option to exclude a dead host somehow? 
[08:24:49] before mw-on-k8s you could just depool it on conftool, but I don't see a way to do that here [08:26:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2206', diff saved to https://phabricator.wikimedia.org/P61083 and previous config saved to /var/cache/conftool/dbconfig/20240423-082621-arnaudb.json [08:26:37] I have no idea, my guess is scap gets the list of hosts from the dsh groups that are themselves generated by puppet out of etcd data [08:26:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2206.codfw.wmnet with reason: T362746 [08:26:48] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [08:26:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2206.codfw.wmnet with reason: T362746 [08:27:15] (03PS1) 10TChin: Add datasets-config-next namespace to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) [08:27:21] (03PS1) 10TChin: Add datasets-config-next namespace to dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1023377 (https://phabricator.wikimedia.org/T357434) [08:28:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2206.codfw.wmnet [08:28:29] $ grep -R parse1002 /etc/dsh/group [08:28:30] /etc/dsh/group/kubernetes-workers:parse1002.eqiad.wmnet [08:28:37] and that is generated from a PuppetDB query [08:28:39] !log hashar@deploy1002 cscott and hashar: Backport for [[gerrit:1023100|ParserOutput: don't complain if TOCHTML is unset from ParserCache (T363107)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:28:45] !log hashar@deploy1002 cscott and hashar: Continuing with sync [08:31:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2206.codfw.wmnet [08:33:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2215.codfw.wmnet [08:34:17] (03PS1) 10Muehlenhoff: Switch db2215 
to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023378 (https://phabricator.wikimedia.org/T349619) [08:37:15] (03PS1) 10Fabfur: benthos: fix termination_state format [puppet] - 10https://gerrit.wikimedia.org/r/1023379 (https://phabricator.wikimedia.org/T361845) [08:37:32] (03CR) 10Muehlenhoff: [C:03+2] Switch db2215 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023378 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:38:35] (03PS2) 10Brouberol: datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) [08:38:45] (03PS2) 10Brouberol: datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) [08:39:16] (03CR) 10Brouberol: [C:03+1] Add datasets-config-next namespace to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [08:39:29] (03CR) 10Brouberol: [C:03+1] Add datasets-config-next namespace to dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1023377 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [08:40:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P61084 and previous config saved to /var/cache/conftool/dbconfig/20240423-084008-arnaudb.json [08:40:25] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1023100|ParserOutput: don't complain if TOCHTML is unset from ParserCache (T363107)]] (duration: 16m 13s) [08:40:31] T363107: Potential logspam on rollback of 1.43-wmf.2 - https://phabricator.wikimedia.org/T363107 [08:41:18] (03CR) 10Fabfur: [C:03+2] benthos: fix termination_state format [puppet] - 10https://gerrit.wikimedia.org/r/1023379 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [08:41:48] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Depool db2172', diff saved to https://phabricator.wikimedia.org/P61085 and previous config saved to /var/cache/conftool/dbconfig/20240423-084146-arnaudb.json [08:42:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2172.codfw.wmnet with reason: T362746 [08:42:30] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [08:42:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2172.codfw.wmnet with reason: T362746 [08:43:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2172.codfw.wmnet with OS bookworm [08:43:49] (03CR) 10Klausman: "Regarding the order of tokens, I have no particular preference (as long as it's linear in the detailed vs. broad dimension)." [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [08:50:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2002.wikimedia.org [08:54:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2002.wikimedia.org [08:55:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P61086 and previous config saved to /var/cache/conftool/dbconfig/20240423-085514-arnaudb.json [08:57:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2215.codfw.wmnet [08:57:53] (03PS1) 10Filippo Giunchedi: jaeger: upgrade to 1.56 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023381 (https://phabricator.wikimedia.org/T362719) [08:58:45] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: remove es-index-cleaner image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021458 (https://phabricator.wikimedia.org/T344953) (owner: 10Filippo Giunchedi) [08:59:48] (03PS47) 10Klausman: deployment_server: Add Cassandra to 
autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [08:59:49] (03PS2) 10Filippo Giunchedi: jaeger: remove es-index-cleaner image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021458 (https://phabricator.wikimedia.org/T344953) [09:00:07] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: remove es-index-cleaner image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021458 (https://phabricator.wikimedia.org/T344953) (owner: 10Filippo Giunchedi) [09:00:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2172.codfw.wmnet with reason: host reimage [09:02:38] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1179.eqiad.wmnet [09:02:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2172.codfw.wmnet with reason: host reimage [09:04:26] !log delete tags for docker-registry.discovery.wmnet/jaeger-es-index-cleaner - T344953 [09:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:43] T344953: Manage jaeger-* index lifecycle - https://phabricator.wikimedia.org/T344953 [09:04:56] !log Backport & config window completed [09:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:22] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9734474 (10hashar) [09:05:49] (PuppetDisabled) resolved: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:07:31] (03PS1) 10David Caro: puppetserver::wmcs: enable auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/1023383 [09:08:18] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - 
https://phabricator.wikimedia.org/T363086#9734482 (10akosiaris) >>! In T363086#9734314, @hashar wrote: > `parse1002.eqiad.wmnet` is down / unreachable but is still in the pool of hosts to deploy tool. That has caused the MediaWiki train to fail over... [09:08:34] (03CR) 10Filippo Giunchedi: "Generally LGTM, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [09:09:16] (03CR) 10Alexandros Kosiaris: "Out of curiosity, why do you need 1 more namespace for a "staging" environment? Depending on the use case, requirements and wanted user ex" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:10:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P61087 and previous config saved to /var/cache/conftool/dbconfig/20240423-091019-arnaudb.json [09:10:38] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1023383 (owner: 10David Caro) [09:10:42] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2073/console" [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [09:11:22] (03CR) 10Filippo Giunchedi: "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [09:11:59] (03PS48) 10Klausman: deployment_server: Add Cassandra to autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [09:13:11] (03CR) 10Filippo Giunchedi: ncredir,benthos: Provide benthos support on ncredir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [09:13:34] (03CR) 10Alexandros Kosiaris: [C:04-1] "This kinda goes against all current best practices. All libraries default to prioritizing IPv6 (or maybe happy eyeballs), not prioritizing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [09:13:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1023383 (owner: 10David Caro) [09:15:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:35] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2074/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [09:15:42] (03PS1) 10Majavah: hieradata: move cloudvirt2002-dev to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1023384 (https://phabricator.wikimedia.org/T358761) [09:16:45] (03CR) 10Hashar: [C:03+2] wm-zuul-status: filter based solely on change number [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1020840 (https://phabricator.wikimedia.org/T358253) (owner: 10Hashar) [09:17:24] (03Merged) 
10jenkins-bot: wm-zuul-status: filter based solely on change number [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1020840 (https://phabricator.wikimedia.org/T358253) (owner: 10Hashar) [09:17:35] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2076/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023384 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [09:18:11] !log hashar@deploy1002 Started deploy [gerrit/gerrit@8b4ae00]: wm-zuul-status: filter based solely on change number - T358253 [09:18:18] T358253: Gerrit "Checks" permalink to Zuul also matches other changes - https://phabricator.wikimedia.org/T358253 [09:18:18] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@8b4ae00]: wm-zuul-status: filter based solely on change number - T358253 (duration: 00m 07s) [09:18:32] (03PS1) 10Muehlenhoff: Switch db1179 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023385 (https://phabricator.wikimedia.org/T349619) [09:18:57] (03CR) 10Klausman: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [09:23:50] (03CR) 10Brouberol: [C:03+2] Add datasets-config-next namespace to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:24:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2172.codfw.wmnet with OS bookworm [09:24:45] (03CR) 10Brouberol: [C:03+2] "Sweeping in for Thomas here: this seems to be the norm in the DSE k8s cluster, so this is mostly for conformity." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:25:01] (03CR) 10Brouberol: [C:03+2] Add datasets-config-next namespace to dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1023377 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:25:10] (03CR) 10JMeybohm: [C:03+1] build-bare-slim: Date tag wikimedia-buster images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [09:25:18] (03CR) 10Hashar: [C:03+2] logging: always register udp2log handlers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [09:25:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P61088 and previous config saved to /var/cache/conftool/dbconfig/20240423-092525-arnaudb.json [09:26:09] (03Merged) 10jenkins-bot: logging: always register udp2log handlers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [09:26:10] (03CR) 10Muehlenhoff: [C:03+2] Switch db1179 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023385 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:26:50] (03Merged) 10jenkins-bot: Add datasets-config-next namespace to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:27:30] (03CR) 10David Caro: "This is ok, though this does not set the wmf_auto_restart_* process, it just makes the server restart when config file changes happen, I'l" [puppet] - 10https://gerrit.wikimedia.org/r/1023383 (owner: 10David Caro) [09:27:36] (03PS3) 10Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 
(https://phabricator.wikimedia.org/T362661) [09:27:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 5%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61089 and previous config saved to /var/cache/conftool/dbconfig/20240423-092738-arnaudb.json [09:27:40] (KubernetesRsyslogDown) firing: rsyslog on mw2384:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2384 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:28:49] (03CR) 10Klausman: team-ml: Add alerting rule for high error rate in LW services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [09:29:13] (03PS2) 10JMeybohm: Replace wikimedia-buster base images with buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021878 (https://phabricator.wikimedia.org/T362518) [09:29:27] (03PS4) 10Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [09:29:27] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1019253|logging: always register udp2log handlers (T228838)]] [09:29:32] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [09:29:32] (03PS3) 10Brouberol: datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) [09:29:32] (03PS3) 10Brouberol: datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) [09:29:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1179.eqiad.wmnet [09:30:17] (03PS2) 10David Caro: puppetserver::wmcs: enable auto_restart 
[puppet] - 10https://gerrit.wikimedia.org/r/1023383 [09:30:17] (03PS1) 10David Caro: puppetserver::wmcs: add wmf_auto_restarts timer [puppet] - 10https://gerrit.wikimedia.org/r/1023386 [09:31:15] (03PS4) 10Brouberol: datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) [09:31:15] (03PS4) 10Brouberol: datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) [09:31:42] (03PS5) 10Brouberol: datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) [09:31:47] (03PS5) 10Brouberol: datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) [09:31:51] !log hashar@deploy1002 hashar: Backport for [[gerrit:1019253|logging: always register udp2log handlers (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:32:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2384:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2384 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:32:49] (03CR) 10JMeybohm: [V:03+2 C:03+2] Replace wikimedia-buster base images with buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021878 (https://phabricator.wikimedia.org/T362518) (owner: 10JMeybohm) [09:33:02] !log hashar@deploy1002 hashar: Continuing with sync [09:35:50] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1023386 (owner: 10David Caro) [09:36:27] (03CR) 10Majavah: [C:04-1] "hieradata/cloud.yaml sets `profile::puppetserver::auto_restart: true`, so this is not necessary (and I don't think we want to prevent peop" [puppet] - 10https://gerrit.wikimedia.org/r/1023383 (owner: 10David Caro) [09:36:40] (03CR) 10Filippo Giunchedi: team-ml: Add alerting rule for high error rate in LW services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [09:37:35] (03PS2) 10David Caro: puppetserver::wmcs: add wmf_auto_restarts timer [puppet] - 10https://gerrit.wikimedia.org/r/1023386 [09:37:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023386 (owner: 10David Caro) [09:37:58] (03CR) 10David Caro: "Not needed then, nice" [puppet] - 10https://gerrit.wikimedia.org/r/1023383 (owner: 10David Caro) [09:38:10] (03Abandoned) 10David Caro: puppetserver::wmcs: enable auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/1023383 (owner: 10David Caro) [09:38:24] (03PS5) 10Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [09:38:48] (03CR) 10JMeybohm: [C:03+1] "OI see, thanks for the explanation. 
I'm just ignorant about those Cassandra suffixes and wanted to double check what makes the most sense " [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [09:39:16] (03PS1) 10Fabfur: nit: just a little clarification on comment [software] - 10https://gerrit.wikimedia.org/r/1023388 [09:39:30] (03CR) 10CI reject: [V:04-1] team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [09:39:51] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1220.eqiad.wmnet [09:40:30] (03CR) 10JMeybohm: [C:03+1] shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [09:40:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P61090 and previous config saved to /var/cache/conftool/dbconfig/20240423-094030-arnaudb.json [09:40:48] (03PS1) 10Muehlenhoff: Switch db1220 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023389 (https://phabricator.wikimedia.org/T349619) [09:41:22] (03CR) 10JMeybohm: [C:04-1] "This produces a negative FPM__process_control_timeout in CI" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [09:42:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 10%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61091 and previous config saved to /var/cache/conftool/dbconfig/20240423-094244-arnaudb.json [09:44:39] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1019253|logging: always register udp2log handlers (T228838)]] (duration: 15m 11s) [09:44:45] T228838: Consider enabling all MW log channels by default for 
WMF - https://phabricator.wikimedia.org/T228838 [09:45:16] (03CR) 10Alexandros Kosiaris: "If it's already a pattern, conformity is as good a reason as any." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023376 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:46:50] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9734707 (10Ladsgroup) nvm, it didn't recover [09:47:12] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9734696 (10Ladsgroup) It recovered on its own, might be a network issue. I will take a look. [09:49:55] (03CR) 10Clément Goubert: [C:03+2] trafficserver: move 75% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1021905 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [09:51:25] (03CR) 10WMDE-Fisch: [C:03+1] "Giving more of a team `+1` here. Would be great to have that." [puppet] - 10https://gerrit.wikimedia.org/r/1023085 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [09:51:31] (03CR) 10TChin: [C:03+1] datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [09:51:39] (03CR) 10TChin: [C:03+1] datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [09:52:04] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9734722 (10Clement_Goubert) [09:52:08] (03CR) 10David Caro: [C:03+2] puppetserver::wmcs: add wmf_auto_restarts timer [puppet] - 10https://gerrit.wikimedia.org/r/1023386 (owner: 10David Caro) [09:52:21] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:52:41] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:53:38] (03PS6) 10Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [09:53:48] (03CR) 10Brouberol: [C:03+2] datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [09:54:02] (03CR) 10Brouberol: [C:03+2] datasets-config: create public wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [09:54:18] (03CR) 10Btullis: [C:03+1] "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/1021380 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [09:54:57] (03CR) 10Btullis: [C:03+1] datasets-config: create private servcice record [dns] - 10https://gerrit.wikimedia.org/r/1021381 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [09:57:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 15%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61092 and previous config saved to /var/cache/conftool/dbconfig/20240423-095749-arnaudb.json [09:57:51] (03CR) 10Klausman: team-ml: Add alerting rule for high error rate in LW services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [09:58:27] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9734767 (10ABran-WMF) 05Open→03In progress it seems to be a hardware issue as well: ` ------------------------------------------------------------------------------- Record: 51 Date/Time: 04/23/2024 00:19:19 Sour... 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1000) [10:01:02] (03CR) 10Clément Goubert: build-bare-slim: Date tag wikimedia-buster images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:01:10] (03CR) 10Clément Goubert: [C:03+2] build-bare-slim: Date tag wikimedia-buster images [puppet] - 10https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:02:01] (03PS4) 10Brouberol: trafficserver: Add CDN config for datasets-config.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) [10:03:30] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9734800 (10Ladsgroup) oh thanks for looking into it. [10:05:31] (03CR) 10JMeybohm: [C:03+1] build-bare-slim: Date tag wikimedia-buster images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:06:33] (03PS49) 10Klausman: deployment_server: Add Cassandra to autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [10:06:40] (KubernetesRsyslogDown) firing: rsyslog on mw2425:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2425 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:07:29] (03CR) 10Slyngshede: [C:03+2] IP blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [10:09:01] (03CR) 10Muehlenhoff: [C:03+2] Switch db1220 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023389 (https://phabricator.wikimedia.org/T349619) (owner: 
10Muehlenhoff) [10:09:02] (03Merged) 10jenkins-bot: IP blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1018256 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [10:09:15] (03CR) 10Klausman: "Now that you mention it, name and instance are definitely more tightly related than either is to DC. From a natural language perspective, " [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:10:01] (03PS50) 10Klausman: deployment_server: Add Cassandra to autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [10:11:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2425:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2425 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:11:52] (03CR) 10JMeybohm: [C:03+1] deployment_server: Add Cassandra to autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:11:52] (03PS2) 10Slyngshede: API: Introduce settings parameter to enable API. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1021912 [10:12:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61093 and previous config saved to /var/cache/conftool/dbconfig/20240423-101255-arnaudb.json [10:13:10] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2418:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:13:35] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2078/co" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:15:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1220.eqiad.wmnet [10:17:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2196.codfw.wmnet [10:17:57] (03CR) 10Klausman: [V:03+1 C:03+2] deployment_server: Add Cassandra to autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:18:08] (03PS1) 10Muehlenhoff: Switch db2196 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023394 (https://phabricator.wikimedia.org/T349619) [10:18:10] (KubernetesRsyslogDown) resolved: (2) rsyslog on mw2418:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:18:32] (03PS2) 10Jcrespo: dbbackups: Add dbprov1005 to the hosts that can dump eqiad backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1021903 (https://phabricator.wikimedia.org/T362509) [10:22:34] !log btullis@deploy1002 Started deploy [analytics/hdfs-tools/deploy@3618aab]: (no justification provided) 
[10:22:45] !log btullis@deploy1002 Finished deploy [analytics/hdfs-tools/deploy@3618aab]: (no justification provided) (duration: 00m 11s) [10:25:43] (03CR) 10Muehlenhoff: [C:03+2] Switch db2196 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1023394 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:25:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1245.eqiad.wmnet with reason: T360116 [10:25:55] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116 [10:26:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1245.eqiad.wmnet with reason: T360116 [10:27:11] (03CR) 10Btullis: [C:03+2] Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [10:28:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61094 and previous config saved to /var/cache/conftool/dbconfig/20240423-102801-arnaudb.json [10:34:07] (03PS1) 10Clément Goubert: kubernetes: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1023397 (https://phabricator.wikimedia.org/T362323) [10:34:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2196.codfw.wmnet [10:37:00] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9734938 (10MoritzMuehlenhoff) [10:40:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:41:23] (03CR) 10JMeybohm: 
[C:03+1] kubernetes: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1023397 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:43:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61097 and previous config saved to /var/cache/conftool/dbconfig/20240423-104306-arnaudb.json [10:43:40] !log kubectl cordon parse1002.eqiad.wmnet - T363086 [10:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:44] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086 [10:44:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1021912 (owner: 10Slyngshede) [10:45:05] jouncebot: nowandnext [10:45:06] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1000) [10:45:06] In 1 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1200) [10:45:26] !log Depooling mw1414.eqiad.wmnet,mw1415.eqiad.wmnet,mw1416.eqiad.wmnet,mw1448.eqiad.wmnet,mw1449.eqiad.wmnet for reimage to kubernetes - T351074 [10:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:30] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:48:11] (03CR) 10Clément Goubert: [V:03+1 C:03+2] kubernetes: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1023397 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [10:53:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1414.eqiad.wmnet with OS bullseye [10:53:58] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1415.eqiad.wmnet with OS bullseye 
[10:54:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1416.eqiad.wmnet with OS bullseye [10:54:49] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1448.eqiad.wmnet with OS bullseye [10:55:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1449.eqiad.wmnet with OS bullseye [10:58:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61098 and previous config saved to /var/cache/conftool/dbconfig/20240423-105812-arnaudb.json [11:01:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:01:15] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:02:02] Appservers is me due to reimages, should be transient [11:03:52] (JobUnavailable) firing: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:06:34] (03CR) 10Arnaudb: [C:03+1] dbbackups: Add dbprov1005 to the hosts that can 
dump eqiad backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1021903 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [11:06:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1414.eqiad.wmnet with reason: host reimage [11:06:57] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1415.eqiad.wmnet with reason: host reimage [11:07:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1416.eqiad.wmnet with reason: host reimage [11:07:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1448.eqiad.wmnet with reason: host reimage [11:08:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1449.eqiad.wmnet with reason: host reimage [11:09:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:10:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1414.eqiad.wmnet with reason: host reimage [11:11:37] (03PS1) 10Muehlenhoff: Add an option to pass the Druid firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1023402 [11:11:53] (03CR) 10Filippo Giunchedi: team-ml: Add alerting rule for high error rate in LW services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [11:12:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1449.eqiad.wmnet with reason: host reimage [11:14:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:15:45] !log cgoubert@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1416.eqiad.wmnet with reason: host reimage [11:16:46] (03CR) 10Slyngshede: [C:03+2] API: Introduce settings parameter to enable API. [software/bitu] - 10https://gerrit.wikimedia.org/r/1021912 (owner: 10Slyngshede) [11:17:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [11:17:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1448.eqiad.wmnet with reason: host reimage [11:18:23] (03Merged) 10jenkins-bot: API: Introduce settings parameter to enable API. [software/bitu] - 10https://gerrit.wikimedia.org/r/1021912 (owner: 10Slyngshede) [11:21:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1415.eqiad.wmnet with reason: host reimage [11:21:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:21:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:23:45] (03CR) 10Jcrespo: [C:03+2] dbbackups: Add dbprov1005 to the hosts that can dump eqiad backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1021903 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [11:28:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1414.eqiad.wmnet with OS bullseye [11:28:56] (03PS7) 10Klausman: team-ml: Add alerting rule 
for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [11:30:02] (03PS8) 10Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [11:30:03] (03CR) 10CI reject: [V:04-1] team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [11:30:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1449.eqiad.wmnet with OS bullseye [11:31:14] (03PS2) 10Muehlenhoff: Add an option to pass the Druid firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1023402 [11:31:43] (03CR) 10Klausman: team-ml: Add alerting rule for high error rate in LW services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [11:32:21] (03PS9) 10Klausman: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) [11:33:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1416.eqiad.wmnet with OS bullseye [11:35:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1448.eqiad.wmnet with OS bullseye [11:35:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [11:38:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1415.eqiad.wmnet with OS bullseye [11:38:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host has hardware issues [11:39:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host has hardware issues [11:39:20] !log Running homer 'cr*eqiad*' commit 'T351074' [11:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:40] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:40:03] (03CR) 10Btullis: [C:03+2] Temporary monitoring for long-running analytics client job [puppet] - 10https://gerrit.wikimedia.org/r/1023085 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [11:41:10] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9735127 (10ABran-WMF) np, host [[ https://sal.toolforge.org/log/_2vACo8BGiVuUzOdfPP3 | has been downtimed ]] for 7 days! [11:43:03] (03PS3) 10Muehlenhoff: Add an option to pass the Druid firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1023402 [11:43:44] (03PS1) 10Awight: Revert "Temporary monitoring for long-running analytics client job" [puppet] - 10https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) [11:44:11] (03CR) 10Awight: [C:04-1] "Don't merge yet, this is a placeholder for when our work is complete." 
[puppet] - 10https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [11:47:26] !log Pooling and uncordoning mw1414.eqiad.wmnet,mw1415.eqiad.wmnet,mw1416.eqiad.wmnet,mw1448.eqiad.wmnet,mw1449.eqiad.wmnet - T351074 [11:47:30] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05): Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9735141 (10BTullis) a:03BTullis [11:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:31] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:47:38] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1414.eqiad.wmnet|mw1415.eqiad.wmnet|mw1416.eqiad.wmnet|mw1448.eqiad.wmnet|mw1449.eqiad.wmnet),cluster=kubernetes,service=kubesvc [11:48:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [11:52:27] (03PS4) 10Muehlenhoff: Add an option to pass the Druid firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1023402 [11:57:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [11:58:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [11:58:54] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 80% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023412 (https://phabricator.wikimedia.org/T362323) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1200) [12:00:07] (03PS1) 10Clément Goubert: trafficserver: move 80% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1023413 (https://phabricator.wikimedia.org/T362323) [12:02:52] (03PS1) 10Ilias Sarantopoulos: amdpytorch21: use 
bullseye as pytorch base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1023414 (https://phabricator.wikimedia.org/T362984) [12:02:56] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#9735205 (10ayounsi) Another question I think is "do we still have to go through text files ?" It made sens for back in the time when we were manually editing the configuration, and f... [12:10:56] jouncebot: nowandnext [12:10:56] For the next 0 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1200) [12:10:56] In 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1300) [12:11:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023046 (https://phabricator.wikimedia.org/T363057) (owner: 10Majavah) [12:12:53] (03Merged) 10jenkins-bot: Add cawiki 750k logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023046 (https://phabricator.wikimedia.org/T363057) (owner: 10Majavah) [12:13:07] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1023046|Add cawiki 750k logo (T363057)]] [12:13:14] T363057: Changing logos and tagline for the 750k article milestone in the Catalan Wikipedia - https://phabricator.wikimedia.org/T363057 [12:14:47] (03CR) 10Jgiannelos: "The rationale behind this patch is that after the upgrade to node18 the requests to internal mesh (eg. 
MW api) stopped working and I think" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [12:17:13] (03CR) 10Jgiannelos: "That said the proper fix shouldn't happen on the node DNS resolution level ideally, but that could be a way to test if that's the actual p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [12:17:15] !log taavi@deploy1002 taavi: Backport for [[gerrit:1023046|Add cawiki 750k logo (T363057)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:17:21] !log taavi@deploy1002 taavi: Continuing with sync [12:18:04] (03CR) 10JMeybohm: [C:03+1] trafficserver: move 80% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1023413 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [12:21:59] (03PS5) 10Klausman: ml-services: tweak reference to ML Cassandra clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021895 (https://phabricator.wikimedia.org/T360428) [12:22:57] (03CR) 10Klausman: [C:03+1] amdpytorch21: use bullseye as pytorch base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1023414 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [12:24:47] (03CR) 10Vgutierrez: [V:03+1 C:03+2] profile::benthos: Don't require kafka config [puppet] - 10https://gerrit.wikimedia.org/r/1021502 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [12:27:02] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05): Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9735349 (10BTullis) I am starting by looking at the relforge cluster. I see that the certificates are served by nginx and they are still using the puppet CA based cert... 
[12:28:32] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1023046|Add cawiki 750k logo (T363057)]] (duration: 15m 24s) [12:28:46] (03PS11) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [12:29:33] (03CR) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [12:34:11] (03PS12) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [12:34:52] (03CR) 10Michael Große: [C:03+1] "Looks good to me. This would mean in practice the same list of wikis for which `wmgUseGrowthExperiments` is `true`, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023101 (https://phabricator.wikimedia.org/T348086) (owner: 10Urbanecm) [12:36:33] (03PS1) 10JMeybohm: Revert: kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - 10https://gerrit.wikimedia.org/r/1023417 (https://phabricator.wikimedia.org/T357616) [12:37:05] (03PS2) 10JMeybohm: Revert: kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - 10https://gerrit.wikimedia.org/r/1023417 (https://phabricator.wikimedia.org/T357616) [12:37:26] 10ops-codfw, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T363120#9735389 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [12:39:51] (03CR) 10Brouberol: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [12:41:07] (03CR) 10Vgutierrez: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [12:41:46] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2081/c" [puppet] - 10https://gerrit.wikimedia.org/r/1023417 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [12:42:31] (03PS2) 10Hashar: Remove registerStyleModule() for Gerrit 3.8 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1002933 (https://phabricator.wikimedia.org/T354886) [12:42:50] (03CR) 10Brouberol: "The application is not yet deployed. This is foundation work preparing the deployment of a chart that will create this certificate. We can" [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [12:43:55] (03CR) 10Hashar: [C:03+2] "Gerrit 3.8 has removed the API:" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1002933 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [12:44:37] (03CR) 10Muehlenhoff: "I still need to read up on rspamd a bit more, but a few initial comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [12:44:44] (03Merged) 10jenkins-bot: Remove registerStyleModule() for Gerrit 3.8 [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1002933 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [12:45:12] !log hashar@deploy1002 Started deploy [gerrit/gerrit@ff51759]: Remove registerStyleModule() for Gerrit 3.8 - T354886 [12:45:16] (03PS1) 10Filippo Giunchedi: toil: add rsyslog_imfile_remedy [puppet] - 10https://gerrit.wikimedia.org/r/1023422 (https://phabricator.wikimedia.org/T357616) [12:45:19] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@ff51759]: Remove registerStyleModule() for Gerrit 3.8 - T354886 (duration: 00m 07s) [12:45:41] T354886: Upgrade to Gerrit 3.8 - https://phabricator.wikimedia.org/T354886 [12:47:03] (03CR) 10Ilias Sarantopoulos: "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1023414 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [12:50:54] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9735410 (10Trizek-WMF) Checking after #MoveComms-Support was added to this task: what kind of support do you need, if... [12:50:56] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05): Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9735411 (10MoritzMuehlenhoff) >>! In T360439#9735349, @BTullis wrote: > I'll check to see if there is any code ready to deploy cfssl based certificates for nginx. Joh... [12:53:27] (03CR) 10Muehlenhoff: "Did you mean "to make it clear we're enabling ingress traffic, and not egress"?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [12:54:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 depool', diff saved to https://phabricator.wikimedia.org/P61099 and previous config saved to /var/cache/conftool/dbconfig/20240423-125430-arnaudb.json [12:55:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db2155.codfw.wmnet with reason: Reimage db2155 [12:55:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2155.codfw.wmnet with reason: Reimage db2155 [12:55:27] (03PS1) 10Filippo Giunchedi: trafficserver: move prometheus-eqiad to prometheus1006 [puppet] - 10https://gerrit.wikimedia.org/r/1023423 (https://phabricator.wikimedia.org/T362990) [12:55:40] (KubernetesRsyslogDown) firing: rsyslog on parse2019:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2019 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:56:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Sanitarium master', diff saved to https://phabricator.wikimedia.org/P61100 and previous config saved to /var/cache/conftool/dbconfig/20240423-125622-arnaudb.json [12:57:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2147', diff saved to https://phabricator.wikimedia.org/P61101 and previous config saved to /var/cache/conftool/dbconfig/20240423-125703-arnaudb.json [12:57:12] (03PS1) 10Awight: Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) [12:57:19] (03CR) 10Elukey: "I am not 100% convinced that this is the right change, I would prefer to upgrade the rocm drivers instead. What do you think?" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1023414 (https://phabricator.wikimedia.org/T362984) (owner: 10Ilias Sarantopoulos) [12:57:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2147.codfw.wmnet with reason: T362746 [12:57:38] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [12:57:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2147.codfw.wmnet with reason: T362746 [12:57:50] (03CR) 10Muehlenhoff: "This type of query only works with full puppetdb access, we couldn't run it from the ganeti hosts. Some of those queries were moved direct" [puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [12:58:19] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1023423 (https://phabricator.wikimedia.org/T362990) (owner: 10Filippo Giunchedi) [12:58:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2147.codfw.wmnet with OS bookworm [12:59:20] (03PS2) 10Awight: Revert temporary monitoring for scraper [puppet] - 10https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) [12:59:44] (03CR) 10Filippo Giunchedi: [C:03+2] trafficserver: move prometheus-eqiad to prometheus1006 [puppet] - 10https://gerrit.wikimedia.org/r/1023423 (https://phabricator.wikimedia.org/T362990) (owner: 10Filippo Giunchedi) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:40] (KubernetesRsyslogDown) resolved: rsyslog on parse2019:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2019 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:01:05] yup, nothing to deploy [13:01:21] (03PS2) 10Elukey: role::aqs: complete the move of Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1020267 (https://phabricator.wikimedia.org/T352647) [13:02:05] (03PS1) 10Btullis: Switch relforge certificates from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) [13:03:36] (03PS2) 10Btullis: Switch relforge certificates from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) [13:04:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2083/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [13:05:04] (03CR) 10Filippo Giunchedi: [C:03+1] Revert: kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - 10https://gerrit.wikimedia.org/r/1023417 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [13:05:57] (03CR) 10Filippo Giunchedi: "LGTM, just a nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [13:07:29] (03CR) 10Vgutierrez: [C:04-1] "yep, this needs to be merged after the applayer is ready" [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [13:09:43] (03PS2) 10Awight: Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) [13:09:52] (03CR) 10CI reject: 
[V:04-1] Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [13:10:06] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2084/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023054 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:11:21] (03CR) 10Vgutierrez: [V:03+1 C:03+1] "LGTM, maybe rewrite commit message as mentioned inline" [puppet] - 10https://gerrit.wikimedia.org/r/1023054 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:11:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Sanitarium master', diff saved to https://phabricator.wikimedia.org/P61102 and previous config saved to /var/cache/conftool/dbconfig/20240423-131128-arnaudb.json [13:13:45] (03PS13) 10Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [13:14:29] (03CR) 10Filippo Giunchedi: [C:03+1] Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [13:14:38] (03PS3) 10Awight: Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) [13:14:40] (03CR) 10Ssingh: [C:03+2] wikimedia.org: add DKIM records for Mailchimp [dns] - 10https://gerrit.wikimedia.org/r/1022075 (https://phabricator.wikimedia.org/T362921) (owner: 10Ssingh) [13:14:58] (03PS2) 10Ssingh: wikimedia.org: add DKIM records for Mailchimp [dns] - 10https://gerrit.wikimedia.org/r/1022075 (https://phabricator.wikimedia.org/T362921) [13:15:15] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2085/console" [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:15:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2147.codfw.wmnet with reason: host reimage [13:16:34] (03CR) 10Filippo Giunchedi: [C:03+2] Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [13:16:36] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] Include job for scraper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [13:17:02] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1022075 (https://phabricator.wikimedia.org/T362921) (owner: 10Ssingh) [13:17:41] (03CR) 10Fabfur: [C:03+1] "ok for me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:17:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2147.codfw.wmnet with reason: host reimage [13:18:59] !log running authdns-update for T362921 [13:19:02] (03CR) 10Vgutierrez: [V:03+1 C:03+2] ncredir,benthos: Provide benthos support on ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:03] (03CR) 10Awight: Include job for scraper monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023424 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [13:19:15] T362921: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921 [13:19:19] (03PS4) 10Fabfur: hiera: add envvar for buffer limit [puppet] - 10https://gerrit.wikimedia.org/r/1023054 (https://phabricator.wikimedia.org/T358109) [13:19:19] (03PS2) 10Fabfur: hiera: buffer memory limit override for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1023060 (https://phabricator.wikimedia.org/T358109) [13:19:26] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice, LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [13:20:38] (03CR) 10Klausman: [C:03+2] team-ml: Add alerting rule for high error rate in LW services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1021417 (https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [13:20:44] (03PS1) 10Vgutierrez: hiera: Enable benthos on ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1023428 (https://phabricator.wikimedia.org/T362776) [13:21:45] (03Merged) 10jenkins-bot: team-ml: Add alerting rule for high error rate in LW services [alerts] - 10https://gerrit.wikimedia.org/r/1021417 
(https://phabricator.wikimedia.org/T362661) (owner: 10Klausman) [13:22:17] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1023428 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:24:16] (03CR) 10Elukey: [C:03+2] role::aqs: complete the move of Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1020267 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:24:18] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022029 (owner: 10PipelineBot) [13:24:33] (03CR) 10Fabfur: [C:03+1] "go for it!" [puppet] - 10https://gerrit.wikimedia.org/r/1023428 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:25:21] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022029 (owner: 10PipelineBot) [13:25:34] (03CR) 10Fabfur: hiera: add envvar for buffer limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023054 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:26:18] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable benthos on ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1023428 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:26:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Sanitarium master', diff saved to https://phabricator.wikimedia.org/P61103 and previous config saved to /var/cache/conftool/dbconfig/20240423-132633-arnaudb.json [13:29:32] (03CR) 10Elukey: [C:03+1] ml-services: tweak reference to ML Cassandra clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021895 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:34:53] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for 
nodes matching A:aqs-eqiad: Deploy new TLS Keystore - PKI - elukey@cumin1002 [13:35:18] !log installing glibc security updates [13:35:21] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:41] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:35:48] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:36:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 35.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:36:32] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:36:41] (03PS1) 10Vgutierrez: fifo_log_demux: Create fifo iff ensure = present [puppet] - 10https://gerrit.wikimedia.org/r/1023430 (https://phabricator.wikimedia.org/T362776) [13:37:08] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:38:05] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:38:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 952.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:38:46] (03PS1) 10Brouberol: idp_test: register the mpic_next service configuration [puppet] - 10https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) [13:38:52] (JobUnavailable) firing: (2) 
Reduced availability for job ncredir in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2147.codfw.wmnet with OS bookworm [13:40:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61105 and previous config saved to /var/cache/conftool/dbconfig/20240423-134034-arnaudb.json [13:41:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2140.codfw.wmnet with reason: T362746 [13:41:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2140.codfw.wmnet with reason: T362746 [13:41:51] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [13:43:30] (03PS2) 10Vgutierrez: fifo_log_demux: Create fifo iff ensure = present [puppet] - 10https://gerrit.wikimedia.org/r/1023430 (https://phabricator.wikimedia.org/T362776) [13:43:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2140.codfw.wmnet with OS bookworm [13:44:59] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1023430 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:47:59] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9735646 (10Jclark-ctr) a:03Jclark-ctr Opened ticket with Dell sr 189290647 [13:48:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 926.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:49:02] 10ops-eqiad, 06SRE: eqiad: magru transport down - https://phabricator.wikimedia.org/T363117#9735659 (10Jclark-ctr) installed loop facing Telxius [13:50:16] (03CR) 10Gmodena: Add datasets-config helm chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:51:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at eqiad: 38.1% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:52:49] (03PS4) 10Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [13:53:52] (ProbeDown) firing: (39) Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:34] (03PS5) 10Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [13:55:17] (03PS1) 10Majavah: wikitech: Also disable password changes when logged-in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023432 [13:55:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61106 and previous config saved to 
/var/cache/conftool/dbconfig/20240423-135540-arnaudb.json [13:56:25] 10ops-eqiad, 06SRE: eqiad: magru transport down - https://phabricator.wikimedia.org/T363117#9735691 (10Jclark-ctr) a:03Jclark-ctr [13:56:47] 10ops-eqiad, 06SRE, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9735689 (10Jclark-ctr) a:03Jclark-ctr You have successfully submitted request SR189292045. Ordered dimm from dell [13:58:22] jouncebot: nowandnext [13:58:22] For the next 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1300) [13:58:22] In 1 hour(s) and 1 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1500) [13:58:42] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1022532 (https://phabricator.wikimedia.org/T363093) (owner: 10Zabe) [13:59:28] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1022532 (https://phabricator.wikimedia.org/T363093) (owner: 10Zabe) [14:00:25] (ProbeDown) firing: (39) Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:46] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1022532|Update interwiki cache (T363093)]] [14:00:54] (03PS1) 10NMW03: Added namespace alias for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1022535 (https://phabricator.wikimedia.org/T362645) [14:01:20] T363093: Please run maintenance task "scap update-interwiki-cache" - https://phabricator.wikimedia.org/T363093 [14:01:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2140.codfw.wmnet with reason: host reimage [14:03:06] !log zabe@deploy1002 zabe: Backport for 
[[gerrit:1022532|Update interwiki cache (T363093)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:03:24] !log zabe@deploy1002 zabe: Continuing with sync [14:03:52] (ProbeDown) firing: (39) Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2140.codfw.wmnet with reason: host reimage [14:05:04] (03PS8) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [14:05:26] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9735726 (10Jclark-ctr) @akosiaris @hashar reset idrac with no change i will need to reboot server and hook crash cart up to it. Please advise if i am able to reboot. 
[14:06:11] (03CR) 10Ssingh: [C:03+1] fifo_log_demux: Create fifo iff ensure = present [puppet] - 10https://gerrit.wikimedia.org/r/1023430 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:06:32] (03CR) 10Vgutierrez: [V:03+1 C:03+2] fifo_log_demux: Create fifo iff ensure = present [puppet] - 10https://gerrit.wikimedia.org/r/1023430 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:08:52] (ProbeDown) firing: (39) Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:10] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir-ulsfo [14:10:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir-ulsfo [14:10:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61107 and previous config saved to /var/cache/conftool/dbconfig/20240423-141045-arnaudb.json [14:12:43] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9735760 (10akosiaris) >>! In T363086#9735726, @Jclark-ctr wrote: > @akosiaris @hashar reset idrac with no change i will need to reboot server and hook crash cart up to it. Please advise if i am able to r... 
[14:13:44] (03PS6) 10Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [14:13:55] !log upload prometheus-memcached-exporter_0.14.2-2~wmf1_amd64 to bookworm-wikimedia - T350807 [14:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:03] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir [14:14:17] T350807: Package latest version of prometheus-memcached-exporter (v0.14.2) - https://phabricator.wikimedia.org/T350807 [14:14:27] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1022532|Update interwiki cache (T363093)]] (duration: 13m 41s) [14:14:44] (03CR) 10CI reject: [V:04-1] kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [14:14:51] T363093: Please run maintenance task "scap update-interwiki-cache" - https://phabricator.wikimedia.org/T363093 [14:15:25] (ProbeDown) firing: (36) Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:52] (ProbeDown) firing: (34) Service aqs1013-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:13] (03CR) 10Fabfur: [C:03+2] hiera: add envvar for buffer limit [puppet] - 10https://gerrit.wikimedia.org/r/1023054 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:21:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir [14:23:52] 
(ProbeDown) firing: (32) Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [14:25:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2140.codfw.wmnet with OS bookworm [14:25:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61108 and previous config saved to /var/cache/conftool/dbconfig/20240423-142551-arnaudb.json [14:25:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2136.codfw.wmnet with reason: T362746 [14:26:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2136.codfw.wmnet with reason: T362746 [14:26:26] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [14:26:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2136', diff saved to https://phabricator.wikimedia.org/P61109 and previous config saved to /var/cache/conftool/dbconfig/20240423-142630-arnaudb.json [14:26:55] (03PS1) 10Zabe: Add Apache configuration for wikipedia-pl-sysop.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1023436 (https://phabricator.wikimedia.org/T361041) [14:27:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61110 and previous config saved to /var/cache/conftool/dbconfig/20240423-142723-arnaudb.json [14:29:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2136.codfw.wmnet with OS bookworm [14:30:25] (ProbeDown) firing: (28) Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [14:33:52] (ProbeDown) firing: (26) Service aqs1015-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:26] (03PS1) 10Ladsgroup: logos: Add fawiki logo for 1,000,000 article [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023439 (https://phabricator.wikimedia.org/T363165) [14:34:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:35:03] (03CR) 10CI reject: [V:04-1] logos: Add fawiki logo for 1,000,000 article [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023439 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:35:17] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['parse1002.eqiad.wmnet'] [14:35:22] (03PS1) 10Btullis: Replace tabs with 4 spaces in tlsproxy nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1023440 (https://phabricator.wikimedia.org/T360439) [14:35:28] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['parse1002.eqiad.wmnet'] [14:35:39] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts parse1002.eqiad.wmnet [14:35:40] (KubernetesRsyslogDown) firing: rsyslog on parse2015:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2015 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:36:06] (03PS1) 10Hashar: logging: do not explicitly set blackhole handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) [14:36:43] (03CR) 10CI reject: [V:04-1] logging: do not explicitly set blackhole handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [14:37:00] (03PS2) 10Ladsgroup: logos: Add fawiki logo for 1,000,000 article [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023439 (https://phabricator.wikimedia.org/T363165) [14:37:45] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921#9735864 (10EdErhart-WMF) The domain has been authenticated on MailChimp's side. Thanks, @ssingh! [14:38:01] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921#9735865 (10EdErhart-WMF) 05Open→03Resolved [14:38:08] jouncebot: nowandnext [14:38:09] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [14:38:09] In 0 hour(s) and 21 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1500) [14:38:19] (03CR) 10Ladsgroup: [C:03+2] logos: Add fawiki logo for 1,000,000 article [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023439 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:38:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [14:38:32] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1023440 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [14:38:52] (ProbeDown) firing: 
(24) Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:52] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023439 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:39:14] (03Merged) 10jenkins-bot: logos: Add fawiki logo for 1,000,000 article [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023439 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:39:28] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:39:29] (03CR) 10Brennen Bearnes: [C:03+2] Update the PHP files Phabricator reads to show the latest translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [14:39:32] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1023439|logos: Add fawiki logo for 1,000,000 article (T363165)]] [14:39:50] T363165: Changing the logo of Persian Wikipedia on the occasion of one million articles - https://phabricator.wikimedia.org/T363165 [14:40:02] (03PS2) 10Hashar: logging: do not explicitly set blackhole handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) [14:40:15] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Update the PHP files Phabricator reads to show the latest 
translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [14:40:25] (ProbeDown) firing: (24) Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61111 and previous config saved to /var/cache/conftool/dbconfig/20240423-144057-arnaudb.json [14:42:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T352010)', diff saved to https://phabricator.wikimedia.org/P61112 and previous config saved to /var/cache/conftool/dbconfig/20240423-144202-ladsgroup.json [14:42:14] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1023439|logos: Add fawiki logo for 1,000,000 article (T363165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:42:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:42:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61113 and previous config saved to /var/cache/conftool/dbconfig/20240423-144229-arnaudb.json [14:44:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org [14:44:58] !log depool ncredir6001 [14:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:28] (03PS3) 10Fabfur: hiera: buffer memory limit override for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1023060 (https://phabricator.wikimedia.org/T358109) [14:45:58] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - 
https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:46:29] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [14:47:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts parse1002.eqiad.wmnet [14:47:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2136.codfw.wmnet with reason: host reimage [14:47:25] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9735882 (10herron) Prometheus1005 is down and depooled, any time works! [14:48:09] (03PS4) 10Fabfur: hiera: buffer memory limit increase for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1023060 (https://phabricator.wikimedia.org/T358109) [14:48:36] (03PS1) 10Ladsgroup: logos: Add the override for 1M variant of fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023445 (https://phabricator.wikimedia.org/T363165) [14:48:52] (ProbeDown) firing: (18) Service aqs1017-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:16] (03CR) 10CI reject: [V:04-1] logos: Add the override for 1M variant of fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023445 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:49:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2136.codfw.wmnet with reason: host reimage [14:50:36] (03PS2) 10Ladsgroup: logos: Add the override for 1M variant of fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023445 (https://phabricator.wikimedia.org/T363165) [14:50:40] (KubernetesRsyslogDown) resolved: rsyslog on parse2015:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2015 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:50:49] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9735903 (10Marostegui) @Jclark-ctr this is the third time this host crashes with the same exact HW error see: T361968 T359940 - hopefully Dell won't ask us again to upgrade firmware and BIOS and instead replace whatever pie... [14:50:58] (KeyholderUnarmed) resolved: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:51:58] (03CR) 10Ladsgroup: [C:03+2] logos: Add the override for 1M variant of fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023445 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:52:38] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9735925 (10Jclark-ctr) Server is out of warranty preformed reboot came up with no issues, Swapped idrac cable and updated idrac firmware. seems to be up and running now. 
@akosiaris [14:52:52] (03Merged) 10jenkins-bot: logos: Add the override for 1M variant of fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023445 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [14:53:02] (03CR) 10Bking: sre.hosts.decommission: ask on failure (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: 10Volans) [14:53:45] !log jmm@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling restart_daemons on A:durum-drmrs [14:53:52] (ProbeDown) firing: (16) Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:08] (03PS1) 10Muehlenhoff: sre.dns-roll-restart-reboot-durum: Also allow 'durum' meta alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1023447 [14:55:25] (ProbeDown) firing: (16) Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:26] (03PS3) 10Aklapper: Replace a strlen(null) call for PHP 8.1 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1020170 (https://phabricator.wikimedia.org/T342244) [14:55:28] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1023060 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:55:51] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023447 (owner: 10Muehlenhoff) [14:56:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61114 and 
previous config saved to /var/cache/conftool/dbconfig/20240423-145603-arnaudb.json [14:57:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P61115 and previous config saved to /var/cache/conftool/dbconfig/20240423-145709-ladsgroup.json [14:57:10] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1023439|logos: Add fawiki logo for 1,000,000 article (T363165)]] (duration: 17m 38s) [14:57:22] (03CR) 10Ssingh: [C:03+1] "Thanks for fixing!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023447 (owner: 10Muehlenhoff) [14:57:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61116 and previous config saved to /var/cache/conftool/dbconfig/20240423-145734-arnaudb.json [14:57:37] T363165: Changing the logo of Persian Wikipedia on the occasion of one million articles - https://phabricator.wikimedia.org/T363165 [14:58:03] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1023445|logos: Add the override for 1M variant of fawiki (T363165)]] [14:58:46] (03CR) 10Gmodena: [C:03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:58:52] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:31] !log repool ncredir6001 [14:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1500). 
[15:00:25] (ProbeDown) firing: (12) Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:19] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1023445|logos: Add the override for 1M variant of fawiki (T363165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:01:21] (03CR) 10Muehlenhoff: [C:03+2] sre.dns-roll-restart-reboot-durum: Also allow 'durum' meta alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1023447 (owner: 10Muehlenhoff) [15:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:01:39] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [15:02:43] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9735957 (10BTullis) I have a whitespace-only change in the nginx configuration for tlsproxy here: https://gerrit.wikimedia.org/r/1023440 It looks... [15:03:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling restart_daemons on A:durum-drmrs [15:03:52] (ProbeDown) resolved: (12) Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:57] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9735965 (10akosiaris) Cool. Thanks. I 've just uncordoned it, it should receive mediawiki payloads in the next deployment. 
I 've also checked and it's again a scap target for kubernetes-workers group. [15:05:02] (03CR) 10Vgutierrez: [C:03+1] hiera: buffer memory limit increase for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1023060 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:05:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Deploy new TLS Keystore - PKI - elukey@cumin1002 [15:05:22] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9735970 (10akosiaris) 05Open→03Resolved a:03akosiaris I am resolving, hopefully we won't see a recurrence. [15:06:49] (03CR) 10Eevans: [C:03+1] role::restbase::production: change Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1021915 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:07:56] /15 [15:07:59] err [15:08:12] !log jmm@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling restart_daemons on A:durum [15:08:49] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:12] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2136.codfw.wmnet with OS bookworm [15:11:28] (03PS1) 10Muehlenhoff: Add Cumin alias for apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1023450 [15:11:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61117 and previous config saved to /var/cache/conftool/dbconfig/20240423-151140-arnaudb.json [15:12:14] (03PS1) 10Ladsgroup: logos: revert back the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023452 
(https://phabricator.wikimedia.org/T363165) [15:12:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P61118 and previous config saved to /var/cache/conftool/dbconfig/20240423-151216-ladsgroup.json [15:12:31] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1023445|logos: Add the override for 1M variant of fawiki (T363165)]] (duration: 14m 28s) [15:12:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: post upgrade repool', diff saved to https://phabricator.wikimedia.org/P61119 and previous config saved to /var/cache/conftool/dbconfig/20240423-151240-arnaudb.json [15:12:56] T363165: Changing the logo of Persian Wikipedia on the occasion of one million articles - https://phabricator.wikimedia.org/T363165 [15:12:58] (03PS1) 10Elukey: role::aqs: remove old settings not used anymore after the move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1023453 (https://phabricator.wikimedia.org/T352647) [15:13:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023452 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [15:13:45] (03Merged) 10jenkins-bot: logos: revert back the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023452 (https://phabricator.wikimedia.org/T363165) (owner: 10Ladsgroup) [15:13:49] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:13:59] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1023452|logos: revert back the tagline (T363165)]] [15:14:24] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2091/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023453 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:16:26] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1023452|logos: revert back the tagline (T363165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:16:41] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [15:18:49] (03PS1) 10Elukey: Deploy the Java Truststore with PKI Root CA on Hadoop and Stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/1023454 (https://phabricator.wikimedia.org/T362181) [15:19:14] (03CR) 10Brouberol: [C:03+1] "Looks good, and thanks again for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1023453 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:19:36] !log restarting FPM on phab1004 [15:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=1) rolling restart_daemons on A:durum [15:20:44] (03CR) 10Klausman: [C:03+2] ml-services: tweak reference to ML Cassandra clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021895 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:21:39] (03Merged) 10jenkins-bot: ml-services: tweak reference to ML Cassandra clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021895 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:21:59] (03PS2) 10Elukey: Deploy the Java Truststore with PKI Root CA on Stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/1023454 (https://phabricator.wikimedia.org/T362181) [15:26:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61120 and previous config saved to 
/var/cache/conftool/dbconfig/20240423-152646-arnaudb.json [15:27:05] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:27:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T352010)', diff saved to https://phabricator.wikimedia.org/P61121 and previous config saved to /var/cache/conftool/dbconfig/20240423-152725-ladsgroup.json [15:27:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [15:27:29] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1023452|logos: revert back the tagline (T363165)]] (duration: 13m 30s) [15:27:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [15:27:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:28:13] T363165: Changing the logo of Persian Wikipedia on the occasion of one million articles - https://phabricator.wikimedia.org/T363165 [15:28:35] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1023450 (owner: 10Muehlenhoff) [15:30:00] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:30:22] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:33:15] (03CR) 10Pppery: wikitech: Also disable password changes when logged-in (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023432 (owner: 10Majavah) [15:36:38] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023440 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [15:36:49] (03CR) 10Majavah: wikitech: Also disable password changes when logged-in (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023432 (owner: 10Majavah) [15:39:10] (03CR) 10Ryan Kemper: [C:03+1] "Change itself looks good; we'll need to coordinate to deploy this on a single elastic host with puppet disabled elsewhere to verify things" [puppet] - 10https://gerrit.wikimedia.org/r/1023440 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [15:41:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61122 and previous config saved to /var/cache/conftool/dbconfig/20240423-154152-arnaudb.json [15:41:53] (03PS1) 10AikoChou: ml-services: update revertrisk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023460 [15:42:26] (03PS4) 10SBassett: Implement security.txt standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [15:43:27] (03PS3) 10Pppery: Phabricator: Delete chatlog group [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) [15:43:56] (03CR) 10SBassett: "Ok, PS4 removes the root locations and adds the .well-known ones, per the RFC." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [15:44:29] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: T363174 [15:44:43] T363174: Deploy Phabricator/Phorge 2024-04-23 - https://phabricator.wikimedia.org/T363174 [15:44:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: T363174 [15:44:49] (03CR) 10Pppery: [C:04-1] "Poking at this more for now" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [15:45:01] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: T363174 [15:45:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: T363174 [15:45:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phabricator.wikimedia.org with reason: T363174 [15:45:35] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phabricator.wikimedia.org with reason: T363174 [15:46:17] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: T363174 [15:46:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab.wmfusercontent.org with reason: T363174 [15:48:04] !log brennen@deploy1002 Started deploy [phabricator/deployment@12abb76]: test deploy phab2002 for T363174 [15:48:36] !log brennen@deploy1002 Finished deploy [phabricator/deployment@12abb76]: test deploy phab2002 for T363174 (duration: 00m 32s) [15:49:02] !log brennen@deploy1002 Started deploy [phabricator/deployment@12abb76]: deploy phab1004 for T363174 [15:49:34] !log brennen@deploy1002 Finished deploy [phabricator/deployment@12abb76]: deploy 
phab1004 for T363174 (duration: 00m 32s) [15:49:53] T363174: Deploy Phabricator/Phorge 2024-04-23 - https://phabricator.wikimedia.org/T363174 [15:51:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:52:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:53:52] (JobUnavailable) resolved: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:54:01] (03PS9) 10Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) [15:54:05] (03PS8) 10Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) [15:54:29] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update revertrisk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023460 (owner: 10AikoChou) [15:54:47] (03PS9) 10Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) [15:56:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61123 and previous config saved to /var/cache/conftool/dbconfig/20240423-155657-arnaudb.json [15:58:29] (03PS10) 10Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) [16:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. 
Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1600). [16:00:05] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:25] o/ [16:04:28] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9736290 (10BCornwall) 05Open→03In progress a:03Lina_Farid_WMDE [16:09:02] (03PS2) 10AikoChou: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023460 [16:12:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61124 and previous config saved to /var/cache/conftool/dbconfig/20240423-161204-arnaudb.json [16:13:08] !log Backing up LibreNMS DB - T363141 [16:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:23] (03CR) 10AikoChou: [C:03+2] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023460 (owner: 10AikoChou) [16:17:27] (03Merged) 10jenkins-bot: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023460 (owner: 10AikoChou) [16:17:34] (03CR) 10Eevans: [C:03+1] role::aqs: remove old settings not used anymore after the move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1023453 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:25:41] (03PS1) 10Htriedman: T354456: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023465 [16:26:25] (03CR) 10Htriedman: "updating page redaction list" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023465 (owner: 10Htriedman) [16:26:31] zabe: hi - i can deploy your patches if 
jhathaway or rzl are not around [16:26:33] (03CR) 10CI reject: [V:04-1] T354456: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023465 (owner: 10Htriedman) [16:26:59] sorry, missed this I was in a meeting, happy to as well taavi & zabe [16:27:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61125 and previous config saved to /var/cache/conftool/dbconfig/20240423-162709-arnaudb.json [16:27:17] just give me a moment to check whether mediawiki::sites also applies to mw-on-k8s or the bare metal hosts only [16:28:10] alright:) [16:28:41] so profile::kubernetes::deployment_server::mediawiki::config seems to use it, so we need to do a mw-on-k8s deploy after merging [16:29:14] (03PS2) 10Zabe: Add Apache configuration for wikipedia-pl-sysop.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1023436 (https://phabricator.wikimedia.org/T361041) [16:30:19] !log disable puppet on P:mediawiki::webserver to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020920 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023436 [16:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:15] (03CR) 10Majavah: [C:03+2] Add Apache configuration for ae.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020920 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [16:31:19] (03CR) 10Majavah: [C:03+2] Add Apache configuration for wikipedia-pl-sysop.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1023436 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [16:32:21] thanks taavi :) [16:32:48] running puppet on mwdebug servers [16:33:26] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1023453 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:34:47] the mwdebug servers seem to be fine, so re-enabling puppet everywhere [16:36:06] (03PS1) 10Andrew Bogott: wmcs VM backups: move all backups to one host, cloudbackup1004 [puppet] - 10https://gerrit.wikimedia.org/r/1023466 (https://phabricator.wikimedia.org/T332400) [16:36:07] (03PS1) 10Andrew Bogott: wmcs VM backups: move all backups to one host, cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/1023467 (https://phabricator.wikimedia.org/T332400) [16:36:09] (03PS1) 10Andrew Bogott: Revert "wmcs VM backups: move all backups to one host, cloudbackup1003" Revert "wmcs VM backups: move all backups to one host, cloudbackup1004" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) [16:36:49] zabe: do you want to do an empty scap deployment to pick up the vhost changes on mw-on-k8s (https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#No_image_build_deployment_(helmfile_only)) or should I? [16:37:16] I can do it [16:37:38] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:38:02] thanks! just ran puppet on deploy1002 to update the values [16:39:36] !log zabe@deploy1002 Started scap: T361041 T362529 [16:39:52] Thanks! 
[16:39:58] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [16:39:58] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [16:40:40] (03CR) 10CI reject: [V:04-1] Revert "wmcs VM backups: move all backups to one host, cloudbackup1003" Revert "wmcs VM backups: move all backups to one host, cloudbackup1004" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) (owner: 10Andrew Bogott) [16:41:02] (03PS3) 10Btullis: Switch relforge certificates from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) [16:41:02] (03PS1) 10Btullis: Add server aliases to the cirrus/cfssl proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) [16:41:11] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[16:41:31] !log Upgrading LibreNMS in production - T363141 [16:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:11] (03PS7) 10Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [16:42:30] !log denisse@deploy1002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.4.0 - T363141 [16:42:42] !log denisse@deploy1002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.4.0 - T363141 (duration: 00m 12s) [16:42:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2093/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [16:42:50] (03CR) 10Andrew Bogott: [C:03+2] wmcs VM backups: move all backups to one host, cloudbackup1004 [puppet] - 10https://gerrit.wikimedia.org/r/1023466 (https://phabricator.wikimedia.org/T332400) (owner: 10Andrew Bogott) [16:44:47] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2094/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [16:45:33] (03PS2) 10Andrew Bogott: Revert "wmcs VM backups: move all backups to one host" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) [16:46:04] !log zabe@deploy1002 Finished scap: T361041 T362529 (duration: 06m 28s) [16:46:23] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [16:46:24] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [16:54:59] (03PS1) 10Zabe: Initial configuration for 
bewwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023471 (https://phabricator.wikimedia.org/T357866) [16:56:40] (KubernetesRsyslogDown) firing: rsyslog on parse2014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2014 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1700) [17:01:40] (KubernetesRsyslogDown) resolved: rsyslog on parse2014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2014 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:09:17] !log bking@mw1461 "restart rsyslog to reclaim fds T357616" [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:36] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [17:10:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2420:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:11:35] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9736729 (10herron) 05Open→03Resolved Thanks! Looks good! JFTR I made a backup of the ipmi sel in /root/ipmi-sel.log-20240423 and then cleared the sel for a clean sla... 
[17:15:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:40] (KubernetesRsyslogDown) firing: (2) rsyslog on mw2420:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:15:55] (03PS1) 10Herron: Revert "promote prometheus1006 as pushgateway primary" [dns] - 10https://gerrit.wikimedia.org/r/1023154 [17:16:06] (03PS1) 10Herron: Revert "prometheus: promote prometheus1006 to pushgateway duty" [puppet] - 10https://gerrit.wikimedia.org/r/1023155 [17:24:25] 10ops-magru, 13Patch-For-Review: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#9736772 (10BCornwall) p:05Triage→03Medium [17:24:57] (ProbeDown) firing: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-web:4450 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 3.859% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:26:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [17:29:57] (ProbeDown) 
resolved: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-web:4450 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 3.859% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:31:23] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1023154 (owner: 10Herron) [17:31:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [17:31:51] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1023155 (owner: 10Herron) [17:34:27] (03PS8) 10Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [17:42:13] (03PS1) 10Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) [18:00:04] brennen and dancy: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T1800). 
[18:00:09] o/ [18:02:41] !log add backup user to db1208 T349397 [18:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:53] jouncebot: refresh [18:02:54] I refreshed my knowledge about deployments. [18:03:03] T349397: Migrate the matomo host to bookworm - https://phabricator.wikimedia.org/T349397 [18:03:06] !log db1208 aka matomo db (data engineering) [18:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:41] !log train 1.43.0-wmf.2 (T361396) status: no current blockers, rolling to group0 [18:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:53] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023477 (https://phabricator.wikimedia.org/T361396) [18:04:57] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023477 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:04:58] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [18:05:36] (03CR) 10Dwisehaupt: "@dzahn@wikimedia.org I have tested over tunnels and things are working as expected. I think this can be merged and pushed at your convenie" [puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:05:46] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023477 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:09:25] o/ [18:10:12] * dancy browses logs. [18:10:43] hrm: https://phabricator.wikimedia.org/P61126 [18:10:58] no /srv/mediawiki-staging/php-1.43.0-wmf.2 [18:11:43] huh. I saw that train presync ran.. lemme look at the email again [18:12:09] ah, failed in `scap clean` phase, which has been moved to the front. 
[18:12:20] so, simplest is to run `scap stage-train` [18:12:35] and come back later and redo `scap train` [18:13:03] i note `scap train` thinks we're at group0 [18:13:32] ah, because it updated wikiversions already. [18:13:44] will that collide with `stage-train` at all? [18:14:53] Hm. OK let's do `scap train`, selecting testwikis as the target station. [18:15:19] * dancy ponders. [18:15:45] righto [18:15:58] I'm around if it goes weird. [18:16:04] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023478 (https://phabricator.wikimedia.org/T361396) [18:16:07] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023478 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:16:52] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023478 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:17:47] It'll take a while. Normally just under an hour for the whole stage-train. [18:18:02] ::nod:: [18:18:34] !log brennen@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.2 refs T361396 [18:18:51] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [18:21:00] Hmm. Who do we nag about https://phabricator.wikimedia.org/T362814 [18:22:00] (03CR) 10Dzahn: stewards: create a local git repo for user db data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [18:22:15] (03PS2) 10Dzahn: stewards: create a local git repo for user db data [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) [18:24:29] (03CR) 10Urbanecm: [C:03+1] "let's try. 
I am unsure whether the chmod would apply before or after running git init, but I guess there is only one way to find out" [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [18:24:49] (03PS1) 10Herron: Revert "trafficserver: move prometheus-eqiad to prometheus1006" [puppet] - 10https://gerrit.wikimedia.org/r/1023156 [18:25:45] (03PS1) 10Andrew Bogott: eqiad1 openstack -> version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1023480 (https://phabricator.wikimedia.org/T356287) [18:30:24] (03CR) 10Majavah: stewards: create a local git repo for user db data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [18:34:21] (03PS3) 10Dzahn: stewards: create a local git repo for user db data [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) [18:35:17] (03CR) 10Dzahn: stewards: create a local git repo for user db data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [18:37:19] (03CR) 10Dzahn: "2fa/MFA is enforced now since https://phabricator.wikimedia.org/T361277" [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (owner: 10Aklapper) [18:38:03] (03PS2) 10Dzahn: phabricator: MFA status check: Exclude bot accounts [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (https://phabricator.wikimedia.org/T361277) (owner: 10Aklapper) [18:38:50] (03CR) 10Dzahn: "sorry, gitlab and phab mixed up of course" [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (https://phabricator.wikimedia.org/T361277) (owner: 10Aklapper) [18:39:03] (03PS3) 10Dzahn: phabricator: MFA status check: Exclude bot accounts [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (owner: 10Aklapper) [18:39:16] (03CR) 10Dzahn: [C:03+2] phabricator: MFA status check: Exclude bot accounts [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (owner: 
10Aklapper) [18:39:17] (03CR) 10Dzahn: [V:03+2 C:03+2] phabricator: MFA status check: Exclude bot accounts [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (owner: 10Aklapper) [18:40:42] dancy: good question. this feels like something that ought to get surfaced to user rather than throwing an exception? [18:41:02] ...but also like the kind of thing that happens seldom enough to not get much attention [18:41:46] (03PS1) 10Jdlrobson: Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023157 (https://phabricator.wikimedia.org/T362986) [18:43:27] (03PS1) 10Jdlrobson: Enable night mode styles on Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023482 (https://phabricator.wikimedia.org/T362726) [18:44:15] nod. It's been persistent lately. [18:47:56] (03CR) 10Dzahn: [V:03+2 C:03+2] "you should have a test mail (i did get my own)" [puppet] - 10https://gerrit.wikimedia.org/r/1012350 (owner: 10Aklapper) [18:50:22] (03Abandoned) 10Dzahn: codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (https://phabricator.wikimedia.org/T362518) (owner: 10Dzahn) [18:55:36] (03PS42) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [18:55:40] (KubernetesRsyslogDown) resolved: rsyslog on parse2014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2014 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:56:43] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [18:57:26] (03PS1) 10Jdlrobson: Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.2) - 
10https://gerrit.wikimedia.org/r/1023158 (https://phabricator.wikimedia.org/T362986) [19:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:02:27] (03CR) 10CI reject: [V:04-1] Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023157 (https://phabricator.wikimedia.org/T362986) (owner: 10Jdlrobson) [19:02:50] (03PS43) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [19:08:17] (03CR) 10Ryan Kemper: Add Flink alerts for Cirrus Streaming Updater (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:08:27] (03CR) 10Ryan Kemper: [C:03+1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:08:31] (03PS1) 10JHathaway: WIP: add function to generate ganeti known hosts [puppet] - 10https://gerrit.wikimedia.org/r/1023486 (https://phabricator.wikimedia.org/T309724) [19:08:55] (03CR) 10Bking: [C:03+2] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:10:32] (03CR) 10JHathaway: "I might be missing something as well, I was thinking you could do something akin to our current known hosts function, something like this," [puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [19:15:24] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.2 refs T361396 (duration: 56m 50s) [19:15:49] T361396: 1.43.0-wmf.2 deployment blockers - 
https://phabricator.wikimedia.org/T361396 [19:16:24] k, now going to group0. [19:16:51] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023488 (https://phabricator.wikimedia.org/T361396) [19:16:53] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023488 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [19:17:36] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023488 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [19:20:06] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9737291 (10Eevans) >>! In T362033#9707550, @Eevans wrote: >>>! In T362033#9700949, @Jclark-ctr wrote: >> @Eevans Hey looks like same drive as T354499 is failed again let me know if i can replace it again... [19:21:18] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9737308 (10Eevans) a:03Jclark-ctr Hey @Jclark-ctr: I hope it's OK to assign this one to you as well. 
[19:21:53] 10ops-codfw, 06DC-Ops, 06serviceops: Q#:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9737312 (10RobH) [19:22:10] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9737314 (10RobH) [19:22:42] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9737317 (10RobH) [19:29:14] (03PS8) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [19:29:42] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9737354 (10Andrew) a:05Andrew→03dcaro [19:30:47] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212 (10RobH) 03NEW [19:31:16] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.2 refs T361396 [19:31:19] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9737376 (10RobH) [19:31:21] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [19:37:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:42:25] (03CR) 10Santiago Faci: "Looks good but the 'id' is repeated. 
I'm not sure if it's ok" [puppet] - 10https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) (owner: 10Brouberol) [19:42:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2010 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:53:12] (03PS1) 10Brennen Bearnes: Add afl_var_dump to AbuseLogPager::getQueryInfo [extensions/AbuseFilter] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023159 (https://phabricator.wikimedia.org/T363213) [19:57:46] 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9737473 (10BCornwall) p:05Triage→03Low [19:59:54] 10SRE-tools: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943#9737477 (10BCornwall) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T2000). [20:00:05] NMW03 and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:59] o/ [20:02:06] 10SRE-tools: Create a spicerack cookbook for restoring an etcd cluster from backups - https://phabricator.wikimedia.org/T203944#9737486 (10BCornwall) [20:02:25] (03CR) 10CDanis: [C:03+1] jaeger: upgrade to 1.56 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023381 (https://phabricator.wikimedia.org/T362719) (owner: 10Filippo Giunchedi) [20:02:33] 10SRE-tools: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948#9737481 (10BCornwall) 05Open→03Declined Declining due to inactivity. 
Do please re-open if/when the need arises to change from Puppet to a cookbook. [20:02:59] I can deploy if no one else is around [20:03:10] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9737493 (10BCornwall) Which teams should be added? [20:03:12] Jdlrobson: is it ok to deploy all your changes together? [20:03:18] zabe: yep no problem [20:03:53] Jdlrobson: is the test failure for the wmf.1 patch a random failure? [20:04:05] (03CR) 10Zabe: [C:03+2] Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023158 (https://phabricator.wikimedia.org/T362986) (owner: 10Jdlrobson) [20:04:19] looking.. [20:04:36] (03CR) 10Zabe: [C:03+2] Enable night mode styles on Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023482 (https://phabricator.wikimedia.org/T362726) (owner: 10Jdlrobson) [20:04:43] zabe: not random.. looks like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/1020309 [20:05:02] we probably need to backport that too :-( [20:05:18] (03PS1) 10Jdlrobson: .nvmrc: Update version from 18.17.0 to 18.20.2 [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023160 [20:05:21] 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9737513 (10BCornwall) 05Open→03Stalled What should we do with this ticket? If SRE is unable to administer postmaster tools and ITS is suggested as the ones to co... 
[20:05:25] (03Merged) 10jenkins-bot: Enable night mode styles on Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023482 (https://phabricator.wikimedia.org/T362726) (owner: 10Jdlrobson) [20:05:46] (03PS2) 10Jdlrobson: Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023157 (https://phabricator.wikimedia.org/T362986) [20:06:44] (03PS3) 10Jdlrobson: Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023157 (https://phabricator.wikimedia.org/T362986) [20:06:51] (03CR) 10Zabe: [C:03+2] .nvmrc: Update version from 18.17.0 to 18.20.2 [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023160 (owner: 10Jdlrobson) [20:06:52] (03CR) 10Zabe: [C:03+2] Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023157 (https://phabricator.wikimedia.org/T362986) (owner: 10Jdlrobson) [20:06:58] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9737519 (10BCornwall) @Lina_Farid_WMDE have you been able to get that signed? [20:07:09] ok then lets see whether that passes ci [20:08:16] an lets start with your config change, the other stuff needs ~20 min for ci anyway [20:09:23] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1023482|Enable night mode styles on Vector 2022 skin (T362726)]] [20:09:41] 👍 [20:09:42] T362726: [config] Enable night mode styles on Vector 2022 skin - https://phabricator.wikimedia.org/T362726 [20:10:16] NMW03: around? [20:10:23] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9737528 (10KFrancis) @BCornwall hello! I'm still waiting on the signer's email address to put together the NDA agreement. 
[20:11:03] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968#9737530 (10BCornwall) 05Stalled→03Invalid Closing this until the original poster can get this done. [20:11:07] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968#9737533 (10BCornwall) 05Invalid→03Declined [20:11:28] 06SRE, 10MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9737535 (10Peachey88) >>! In T360902#9737489, @BCornwall wrote: > Which teams should be added? relevant subtasks should be created for the various system/tool... [20:12:02] !log zabe@deploy1002 jdlrobson and zabe: Backport for [[gerrit:1023482|Enable night mode styles on Vector 2022 skin (T362726)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:42] Jdlrobson: can you test? [20:16:25] zabe: on it [20:16:27] 06SRE: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9737563 (10taavi) This task does not seem to have anything to do with email handling internals in MediaWiki? 
[20:16:34] alright:) [20:20:13] (03Merged) 10jenkins-bot: Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023158 (https://phabricator.wikimedia.org/T362986) (owner: 10Jdlrobson) [20:20:24] zabe: need a bit more time on this one [20:24:05] zabe: okay please sync [20:24:13] !log zabe@deploy1002 jdlrobson and zabe: Continuing with sync [20:25:50] 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9737640 (10AKanji-WMF) Hi All, as per @akosiaris suggestion (and confirmed by @jhathaway) I reached out to ITS - @NMariano-WMF has just confirmed the above email add... [20:27:55] (03Merged) 10jenkins-bot: .nvmrc: Update version from 18.17.0 to 18.20.2 [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023160 (owner: 10Jdlrobson) [20:27:58] (03Merged) 10jenkins-bot: Use dedicated Codex style modules [extensions/MobileFrontend] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023157 (https://phabricator.wikimedia.org/T362986) (owner: 10Jdlrobson) [20:29:34] (03PS2) 10Zabe: Add afl_var_dump to AbuseLogPager::getQueryInfo [extensions/AbuseFilter] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023159 (https://phabricator.wikimedia.org/T363213) (owner: 10Brennen Bearnes) [20:31:26] (03CR) 10Dzahn: [C:03+2] stewards: create a local git repo for user db data [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [20:35:32] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1023482|Enable night mode styles on Vector 2022 skin (T362726)]] (duration: 26m 08s) [20:35:55] T362726: [config] Enable night mode styles on Vector 2022 skin - https://phabricator.wikimedia.org/T362726 [20:36:27] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1023160|.nvmrc: Update version from 18.17.0 to 18.20.2]], 
[[gerrit:1023157|Use dedicated Codex style modules (T362986)]], [[gerrit:1023158|Use dedicated Codex style modules (T362986)]] [20:36:33] T362986: Notifications are not usable on mobile version of RTL - https://phabricator.wikimedia.org/T362986 [20:39:03] !log zabe@deploy1002 zabe and jdlrobson: Backport for [[gerrit:1023160|.nvmrc: Update version from 18.17.0 to 18.20.2]], [[gerrit:1023157|Use dedicated Codex style modules (T362986)]], [[gerrit:1023158|Use dedicated Codex style modules (T362986)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:06] jan_drewniak: around? [20:42:25] Jdlrobson: yup, I'm here! [20:42:53] Do we need to test to RTL notifications? [20:43:07] yep that's the remaining patches (for both branches) [20:43:40] I can replicate on https://fa.m.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C?uselang=fa [20:43:53] right now clicking notification bell doesnt show notification drawer [20:43:58] after patch it should go full screen [20:44:04] (make sure you test on iPhone emulator) [20:44:22] zabe: I have to run out so jan_drewniak will take over testing my backport. [20:44:58] ok [20:45:15] thanks zabe for your help and sorry for the last minute switcheroo :) [20:46:55] The notification drawer now appears [20:47:10] hi [20:47:48] hey [20:47:59] (03PS1) 10MusikAnimal: [hewiki] enable CodeMirrorV6 and CodeMirrorLineNumberingNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023501 (https://phabricator.wikimedia.org/T357795) [20:48:03] i forgot the deployment [20:49:13] zabe: Yeah I see the notification drawer now too (the patch is on mwdebug right?) 
[20:49:19] yes [20:49:49] ok it's good to sync then [20:50:13] !log zabe@deploy1002 zabe and jdlrobson: Continuing with sync [20:50:16] cool [20:50:49] zabe excuse me, who is the deployer [20:51:06] I am [20:51:27] we can do your patch after the one currently syncing [20:51:32] sure, thanks [20:51:52] I was busy with my personal website lol, forgot this [20:51:54] zabe: When will the patch be published? [20:53:07] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9737814 (10Jclark-ctr) @Eevans hey sorry about missing the update for being available i did just swap the drive now. When you are recreating the md2 what commands are you running? [20:53:11] which patch [20:53:12] ? [20:53:46] (03PS2) 10NMW03: Added namespace alias for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1022535 (https://phabricator.wikimedia.org/T362645) [20:53:50] (03CR) 10Zabe: [C:03+2] Added namespace alias for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1022535 (https://phabricator.wikimedia.org/T362645) (owner: 10NMW03) [20:54:34] (03Merged) 10jenkins-bot: Added namespace alias for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1022535 (https://phabricator.wikimedia.org/T362645) (owner: 10NMW03) [20:54:51] zabe: patch to T362986: Notifications are not usable on mobile version of RTL [20:54:53] T362986: Notifications are not usable on mobile version of RTL - https://phabricator.wikimedia.org/T362986 [20:55:16] 5 min maybe [20:55:43] zabe: Fine, thank you [20:59:08] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9737870 (10Jclark-ctr) @Eevans this one is out of warranty also let me know if i am able to swap drive i can take care of in morning [21:00:04] zabe: #bothumor My software never has bugs. It just develops random features. 
Rise for New wikis \o/ (at least I'm trying, let us see if addwiki.php lets me; might start earlier if the UTC late backport window is not fully used). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240423T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:01:00] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1023160|.nvmrc: Update version from 18.17.0 to 18.20.2]], [[gerrit:1023157|Use dedicated Codex style modules (T362986)]], [[gerrit:1023158|Use dedicated Codex style modules (T362986)]] (duration: 24m 32s) [21:01:32] T362986: Notifications are not usable on mobile version of RTL - https://phabricator.wikimedia.org/T362986 [21:01:34] jan_drewniak: patches are live [21:01:49] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1022535|Added namespace alias for Azerbaijani Wikiquote (T362645)]] [21:02:11] T362645: Adding "VS" namespace to Azerbaijani Wikiquote - https://phabricator.wikimedia.org/T362645 [21:02:49] zabe: thank you! [21:04:47] !log zabe@deploy1002 zabe and nmw03: Backport for [[gerrit:1022535|Added namespace alias for Azerbaijani Wikiquote (T362645)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:04:49] NMW03: can you test? [21:04:55] sure [21:05:49] LGTM [21:06:00] !log zabe@deploy1002 zabe and nmw03: Continuing with sync [21:06:03] cool [21:08:37] zabe: thanks for handling the window. if you want to ping when backports are done, i can get the abusefilter one out. [21:10:18] (03CR) 10Zabe: [C:03+2] Add afl_var_dump to AbuseLogPager::getQueryInfo [extensions/AbuseFilter] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023159 (https://phabricator.wikimedia.org/T363213) (owner: 10Brennen Bearnes) [21:10:27] brennen: thanks, will do [21:10:54] (+2'ed to get CI running now) [21:13:05] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [21:15:13] (03CR) 10Dzahn: [C:03+2] "the repo has been created and it's owned root:stewards-users. 
A new file created with "sudo touch foo" is group-owned by stewards-users. Y" [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [21:15:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:33] (03PS4) 10Dzahn: stewards: move pathes to parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 [21:15:44] (03CR) 10Dzahn: [C:03+2] stewards: move pathes to parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 (owner: 10Dzahn) [21:16:48] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1022535|Added namespace alias for Azerbaijani Wikiquote (T362645)]] (duration: 14m 58s) [21:17:07] T362645: Adding "VS" namespace to Azerbaijani Wikiquote - https://phabricator.wikimedia.org/T362645 [21:17:10] !log zabe@mwmaint1002:~$ mwscript namespaceDupes.php azwikiquote --fix # T362645 [21:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:26] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: telxius magru-eqiad - ayounsi@cumin1002" [21:17:58] brennen: done with the window, could you ping me when you are done with backporting the abusefilter patch, I would like to try addwiki.php (after making myself some food) :) [21:18:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: telxius magru-eqiad - ayounsi@cumin1002" [21:18:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:18:24] zabe: will do [21:18:29] thanks zabe [21:18:36] yw [21:18:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
brennen@deploy1002 using scap backport" [extensions/AbuseFilter] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023159 (https://phabricator.wikimedia.org/T363213) (owner: 10Brennen Bearnes) [21:24:40] (KubernetesRsyslogDown) firing: rsyslog on mw2413:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2413 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:24:45] (03PS2) 10Zabe: Initial configuration for bewwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023471 (https://phabricator.wikimedia.org/T357866) [21:29:09] (03PS1) 10Urbanecm: stewards-onboarder: Add mediawiki_api to the config [puppet] - 10https://gerrit.wikimedia.org/r/1023505 (https://phabricator.wikimedia.org/T351202) [21:29:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2413:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2413 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:30:15] (03Merged) 10jenkins-bot: Add afl_var_dump to AbuseLogPager::getQueryInfo [extensions/AbuseFilter] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023159 (https://phabricator.wikimedia.org/T363213) (owner: 10Brennen Bearnes) [21:30:29] 10ops-magru, 13Patch-For-Review: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#9738070 (10nshahquinn-wmf) FYI: [wikitech:Magru data center](https://wikitech.wikimedia.org/wiki/Magru_data_center). 
[21:30:48] !log brennen@deploy1002 Started scap: Backport for [[gerrit:1023159|Add afl_var_dump to AbuseLogPager::getQueryInfo (T363213)]] [21:31:12] T363213: PHP Notice: Undefined property: stdClass::$afl_var_dump - https://phabricator.wikimedia.org/T363213 [21:33:23] !log brennen@deploy1002 brennen: Backport for [[gerrit:1023159|Add afl_var_dump to AbuseLogPager::getQueryInfo (T363213)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:35:50] !log brennen@deploy1002 brennen: Continuing with sync [21:37:04] (03PS1) 10Urbanecm: stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) [21:37:24] (03CR) 10CI reject: [V:04-1] stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:38:00] (03PS2) 10Urbanecm: stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) [21:38:20] (03CR) 10CI reject: [V:04-1] stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:38:44] (03PS3) 10Urbanecm: stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) [21:39:04] (03CR) 10CI reject: [V:04-1] stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:39:36] (03PS4) 10Urbanecm: stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) [21:39:53] (03CR) 10Majavah: "you can use git::systemconfig to avoid having a separate file for this" [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:39:56] 
(03CR) 10CI reject: [V:04-1] stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:43:35] (03CR) 10Dzahn: [C:03+2] stewards-onboarder: Add mediawiki_api to the config [puppet] - 10https://gerrit.wikimedia.org/r/1023505 (https://phabricator.wikimedia.org/T351202) (owner: 10Urbanecm) [21:45:35] (03PS5) 10Urbanecm: stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) [21:46:54] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:1023159|Add afl_var_dump to AbuseLogPager::getQueryInfo (T363213)]] (duration: 16m 05s) [21:47:10] zabe: over to you. [21:47:12] T363213: PHP Notice: Undefined property: stdClass::$afl_var_dump - https://phabricator.wikimedia.org/T363213 [21:47:23] (03CR) 10Urbanecm: "thanks for the info! done" [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:47:44] alright:) [21:48:55] (03CR) 10Zabe: [C:03+2] Initial configuration for bewwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023471 (https://phabricator.wikimedia.org/T357866) (owner: 10Zabe) [21:49:45] (03Merged) 10jenkins-bot: Initial configuration for bewwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023471 (https://phabricator.wikimedia.org/T357866) (owner: 10Zabe) [21:50:02] (03CR) 10Dzahn: [C:03+2] stewards: Mark steward repos as safe [puppet] - 10https://gerrit.wikimedia.org/r/1023506 (https://phabricator.wikimedia.org/T361544) (owner: 10Urbanecm) [21:50:10] (03CR) 10Ladsgroup: [C:03+2] Add sysop_plwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1022447 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [21:50:18] (03PS2) 10Zabe: Add sysop_plwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1022447 (https://phabricator.wikimedia.org/T361041) [21:50:25] (03CR) 
10Ladsgroup: [V:03+2] Add sysop_plwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1022447 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [21:50:43] The big moment [21:51:16] addwiki did not crash \o/ [21:52:13] !log create Wikipedia Betawi # T357866 [21:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:30] T357866: Create Wikipedia Betawi - https://phabricator.wikimedia.org/T357866 [21:53:10] zabe: :) nice! [21:53:32] !log zabe@deploy1002 Started scap: Creating bewwiki (T357866) [21:56:12] !log zabe@deploy1002 zabe: Creating bewwiki (T357866) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:58:29] !log zabe@deploy1002 Sync cancelled. [21:58:37] !log zabe@deploy1002 Started scap: Creating bewwiki (T357866) [21:58:54] whops, mistype [21:59:00] T357866: Create Wikipedia Betawi - https://phabricator.wikimedia.org/T357866 [21:59:03] zabe: I try to get the bot running [22:00:22] Thanks! [22:01:47] (03PS1) 10Zabe: Initial configuration for kuswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023511 (https://phabricator.wikimedia.org/T359757) [22:03:59] \p/ [22:04:06] \o/*' [22:04:17] sigh. 
apparently pressing the correct keys on the keyboard today is hard [22:04:35] heh :p [22:09:55] (03CR) 10Zabe: [C:03+2] Initial configuration for kuswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023511 (https://phabricator.wikimedia.org/T359757) (owner: 10Zabe) [22:11:37] !log zabe@deploy1002 Finished scap: Creating bewwiki (T357866) (duration: 12m 59s) [22:11:52] T357866: Create Wikipedia Betawi - https://phabricator.wikimedia.org/T357866 [22:11:57] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=bewwiki --cluster=all 2>&1 | tee /tmp/bewwiki.UpdateSearchIndexConfig.log # T357866 [22:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:28] (03Merged) 10jenkins-bot: Initial configuration for kuswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023511 (https://phabricator.wikimedia.org/T359757) (owner: 10Zabe) [22:14:18] !log create Wikipedia Kusaal # T359757 [22:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:35] T359757: Create Wikipedia Kusaal - https://phabricator.wikimedia.org/T359757 [22:14:47] !log zabe@deploy1002 Started scap: Creating kuswiki (T359757) [22:15:11] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1023156 (owner: 10Herron) [22:17:26] !log zabe@deploy1002 zabe: Creating kuswiki (T359757) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:18:09] !log zabe@deploy1002 zabe: Continuing with sync [22:21:54] uh, does https://meta.wikimedia.org/wiki/Special:CentralAuth/Taavi give a DBQueryError to anyone else? 
[22:22:09] ah [22:22:10] Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'kuswiki' [22:22:26] so I created an account via mwdebug, and then turned it off, so now Special:CA fails until the patch is live everywhere [22:23:29] yeah, it should no longer appear as soon as the patch is live [22:27:44] you do really like to have the lowest of user ids [22:28:05] (03PS1) 10Zabe: Initial configuration for mywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023516 (https://phabricator.wikimedia.org/T361085) [22:28:08] are you still mad about mailman? :D [22:28:13] yup [22:28:25] xD [22:28:54] sorry but also not sorry [22:28:56] !log zabe@deploy1002 Finished scap: Creating kuswiki (T359757) (duration: 14m 10s) [22:29:04] we should just move to uuids to fix this I guess [22:29:11] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=kuswiki --cluster=all 2>&1 | tee /tmp/kuswiki.UpdateSearchIndexConfig.log # T359757 [22:29:11] T359757: Create Wikipedia Kusaal - https://phabricator.wikimedia.org/T359757 [22:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:09] (03CR) 10Zabe: [C:03+2] Initial configuration for mywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023516 (https://phabricator.wikimedia.org/T361085) (owner: 10Zabe) [22:32:08] (03Merged) 10jenkins-bot: Initial configuration for mywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023516 (https://phabricator.wikimedia.org/T361085) (owner: 10Zabe) [22:32:59] !log create Wikisource Burmese # T361085 [22:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:17] T361085: Create Wikisource Burmese - https://phabricator.wikimedia.org/T361085 [22:33:27] !log zabe@deploy1002 Started scap: Creating mywikisource (T361085) [22:33:39] (03PS1) 10Dzahn: admin: add stewards_users to adm group [puppet] - 10https://gerrit.wikimedia.org/r/1023517 
[22:33:49] (03CR) 10CI reject: [V:04-1] admin: add stewards_users to adm group [puppet] - 10https://gerrit.wikimedia.org/r/1023517 (owner: 10Dzahn) [22:36:07] !log zabe@deploy1002 zabe: Creating mywikisource (T361085) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:36:15] !log zabe@deploy1002 zabe: Continuing with sync [22:38:02] (03PS1) 10Zabe: Initial configuration for iglwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023518 (https://phabricator.wikimedia.org/T361644) [22:39:34] (03PS2) 10Zabe: Initial configuration for iglwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023518 (https://phabricator.wikimedia.org/T361644) [22:40:14] (03CR) 10Zabe: [C:03+2] Initial configuration for iglwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023518 (https://phabricator.wikimedia.org/T361644) (owner: 10Zabe) [22:40:17] (03Abandoned) 10Dzahn: admin: add stewards_users to adm group [puppet] - 10https://gerrit.wikimedia.org/r/1023517 (owner: 10Dzahn) [22:41:01] (03Merged) 10jenkins-bot: Initial configuration for iglwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023518 (https://phabricator.wikimedia.org/T361644) (owner: 10Zabe) [22:47:13] !log zabe@deploy1002 Finished scap: Creating mywikisource (T361085) (duration: 13m 45s) [22:47:24] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=mywikisource --cluster=all 2>&1 | tee /tmp/mywikisource.UpdateSearchIndexConfig.log # T361085 [22:47:28] T361085: Create Wikisource Burmese - https://phabricator.wikimedia.org/T361085 [22:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:15] !log create Wikipedia Igala # T361644 [22:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:29] T361644: Create Wikipedia Igala - https://phabricator.wikimedia.org/T361644 [22:49:54] !log zabe@deploy1002 Started scap: Creating iglwiki (T361644) [22:51:11] 
(03PS1) 10Dzahn: stewards: remove "recurse" parameter on private repo dir [puppet] - 10https://gerrit.wikimedia.org/r/1023519 [22:51:31] (03CR) 10CI reject: [V:04-1] stewards: remove "recurse" parameter on private repo dir [puppet] - 10https://gerrit.wikimedia.org/r/1023519 (owner: 10Dzahn) [22:52:34] !log zabe@deploy1002 zabe: Creating iglwiki (T361644) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:52:35] !log zabe@deploy1002 zabe: Continuing with sync [22:52:51] (03PS2) 10Dzahn: stewards: remove "recurse" parameter on private repo dir [puppet] - 10https://gerrit.wikimedia.org/r/1023519 [22:56:31] (03CR) 10Dzahn: [C:03+2] stewards: remove "recurse" parameter on private repo dir [puppet] - 10https://gerrit.wikimedia.org/r/1023519 (owner: 10Dzahn) [22:56:36] (03PS1) 10Zabe: Initial configuration for kaawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023520 (https://phabricator.wikimedia.org/T362135) [22:58:29] zabe: fwiw, bewwiki and kuswiki got the auto-generated tickets, aglwiki did not it seems [22:59:18] nevermind, just couldn't find them because "subtask of subtask", ok [23:00:02] alright [23:00:11] !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host aqs1013.eqiad.wmnet [23:00:37] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9738633 (10ops-monitoring-bot) Host rebooted by eevans@cumin1002 with reason: None [23:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:03:26] !log zabe@deploy1002 Finished scap: Creating iglwiki (T361644) (duration: 13m 32s) [23:03:32] (03CR) 10Zabe: [C:03+2] Initial configuration for kaawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023520 (https://phabricator.wikimedia.org/T362135) 
(owner: 10Zabe) [23:03:44] T361644: Create Wikipedia Igala - https://phabricator.wikimedia.org/T361644 [23:03:52] (ProbeDown) firing: (4) Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:58] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=iglwiki --cluster=all 2>&1 | tee /tmp/iglwiki.UpdateSearchIndexConfig.log # T362135 [23:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:09] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=iglwiki --cluster=all 2>&1 | tee /tmp/iglwiki.UpdateSearchIndexConfig.log # T361644 [23:04:13] T362135: Create Wiktionary Karakalpak - https://phabricator.wikimedia.org/T362135 [23:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:20] (03Merged) 10jenkins-bot: Initial configuration for kaawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023520 (https://phabricator.wikimedia.org/T362135) (owner: 10Zabe) [23:04:40] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9738653 (10Eevans) >>! In T362841#9737870, @Jclark-ctr wrote: > @Eevans this one is out of warranty also let me know if i am able to swap drive i can take care of in morning `lang=sh-session eevans@aqs1014... 
[23:06:23] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1013.eqiad.wmnet [23:06:26] !log create Wiktionary Karakalpak # T362135 [23:06:30] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T363280 (10ops-monitoring-bot) 03NEW [23:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:22] !log zabe@deploy1002 Started scap: Creating kaawiktionary (T362135) [23:08:52] (ProbeDown) resolved: (4) Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:09:59] !log zabe@deploy1002 zabe: Creating kaawiktionary (T362135) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:10:04] !log zabe@deploy1002 zabe: Continuing with sync [23:10:16] T362135: Create Wiktionary Karakalpak - https://phabricator.wikimedia.org/T362135 [23:15:51] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9738688 (10Eevans) Here is a transcript of everything done (for posterity sake): `lang=sh-session eevans@aqs1013:~$ sudo sgdisk -R /dev/sde /dev/sdg Warning: Partition table header claims that the size of p... 
[23:17:29] (03PS1) 10Zabe: Initial configuration for mswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023523 (https://phabricator.wikimedia.org/T363039) [23:19:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [23:19:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [23:19:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T352010)', diff saved to https://phabricator.wikimedia.org/P61128 and previous config saved to /var/cache/conftool/dbconfig/20240423-231923-ladsgroup.json [23:19:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:20:56] !log zabe@deploy1002 Finished scap: Creating kaawiktionary (T362135) (duration: 13m 34s) [23:21:09] T362135: Create Wiktionary Karakalpak - https://phabricator.wikimedia.org/T362135 [23:21:21] (03CR) 10Zabe: [C:03+2] Initial configuration for mswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023523 (https://phabricator.wikimedia.org/T363039) (owner: 10Zabe) [23:22:03] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=kaawiktionary --cluster=all 2>&1 | tee /tmp/kaawiktionary.UpdateSearchIndexConfig.log # T362135 [23:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:11] (03Merged) 10jenkins-bot: Initial configuration for mswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023523 (https://phabricator.wikimedia.org/T363039) (owner: 10Zabe) [23:23:39] !log create Wikisource Malay # T363039 [23:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:49] T363039: Create Wikisource Malay - https://phabricator.wikimedia.org/T363039 [23:23:59] !log zabe@deploy1002 Started scap: Creating mswikisource (T363039) [23:26:38] 
!log zabe@deploy1002 zabe: Creating mswikisource (T363039) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:27:38] !log zabe@deploy1002 zabe: Continuing with sync [23:28:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 873ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:30:32] (PS1) Zabe: Initial configuration for kawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1023546 (https://phabricator.wikimedia.org/T363085) [23:33:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 825.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:36:36] (CR) Zabe: [C:+2] Initial configuration for kawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1023546 (https://phabricator.wikimedia.org/T363085) (owner: Zabe) [23:37:21] (Merged) jenkins-bot: Initial configuration for kawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1023546 (https://phabricator.wikimedia.org/T363085) (owner: Zabe) [23:38:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 821.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:38:27]
(PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023527 [23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023527 (owner: TrainBranchBot) [23:38:59] !log zabe@deploy1002 Finished scap: Creating mswikisource (T363039) (duration: 15m 00s) [23:39:07] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=mswikisource --cluster=all 2>&1 | tee /tmp/mswikisource.UpdateSearchIndexConfig.log # T363039 [23:39:14] T363039: Create Wikisource Malay - https://phabricator.wikimedia.org/T363039 [23:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:47] !log create Wikisource Georgian # T363085 [23:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:04] T363085: Create Wikisource Georgian - https://phabricator.wikimedia.org/T363085 [23:42:05] !log zabe@deploy1002 Started scap: Creating kawikisource (T363085) [23:44:55] !log zabe@deploy1002 zabe: Creating kawikisource (T363085) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:45:54] !log zabe@deploy1002 zabe: Continuing with sync [23:48:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 800.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:50:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 834.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:55:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 833.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:56:03] (PS1) Zabe: Set timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1023550 (https://phabricator.wikimedia.org/T360310) [23:56:45] !log zabe@deploy1002 Finished scap: Creating kawikisource (T363085) (duration: 14m 40s) [23:56:52] T363085: Create Wikisource Georgian - https://phabricator.wikimedia.org/T363085 [23:57:06] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=kawikisource --cluster=all 2>&1 | tee /tmp/kawikisource.UpdateSearchIndexConfig.log # T363085 [23:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023527 (owner: TrainBranchBot) [23:58:17] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1023528 [23:58:17] (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1023528 (owner: Zabe) [23:58:58] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1023528 (owner: Zabe) [23:59:13] (CR) Zabe: [C:+2] Set timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1023550
(https://phabricator.wikimedia.org/T360310) (owner: Zabe) [23:59:59] (Merged) jenkins-bot: Set timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1023550 (https://phabricator.wikimedia.org/T360310) (owner: Zabe)