[01:16:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:23:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:30:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T356166)', diff saved to https://phabricator.wikimedia.org/P60169 and previous config saved to /var/cache/conftool/dbconfig/20240410-033019-marostegui.json [03:30:23] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [03:45:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P60170 and previous config saved to /var/cache/conftool/dbconfig/20240410-034526-marostegui.json [04:00:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P60171 and previous config saved to /var/cache/conftool/dbconfig/20240410-040033-marostegui.json [04:15:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T356166)', diff saved to https://phabricator.wikimedia.org/P60172 and previous config saved to /var/cache/conftool/dbconfig/20240410-041541-marostegui.json [04:15:44] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance [04:15:45] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [04:15:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance [04:16:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T356166)', diff saved to https://phabricator.wikimedia.org/P60173 and previous config saved to /var/cache/conftool/dbconfig/20240410-041604-marostegui.json [04:46:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Kernel reboot [04:46:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Kernel reboot [04:49:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60174 and previous config saved to /var/cache/conftool/dbconfig/20240410-044928-root.json [04:52:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 11.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:55:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad api_appserver POST/200: ... 
[04:55:15] 0.4246627549440708s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:55:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223 T362134', diff saved to https://phabricator.wikimedia.org/P60175 and previous config saved to /var/cache/conftool/dbconfig/20240410-045534-marostegui.json [04:55:40] T362134: Upgrade s3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362134 [04:56:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db1223', diff saved to https://phabricator.wikimedia.org/P60176 and previous config saved to /var/cache/conftool/dbconfig/20240410-045632-marostegui.json [04:57:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166 T362134', diff saved to https://phabricator.wikimedia.org/P60177 and previous config saved to /var/cache/conftool/dbconfig/20240410-045710-marostegui.json [04:57:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.01% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:58:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1166.eqiad.wmnet with OS bookworm [05:00:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad api_appserver POST/200: ... 
[05:00:15] 0.4246627549440708s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:04:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60178 and previous config saved to /var/cache/conftool/dbconfig/20240410-050434-root.json [05:10:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage [05:12:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage [05:16:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60179 and previous config saved to /var/cache/conftool/dbconfig/20240410-051939-root.json [05:28:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60180 and previous config saved to /var/cache/conftool/dbconfig/20240410-052854-root.json [05:33:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1166.eqiad.wmnet with OS bookworm [05:34:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60181 and previous config saved to 
/var/cache/conftool/dbconfig/20240410-053445-root.json [05:44:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60182 and previous config saved to /var/cache/conftool/dbconfig/20240410-054400-root.json [05:47:54] fixed [05:49:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60183 and previous config saved to /var/cache/conftool/dbconfig/20240410-054952-root.json [05:55:00] marostegui: what have you fixed? :) [05:55:38] asking cause we had an unbreak now about VisualEditor not being able to save draft parsoid html ( https://phabricator.wikimedia.org/T362210 ) [05:56:07] and that magically resolved ( https://grafana.wikimedia.org/d/t_x3DEu4k/parsoid-health?forceLogin=&from=1712706703415&orgId=1&to=1712728303415&refresh=15m&viewPanel=6 ) [05:56:07] :) [05:57:05] We had a p4ge [05:59:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60184 and previous config saved to /var/cache/conftool/dbconfig/20240410-055906-root.json [05:59:49] marostegui: what page was it? Cause non sre don't get them so I can't know what has happened [06:00:04] looks like some DB went wild maybe? [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0600) [06:00:30] It was a db from x2 yeah [06:01:12] ah I see you commented on the task :) thanks! [06:01:39] I didn't close it yet but I'm sure it was the same thing [06:02:26] subbu: so my guess is we can remove the train blocker [06:03:04] yes .. it also impacted dewiki which didn't have the train roll out to yet. 
[06:03:05] and maybe want to investigate why `HtmlOutputRendererHelper` errors are not logged anywhere (or at least I haven't found them) [06:04:07] or I misunderstood the MediaWiki code [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:26] anyway that seems solved, and I am going to have breakfast with kids [06:04:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60185 and previous config saved to /var/cache/conftool/dbconfig/20240410-060457-root.json [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60186 and previous config saved to /var/cache/conftool/dbconfig/20240410-061411-root.json [06:20:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60187 and previous config saved to /var/cache/conftool/dbconfig/20240410-062003-root.json [06:21:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 5%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60188 and previous config saved to /var/cache/conftool/dbconfig/20240410-062114-arnaudb.json [06:29:18] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60189 and previous config saved to /var/cache/conftool/dbconfig/20240410-062917-root.json [06:36:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 10%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60190 and previous config saved to /var/cache/conftool/dbconfig/20240410-063620-arnaudb.json [06:37:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60191 and previous config saved to /var/cache/conftool/dbconfig/20240410-063734-arnaudb.json [06:44:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60192 and previous config saved to /var/cache/conftool/dbconfig/20240410-064423-root.json [06:51:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 20%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60193 and previous config saved to /var/cache/conftool/dbconfig/20240410-065125-arnaudb.json [06:52:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60194 and previous config saved to /var/cache/conftool/dbconfig/20240410-065239-arnaudb.json [06:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60195 and previous config saved to /var/cache/conftool/dbconfig/20240410-065929-root.json [07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0700). nyaa~ [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:26] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9703016 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff [07:06:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60196 and previous config saved to /var/cache/conftool/dbconfig/20240410-070631-arnaudb.json [07:07:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60197 and previous config saved to /var/cache/conftool/dbconfig/20240410-070745-arnaudb.json [07:21:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60198 and previous config saved to /var/cache/conftool/dbconfig/20240410-072137-arnaudb.json [07:22:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60199 and previous config saved to /var/cache/conftool/dbconfig/20240410-072253-arnaudb.json [07:25:58] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::server::spare [07:29:42] !log akosiaris@deploy1002 Synchronized wmf-config/mc.php: Dummy sync for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018332 (duration: 14m 03s) [07:33:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::server::spare [07:36:36] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9703049 (MoritzMuehlenhoff) [07:36:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60200 
and previous config saved to /var/cache/conftool/dbconfig/20240410-073644-arnaudb.json [07:37:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60201 and previous config saved to /var/cache/conftool/dbconfig/20240410-073759-arnaudb.json [07:50:13] SRE, Phabricator, Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#9703058 (Aklapper) For archaeology researchers: This functionality got broken/removed in February 2016 by https://gerri... [07:50:23] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3070.esams.wmnet [07:51:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60202 and previous config saved to /var/cache/conftool/dbconfig/20240410-075150-arnaudb.json [07:52:01] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3070.esams.wmnet with OS bullseye [07:52:11] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9703059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3070.esams.wmnet with OS bullseye [07:53:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60203 and previous config saved to /var/cache/conftool/dbconfig/20240410-075304-arnaudb.json [07:54:44] (CR) Majavah: [C:+2] Remove names for old cloudmetrics redirects [dns] - https://gerrit.wikimedia.org/r/1018312 (owner: Majavah) [07:55:50] (PS2) Muehlenhoff: Add stoyofuku to analytics-privatedata-access [puppet] - https://gerrit.wikimedia.org/r/1018634 (https://phabricator.wikimedia.org/T362113) [07:56:30] !log 
installing glibc security updates on bullseye [07:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:22] (PS1) Slyngshede: Change ssh key validator from class to function. [software/bitu] - https://gerrit.wikimedia.org/r/1018635 [08:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0800) [08:00:13] (CR) Slyngshede: [C:+2] API: Username validation API. [software/bitu] - https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) (owner: Slyngshede) [08:01:17] (Merged) jenkins-bot: API: Username validation API. [software/bitu] - https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) (owner: Slyngshede) [08:04:16] (PS1) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [08:06:37] (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1018638 (https://phabricator.wikimedia.org/T360158) [08:06:40] (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1018638 (https://phabricator.wikimedia.org/T360158) (owner: TrainBranchBot) [08:07:14] (CR) Hashar: "I have no idea of how many logs that would generate and what kind of pressure that can adds to the logging stack." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar) [08:07:22] (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1018638 (https://phabricator.wikimedia.org/T360158) (owner: TrainBranchBot) [08:08:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60204 and previous config saved to /var/cache/conftool/dbconfig/20240410-080810-arnaudb.json [08:11:05] (PS1) Slyngshede: C:idm::deployment Add Django REST Framework. [puppet] - https://gerrit.wikimedia.org/r/1018640 [08:11:07] (CR) Muehlenhoff: [C:+2] puppetdb::microservice: Use firewall::service [puppet] - https://gerrit.wikimedia.org/r/1017769 (owner: Muehlenhoff) [08:11:37] (CR) Kevin Bazira: [C:+1] ml-services: deploy mistral-7b-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1018633 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos) [08:15:03] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3070.esams.wmnet with reason: host reimage [08:15:26] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018640 (owner: Slyngshede) [08:15:42] (CR) Slyngshede: [C:+2] C:idm::deployment Add Django REST Framework. 
[puppet] - https://gerrit.wikimedia.org/r/1018640 (owner: Slyngshede) [08:18:38] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3070.esams.wmnet with reason: host reimage [08:21:13] (CR) Ilias Sarantopoulos: [C:+2] ml-services: deploy mistral-7b-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1018633 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos) [08:21:27] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.26 refs T360158 [08:21:33] T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158 [08:21:47] (PS1) Muehlenhoff: Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) [08:22:07] (Merged) jenkins-bot: ml-services: deploy mistral-7b-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1018633 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos) [08:22:52] (CR) CI reject: [V:-1] Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) (owner: Muehlenhoff) [08:23:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60205 and previous config saved to /var/cache/conftool/dbconfig/20240410-082316-arnaudb.json [08:24:47] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:24:59] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:25:13] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:25:15] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:25:43] !log pfischer@deploy1002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:25:53] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:25:57] (PS2) Muehlenhoff: Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) [08:34:20] !log gmodena@deploy1002 Started deploy [airflow-dags/analytics@46818a3]: Deploying cassandra_load_pageview_top_articles changes MR#648 [08:34:32] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.42.0-wmf.26 refs T360158 (duration: 13m 05s) [08:34:38] T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158 [08:34:54] !log gmodena@deploy1002 Finished deploy [airflow-dags/analytics@46818a3]: Deploying cassandra_load_pageview_top_articles changes MR#648 (duration: 00m 33s) [08:35:49] looks like it is working [08:35:53] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:36:21] (PS1) EoghanGaffney: gitlab: Fix typo in systemctl timer command [puppet] - https://gerrit.wikimedia.org/r/1018642 [08:36:55] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9703148 (BTullis) [08:38:00] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1018642 (owner: EoghanGaffney) [08:38:11] (CR) EoghanGaffney: [C:+2] gitlab: Fix typo in systemctl timer command [puppet] - https://gerrit.wikimedia.org/r/1018642 (owner: EoghanGaffney) [08:38:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60206 and previous config saved to /var/cache/conftool/dbconfig/20240410-083822-arnaudb.json [08:39:13] SRE, cloud-services-team, Data-Services, Infrastructure-Foundations: Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914#9703154 (taavi) Open→Invalid Closing as we've moved the NFS servers to Cloud VPS VMs and I'm pretty sure we did... 
[08:39:17] (PS1) Filippo Giunchedi: titan: trim 5m retention to 3y + 2w [puppet] - https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) [08:41:00] (CR) Btullis: [C:+2] Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [08:41:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 27.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:42:03] (Merged) jenkins-bot: Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: Btullis) [08:42:28] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3070.esams.wmnet with OS bullseye [08:42:43] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9703163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3070.esams.wmnet with OS bullseye completed: - cp3070 (**PASS**)... 
[08:44:03] (PS2) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [08:46:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 27.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:49:01] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3070.esams.wmnet [08:50:30] jouncebot: nowandnext [08:50:31] For the next 1 hour(s) and 9 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0800) [08:50:31] In 1 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1000) [08:53:06] (CR) Filippo Giunchedi: [C:+2] Use oauth2-proxy for opensearch dashboards [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi) [08:56:20] (PS1) Ilias Sarantopoulos: ml-services: fix indentation in mistral model resources [deployment-charts] - https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) [08:58:18] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9703181 (Fabfur) [09:07:31] (PS1) Filippo Giunchedi: hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818) [09:07:49] (CR) CI reject: [V:-1] hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818) (owner: Filippo 
Giunchedi) [09:13:28] (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:16:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:33] (CR) Filippo Giunchedi: [C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi) [09:21:45] !jouncebot now [09:21:45] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [09:21:54] jouncebot: now [09:21:54] For the next 0 hour(s) and 38 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0800) [09:23:05] (PS1) Arnaudb: mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) [09:25:40] (PS2) Filippo Giunchedi: hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818) [09:26:25] SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9703207 (fgiunchedi) [09:28:13] (CR) Marostegui: mariadb: create new account and database on m5 for striker_toolsbeta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb) [09:28:39] (PS2) Fabfur: prometheus: add aggregate metrics for benthos 
[puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) [09:29:17] (PS2) Arnaudb: mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) [09:29:30] (CR) Arnaudb: mariadb: create new account and database on m5 for striker_toolsbeta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb) [09:32:46] (CR) Vgutierrez: [C:+1] "thanks for submitting this" [puppet] - https://gerrit.wikimedia.org/r/1018355 (https://phabricator.wikimedia.org/T362197) (owner: BCornwall) [09:40:28] !log jiji@deploy1002 Started scap: (no justification provided) [09:41:52] (CR) Marostegui: "Remember to drop those users with: drop user if exists 'USERNAME'@'IPS_REMOVED';" [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb) [09:42:45] (CR) Marostegui: [C:+1] mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb) [09:42:53] !log running scap sync-world to rebuild mw image and pick up gerrit:1015338 [09:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:41] (CR) Arnaudb: [C:+2] mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb) [09:45:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 966.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:50:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 961.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:51:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:52:02] (PS1) Majavah: Update example Striker hiera [labs/private] - https://gerrit.wikimedia.org/r/1018652 [09:52:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:52:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T360332)', diff saved to https://phabricator.wikimedia.org/P60207 and previous config saved to /var/cache/conftool/dbconfig/20240410-095214-arnaudb.json [09:52:23] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:54:33] (CR) Majavah: [V:+2 C:+2] Update example Striker hiera [labs/private] - https://gerrit.wikimedia.org/r/1018652 (owner: Majavah) [09:55:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T360332)', diff saved to https://phabricator.wikimedia.org/P60208 and previous config saved to /var/cache/conftool/dbconfig/20240410-095508-arnaudb.json [09:55:50] (PS4) Majavah: P:wmcs::striker: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) [09:57:30] (CR) Filippo Giunchedi: [C:+2] hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818) 
(owner: Filippo Giunchedi) [09:57:34] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1839/co" [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [09:58:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::sanitarium_master [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1000) [10:01:48] (PS1) Muehlenhoff: Switch mariadb::sanitarium_master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1018653 (https://phabricator.wikimedia.org/T349619) [10:02:02] (PS2) Muehlenhoff: Switch mariadb::sanitarium_master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1018653 (https://phabricator.wikimedia.org/T349619) [10:02:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:02:38] (PS1) Filippo Giunchedi: opensearch: fix sso support [puppet] - https://gerrit.wikimedia.org/r/1018654 (https://phabricator.wikimedia.org/T337818) [10:03:23] (PS1) Clément Goubert: kubernetes: move 6 eqiad api_appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018655 (https://phabricator.wikimedia.org/T351074) [10:03:47] (CR) Muehlenhoff: [C:+2] Switch mariadb::sanitarium_master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1018653 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [10:04:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.11s -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:06:03] (CR) Filippo Giunchedi: [C:+2] opensearch: fix sso support [puppet] - https://gerrit.wikimedia.org/r/1018654 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi) [10:07:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:08:28] !log jiji@deploy1002 Finished scap: (no justification provided) (duration: 27m 59s) [10:08:49] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [10:09:25] (PS1) Majavah: Add toolsadmin-toolsbeta [dns] - https://gerrit.wikimedia.org/r/1018656 (https://phabricator.wikimedia.org/T360025) [10:09:31] (CR) Alexandros Kosiaris: "It doesn't impact scandium at all. The only user of this destination was RESTBase and now it uses the mw-parsoid destination."
[puppet] - https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris) [10:10:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P60209 and previous config saved to /var/cache/conftool/dbconfig/20240410-101015-arnaudb.json [10:11:31] (CR) Clément Goubert: [C:+1] Clean up all the RESTBase hosts's parsoid uri changes [puppet] - https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris) [10:11:51] (CR) Majavah: [V:+1 C:+2] P:wmcs::striker: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [10:12:02] (CR) Clément Goubert: [C:+1] services_proxy: Remove parsoid-php, parsoid-async [puppet] - https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris) [10:12:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::sanitarium_master [10:13:28] (JobUnavailable) resolved: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 872.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:14:22] (PS1) Filippo Giunchedi: opensearch: use Sensitive[String] for sso secrets [puppet] -
https://gerrit.wikimedia.org/r/1018657 (https://phabricator.wikimedia.org/T337818) [10:14:45] (CR) Filippo Giunchedi: [V:+2 C:+2] opensearch: use Sensitive[String] for sso secrets [puppet] - https://gerrit.wikimedia.org/r/1018657 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi) [10:16:11] !log Disabling puppet on O:docker_registry_ha::registry - T360636 [10:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:22] T360636: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636 [10:16:54] (CR) Clément Goubert: [V:+1 C:+2] docker_registry_ha: Migrate to cfssl [puppet] - https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) (owner: Clément Goubert) [10:17:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 T362036', diff saved to https://phabricator.wikimedia.org/P60210 and previous config saved to /var/cache/conftool/dbconfig/20240410-101746-root.json [10:17:50] T362036: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T362036 [10:18:31] (PS1) Marostegui: db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018658 [10:18:40] !log Enabling and running puppet on registry1003.eqiad.wmnet - T360636 [10:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:45] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9703294 (MoritzMuehlenhoff) [10:19:15] (CR) Marostegui: [C:+2] db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018658 (owner: Marostegui) [10:19:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS bookworm [10:21:12] !log Enabling and running puppet on O:docker_registry_ha::registry - T360636 [10:21:16] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:05] (PS1) Filippo Giunchedi: opensearch: move apache-auth-sso.erb to the right location [puppet] - https://gerrit.wikimedia.org/r/1018659 (https://phabricator.wikimedia.org/T337818) [10:22:26] (PS1) Alexandros Kosiaris: Remove parsoid-php certificates from mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) [10:22:27] (PS1) Alexandros Kosiaris: fixtures: Rename all parsoid-php references [deployment-charts] - https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) [10:22:28] (CR) Filippo Giunchedi: [V:+2 C:+2] opensearch: move apache-auth-sso.erb to the right location [puppet] - https://gerrit.wikimedia.org/r/1018659 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi) [10:25:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P60211 and previous config saved to /var/cache/conftool/dbconfig/20240410-102523-arnaudb.json [10:26:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.052s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:26:19] SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9703307 (Clement_Goubert) [10:27:31] SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9703310 (Clement_Goubert) chartmuseum and docker-registry done [10:27:55] (PS1) Muehlenhoff: puppetboard: Remove obsolete cert [puppet] -
https://gerrit.wikimedia.org/r/1018662 [10:28:57] (PS1) Muehlenhoff: puppetboard: Remove obsolete cert [labs/private] - https://gerrit.wikimedia.org/r/1018663 [10:31:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 994.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:32:24] (PS1) Majavah: hieradata: Add Striker toolsbeta instance [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) [10:32:27] (PS1) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) [10:32:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage [10:33:28] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:39] (CR) Muehlenhoff: [V:+2 C:+2] puppetboard: Remove obsolete cert [labs/private] - https://gerrit.wikimedia.org/r/1018663 (owner: Muehlenhoff) [10:34:33] (PS2) Majavah: hieradata: Add Striker toolsbeta instance [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) [10:34:33] (PS2) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) [10:34:39] (CR) Muehlenhoff: [V:+2 C:+2] puppetboard: Remove obsolete cert [puppet] -
https://gerrit.wikimedia.org/r/1018662 (owner: Muehlenhoff) [10:35:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage [10:36:05] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1841/co" [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [10:38:21] (PS1) Filippo Giunchedi: opensearch: set vhost and issuer url for dashboards sso test [puppet] - https://gerrit.wikimedia.org/r/1018667 (https://phabricator.wikimedia.org/T337818) [10:40:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T360332)', diff saved to https://phabricator.wikimedia.org/P60212 and previous config saved to /var/cache/conftool/dbconfig/20240410-104030-arnaudb.json [10:40:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:40:40] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:40:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:40:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T360332)', diff saved to https://phabricator.wikimedia.org/P60213 and previous config saved to /var/cache/conftool/dbconfig/20240410-104053-arnaudb.json [10:43:38] (CR) Clément Goubert: "Bunch of nitpicking to make ports match up with the actual services_proxy" [deployment-charts] - https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris) [10:43:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T360332)', diff saved to https://phabricator.wikimedia.org/P60214 and previous
config saved to /var/cache/conftool/dbconfig/20240410-104345-arnaudb.json [10:45:20] (CR) Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [10:45:28] (CR) Effie Mouzeli: [C:+2] mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [10:46:18] (PS5) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [10:46:57] (Merged) jenkins-bot: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [10:46:58] (PS6) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [10:47:11] (PS1) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018382 [10:47:23] (PS2) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018382 [10:48:28] (JobUnavailable) resolved: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:48:56] (CR) Clément Goubert: Remove parsoid-php certificates from mw deployments (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris) [10:49:20] (PS1) Mvolz: Revert "Revert "citoid: pipeline bot promote"" [deployment-charts] -
https://gerrit.wikimedia.org/r/1018383 [10:50:54] (CR) Muehlenhoff: [C:+2] Switch testreduce to cfssl [puppet] - https://gerrit.wikimedia.org/r/1018199 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff) [10:52:51] (CR) David Caro: [C:+1] "🎉 yay" [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [10:53:08] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:53:12] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:53:19] (CR) Majavah: [V:+1 C:+2] hieradata: Add Striker toolsbeta instance [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [10:53:50] (CR) Marostegui: [C:+2] Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018382 (owner: Marostegui) [10:54:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60215 and previous config saved to /var/cache/conftool/dbconfig/20240410-105444-root.json [10:55:03] (PS1) Effie Mouzeli: mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 [10:55:54] (CR) Clément Goubert: [C:+1] "A nit on the commit message so we don't confuse ourselves, otherwise LGTM."
[deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [10:56:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 850.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:56:19] (CR) Clément Goubert: [C:+1] mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 (owner: Effie Mouzeli) [10:56:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1175.eqiad.wmnet with OS bookworm [10:57:38] SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703351 (Mvolz) I notice that Zotero is not part of this dashboard: https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?orgId=1 Is there a re...
[10:58:39] (CR) Effie Mouzeli: [C:+1] kubernetes: move 6 eqiad api_appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018655 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [10:58:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P60216 and previous config saved to /var/cache/conftool/dbconfig/20240410-105852-arnaudb.json [10:58:53] (CR) Effie Mouzeli: [C:+1] DNS-related cookbooks: adapt for conftool state [cookbooks] - https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: Volans) [10:59:22] (CR) Effie Mouzeli: [C:+2] mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 (owner: Effie Mouzeli) [10:59:42] !log Depooling mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074 [10:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:46] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1100).
[11:00:56] (Merged) jenkins-bot: mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 (owner: Effie Mouzeli) [11:01:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 850.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:01:26] (CR) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [11:01:28] (PS1) Muehlenhoff: Remove certs for docker-registry and testreduce [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) [11:01:29] (PS7) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [11:01:52] SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9703359 (MoritzMuehlenhoff) [11:02:18] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff) [11:02:19] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:02:19] (CR) Clément Goubert: [C:+2] kubernetes: move 6 eqiad api_appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018655 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [11:02:23] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:02:49] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:03:03]
(CR) Mvolz: [C:+2] Revert "Revert "citoid: pipeline bot promote"" [deployment-charts] - https://gerrit.wikimedia.org/r/1018383 (owner: Mvolz) [11:03:26] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:03:58] (Merged) jenkins-bot: Revert "Revert "citoid: pipeline bot promote"" [deployment-charts] - https://gerrit.wikimedia.org/r/1018383 (owner: Mvolz) [11:05:09] SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703363 (Clement_Goubert) I think it's because monitoring is disabled in the service's `values.yaml` [11:07:05] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:07:07] (PS1) Clément Goubert: zotero: Turn on monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) [11:07:36] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:07:59] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:08:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1421.eqiad.wmnet with OS bullseye [11:08:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1422.eqiad.wmnet with OS bullseye [11:08:37] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:09:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1491.eqiad.wmnet with OS bullseye [11:09:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1492.eqiad.wmnet with OS bullseye [11:09:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60217 and previous config saved to /var/cache/conftool/dbconfig/20240410-110949-root.json [11:10:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1493.eqiad.wmnet with OS
bullseye [11:12:31] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:12:48] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:13:29] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:14:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P60218 and previous config saved to /var/cache/conftool/dbconfig/20240410-111400-arnaudb.json [11:14:02] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:15:11] (PS3) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) [11:15:11] (PS1) Majavah: P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025) [11:15:15] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:15:21] (PS1) Slyngshede: SSH Keymanagement: Fix label on SSH public key field.
[software/bitu] - https://gerrit.wikimedia.org/r/1018676 (https://phabricator.wikimedia.org/T362049) [11:15:52] (CR) CI reject: [V:-1] P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [11:15:55] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:15:55] (CR) Clément Goubert: [C:+1] Remove certs for docker-registry and testreduce [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff) [11:16:24] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:16:40] (PS2) Majavah: P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025) [11:16:40] (PS4) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) [11:17:08] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:56] (CR) Muehlenhoff: [C:+2] Remove certs for docker-registry and testreduce [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff) [11:18:36] (CR) Kevin Bazira: [C:+1] ml-services: fix indentation in mistral model resources [deployment-charts] - https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos) [11:18:47] The appservers unreachable alert is a false positive due to reimaging [11:18:55] looking at the httpbb issue [11:19:24] jouncebot: now [11:19:30] For the next 0 hour(s) and 40 minute(s): Services
– Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1100) [11:19:37] (PS1) Muehlenhoff: Remove obsolete dummy certs for docker-registry and testreduce [labs/private] - https://gerrit.wikimedia.org/r/1018678 (https://phabricator.wikimedia.org/T360636) [11:19:49] !log jiji@deploy1002 Started scap: Deploy chart changes in gerrit:1015342 [11:19:53] (CR) Majavah: [C:+2] P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025) (owner: Majavah) [11:20:09] httpbb issue was transient [11:20:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:20:41] (CR) Clément Goubert: [C:+1] Remove obsolete dummy certs for docker-registry and testreduce [labs/private] - https://gerrit.wikimedia.org/r/1018678 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff) [11:21:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1421.eqiad.wmnet with reason: host reimage [11:21:25] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:27] (CR) Slyngshede: [C:+2] SSH Keymanagement: Fix label on SSH public key field.
[software/bitu] - https://gerrit.wikimedia.org/r/1018676 (https://phabricator.wikimedia.org/T362049) (owner: Slyngshede) [11:21:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1422.eqiad.wmnet with reason: host reimage [11:21:39] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete dummy certs for docker-registry and testreduce [labs/private] - https://gerrit.wikimedia.org/r/1018678 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff) [11:22:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1491.eqiad.wmnet with reason: host reimage [11:22:31] (Merged) jenkins-bot: SSH Keymanagement: Fix label on SSH public key field. [software/bitu] - https://gerrit.wikimedia.org/r/1018676 (https://phabricator.wikimedia.org/T362049) (owner: Slyngshede) [11:22:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage [11:23:01] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1493.eqiad.wmnet with reason: host reimage [11:24:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1421.eqiad.wmnet with reason: host reimage [11:24:43] (CR) Mvolz: [C:+1] "LGTM but I'm not sure in retrospect the spec.yaml for Zotero will work-" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert) [11:24:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60219 and previous config saved to /var/cache/conftool/dbconfig/20240410-112455-root.json [11:26:25] (SystemdUnitFailed) resolved: (5) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1493.eqiad.wmnet with reason: host reimage [11:28:07] !log jiji@deploy1002 Finished scap: Deploy chart changes in gerrit:1015342 (duration: 08m 18s) [11:28:54] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T360332)', diff saved to https://phabricator.wikimedia.org/P60220 and previous config saved to /var/cache/conftool/dbconfig/20240410-112907-arnaudb.json [11:29:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:29:12] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:29:15] (PS5) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) [11:29:15] (PS1) Majavah: P:wmcs::striker::docker: bind on 0.0.0.0 instead [puppet] - https://gerrit.wikimedia.org/r/1018679 [11:29:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:29:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T360332)', diff saved to https://phabricator.wikimedia.org/P60221 and previous config saved to /var/cache/conftool/dbconfig/20240410-112929-arnaudb.json [11:29:56] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen -
https://phabricator.wikimedia.org/T357750#9703432 (10MoritzMuehlenhoff) [11:30:54] (03CR) 10Clément Goubert: "Hey Filippo, can you weigh in on swagger monitoring for zotero please?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: 10Clément Goubert) [11:31:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1422.eqiad.wmnet with reason: host reimage [11:31:40] (SystemdUnitFailed) firing: (8) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:31:55] (SystemdUnitFailed) firing: (10) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:32:00] (03CR) 10Alexandros Kosiaris: [C:04-1] "This would instruct prometheus to scrape zotero (or at least the sidecar statsd-exporter living next to zotero that exposes metrics from z" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: 10Clément Goubert) [11:32:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T360332)', diff saved to https://phabricator.wikimedia.org/P60222 and previous config saved to /var/cache/conftool/dbconfig/20240410-113220-arnaudb.json [11:33:20] (03CR) 10Majavah: [C:03+2] P:wmcs::striker::docker: bind on 0.0.0.0 instead [puppet] - 10https://gerrit.wikimedia.org/r/1018679 (owner: 10Majavah) [11:34:12] (03CR) 10Clément Goubert: "Yeah, that's what I was starting to piece together. 
I think we need to add the swagger probe type to the service definition, but I am unsu" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: 10Clément Goubert) [11:34:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1491.eqiad.wmnet with reason: host reimage [11:36:40] (SystemdUnitFailed) firing: (10) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:55] (SystemdUnitFailed) resolved: (10) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:47] 06SRE, 10Citoid, 06serviceops, 13Patch-For-Review: 14Create a readiness probe for zotero - 14https://phabricator.wikimedia.org/T213689#9703471 (10Clement_Goubert) 14Summing up the discussion on the patch set, this is not what is wanted, turning monitoring on in the service would turn on prometheus met... 
[11:38:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage [11:38:52] (PS1) Btullis: Update third-party/matomo repository definition [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) [11:38:54] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:39:18] (PS1) Mvolz: Revert "Update zotero to node18" [deployment-charts] - https://gerrit.wikimedia.org/r/1018686 (https://phabricator.wikimedia.org/T361728) [11:39:44] (Abandoned) Mvolz: Revert "Update zotero to node18" [deployment-charts] - https://gerrit.wikimedia.org/r/1018686 (https://phabricator.wikimedia.org/T361728) (owner: Mvolz) [11:40:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60223 and previous config saved to /var/cache/conftool/dbconfig/20240410-114001-root.json [11:41:14] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1842/console" [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis) [11:42:01] (Abandoned) Clément Goubert: zotero: Turn on monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert) [11:42:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1421.eqiad.wmnet with OS bullseye [11:42:21] (CR) Muehlenhoff: Update third-party/matomo repository definition (1 
comment) [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis) [11:42:42] (PS2) Btullis: Update third-party/matomo repository definition [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) [11:44:15] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1843/console" [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis) [11:45:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1493.eqiad.wmnet with OS bullseye [11:47:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P60224 and previous config saved to /var/cache/conftool/dbconfig/20240410-114728-arnaudb.json [11:48:51] SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9703513 (MoritzMuehlenhoff) [11:49:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1422.eqiad.wmnet with OS bullseye [11:51:20] (CR) Hnowlan: [C:+1] shellbox: add PHP + Apache timeout settings [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková) [11:53:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1491.eqiad.wmnet with OS bullseye [11:54:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1492.eqiad.wmnet with OS bullseye [11:55:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60225 and previous config saved to /var/cache/conftool/dbconfig/20240410-115506-root.json [12:01:31] !log 
Running homer 'cr*eqiad*' commit 'T351074' and homer 'lsw1-e3-eqiad*' commit 'T351074' [12:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:39] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:02:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P60226 and previous config saved to /var/cache/conftool/dbconfig/20240410-120235-arnaudb.json [12:02:59] (03PS3) 10Btullis: Update third-party/matomo repository definition [puppet] - 10https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) [12:04:22] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1002.wikimedia.org with OS bookworm [12:04:40] (03CR) 10Btullis: Update third-party/matomo repository definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [12:05:52] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1844/console" [puppet] - 10https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [12:08:49] (03CR) 10JMeybohm: [C:03+1] "I'd say this looks reasonable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [12:09:15] 06SRE, 10Citoid, 06serviceops, 13Patch-For-Review: 14Create a readiness probe for zotero - 14https://phabricator.wikimedia.org/T213689#9703551 (10Mvolz) 14 >>! In T213689#9703471, @Clement_Goubert wrote: > Summing up the discussion on the patch set, this is not what is wanted, turning monitoring on... 
[12:09:16] jouncebot: now [12:09:16] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [12:10:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60227 and previous config saved to /var/cache/conftool/dbconfig/20240410-121012-root.json [12:11:53] !log Pooling and uncordoning mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074 [12:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:12:09] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1421.eqiad.wmnet|mw1422.eqiad.wmnet|mw1491.eqiad.wmnet|mw1492.eqiad.wmnet|mw1493.eqiad.wmnet),cluster=kubernetes,service=kubesvc [12:14:58] !log lucaswerkmeister-wmde@deploy1002 ~ $ mwscript-k8s extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P4496 --new-data-type external-id --summary '[[phabricator:T359297|T359297]]' # failed, will retry with non-k8s mwscript [12:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:07] T359297: Change Property datatypes from String to External Identifier for NACE code rev.2 (P4496) - https://phabricator.wikimedia.org/T359297 [12:15:36] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P4496 --new-data-type external-id --summary '[[phabricator:T359297|T359297]]' # succeeded [12:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T360332)', diff saved to https://phabricator.wikimedia.org/P60228 and previous config saved to 
/var/cache/conftool/dbconfig/20240410-121743-arnaudb.json [12:17:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [12:17:49] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:18:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [12:18:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:18:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:18:12] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1002.wikimedia.org with reason: host reimage [12:18:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T360332)', diff saved to https://phabricator.wikimedia.org/P60229 and previous config saved to /var/cache/conftool/dbconfig/20240410-121814-arnaudb.json [12:18:52] 06SRE, 10Citoid, 06serviceops, 13Patch-For-Review: 14Create a readiness probe for zotero - 14https://phabricator.wikimedia.org/T213689#9703583 (10Clement_Goubert) 14>>! In T213689#9703551, @Mvolz wrote: > Thanks for linking the actual current Zotero probe - I see it checks the export endpoint? Where c... 
[12:19:42] (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1018710 [12:20:12] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1002.wikimedia.org with reason: host reimage [12:20:16] (CR) Peter Fischer: [C:+2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1018710 (owner: Peter Fischer) [12:21:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T360332)', diff saved to https://phabricator.wikimedia.org/P60230 and previous config saved to /var/cache/conftool/dbconfig/20240410-122104-arnaudb.json [12:21:09] (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1018710 (owner: Peter Fischer) [12:25:00] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:25:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60231 and previous config saved to /var/cache/conftool/dbconfig/20240410-122518-root.json [12:25:43] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:26:34] (CR) Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur) [12:27:11] (CR) Filippo Giunchedi: [C:+2] opensearch: set vhost and issuer url for dashboards sso test [puppet] - https://gerrit.wikimedia.org/r/1018667 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi) [12:30:09] (CR) Filippo Giunchedi: "FWIW for the service-wide checks you can add a probe of type: swagger in service::catalog (see wikifeeds for example)" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 
(https://phabricator.wikimedia.org/T213689) (owner: 10Clément Goubert) [12:31:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T356166)', diff saved to https://phabricator.wikimedia.org/P60232 and previous config saved to /var/cache/conftool/dbconfig/20240410-123130-marostegui.json [12:35:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1018635 (owner: 10Slyngshede) [12:36:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [12:36:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P60233 and previous config saved to /var/cache/conftool/dbconfig/20240410-123612-arnaudb.json [12:37:53] (03PS3) 10Fabfur: prometheus: add aggregate metrics for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) [12:38:26] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1002.wikimedia.org with OS bookworm [12:38:35] (03CR) 10Fabfur: "Thanks for the comments!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [12:44:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update for latest VMs - jmm@cumin2002" [12:45:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update for latest VMs - jmm@cumin2002" [12:46:03] (03PS5) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) [12:46:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P60234 and previous config saved to /var/cache/conftool/dbconfig/20240410-124638-marostegui.json [12:48:21] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp-test2003.wikimedia.org [12:49:16] (03CR) 10Muehlenhoff: [C:03+2] Uninstall eject on production VMs [puppet] - 10https://gerrit.wikimedia.org/r/1017275 (owner: 10Muehlenhoff) [12:49:29] (03CR) 10Elukey: [V:03+1 C:03+2] cassandra::instance: fix PKI keystore for each instance [puppet] - 10https://gerrit.wikimedia.org/r/1018311 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:50:18] elukey: ok to merge your patch along? [12:50:23] moritzm: +1 thanks! 
[12:51:18] ack, merged now [12:51:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P60235 and previous config saved to /var/cache/conftool/dbconfig/20240410-125119-arnaudb.json [12:51:38] (03PS3) 10Elukey: Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) [12:52:59] (03CR) 10CI reject: [V:04-1] Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [12:53:17] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [12:56:20] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [12:56:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=authdns-update [12:59:04] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [12:59:04] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:04] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2003.wikimedia.org [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:01:31] (03CR) 10Filippo Giunchedi: "There's a fix to make, rest LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [13:01:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P60236 and previous config saved to /var/cache/conftool/dbconfig/20240410-130145-marostegui.json [13:02:15] !log volans@cumin2002 START - Cookbook sre.dns.netbox [13:04:15] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test removing dns entry - volans@cumin2002" [13:05:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update [13:05:07] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test removing dns entry - volans@cumin2002" [13:05:08] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:06:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T360332)', diff saved to https://phabricator.wikimedia.org/P60237 and previous config saved to /var/cache/conftool/dbconfig/20240410-130626-arnaudb.json [13:06:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:06:42] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:06:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:06:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be) [13:06:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P60238 and previous config saved to /var/cache/conftool/dbconfig/20240410-130650-arnaudb.json [13:06:58] !log depool cp4052 for PXE boot issue testing [13:07:00] !log volans@cumin2002 START - Cookbook sre.dns.netbox [13:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:55] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [13:08:03] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9703728 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp4052.ulsfo.wmnet with OS b... [13:08:08] (03PS1) 10Elukey: services: update the rec-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018717 (https://phabricator.wikimedia.org/T205870) [13:09:06] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test restoring dns entry - volans@cumin2002" [13:09:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T360332)', diff saved to https://phabricator.wikimedia.org/P60239 and previous config saved to /var/cache/conftool/dbconfig/20240410-130940-arnaudb.json [13:09:55] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test restoring dns entry - volans@cumin2002" [13:09:56] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:10:39] (03CR) 10Volans: [C:03+2] DNS-related cookbooks: adapt for conftool state [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [13:14:55] (03Merged) 10jenkins-bot: DNS-related cookbooks: adapt for conftool state [cookbooks] - 
10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [13:16:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1112.eqiad.wmnet,service=(cdn|ats-be) [13:16:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T356166)', diff saved to https://phabricator.wikimedia.org/P60240 and previous config saved to /var/cache/conftool/dbconfig/20240410-131653-marostegui.json [13:16:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [13:16:58] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [13:17:00] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [13:17:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: 14Q1:Install cp11[00-15] and rotate into production - 14https://phabricator.wikimedia.org/T349244#9703743 (10ops-monitoring-bot) 14Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye [13:17:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [13:17:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T356166)', diff saved to https://phabricator.wikimedia.org/P60241 and previous config saved to /var/cache/conftool/dbconfig/20240410-131716-marostegui.json [13:19:22] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9703751 (10MoritzMuehlenhoff) >>! In T360414#9702570, @andrea.denisse wrote: > I've documented the migration process on Wikitech: https:/... 
[13:19:41] (03PS4) 10Elukey: ml-services: force HTTP in revert-risk agnostic staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/984215 (https://phabricator.wikimedia.org/T353622) [13:24:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P60242 and previous config saved to /var/cache/conftool/dbconfig/20240410-132447-arnaudb.json [13:24:57] (03CR) 10Eevans: [C:03+2] sessionstore configure TLS verification in staging for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [13:26:07] (03Merged) 10jenkins-bot: sessionstore configure TLS verification in staging for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [13:26:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:26:46] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye [13:26:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: 14Q1:Install cp11[00-15] and rotate into production - 14https://phabricator.wikimedia.org/T349244#9703759 (10ops-monitoring-bot) 14Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors:... 
[13:27:12] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [13:27:19] ops-eqiad, SRE, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244#9703762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye [13:28:21] SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9703774 (andrea.denisse) >>! In T360414#9703751, @MoritzMuehlenhoff wrote: >>>! In T360414#9702570, @andrea.denisse wrote: >> I've docu... [13:28:34] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9703767 (Volans) a:Volans→None De-assigning it from me as B... [13:28:40] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [13:28:49] (CR) Muehlenhoff: [C:+2] Remove now obsolete certificate [puppet] - https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff) [13:29:00] (PS2) Muehlenhoff: Remove now obsolete certificate [puppet] - https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) [13:29:41] (PS1) Volans: sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) [13:30:16] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [13:30:40] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [13:30:53] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525 [13:30:57] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), and 2 others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9703798 (10bking) a:03RKemper [13:30:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525 [13:30:58] T361525: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525 [13:31:17] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9703808 (10bking) Sorry for the noise, I've just downtimed this host. 
[13:31:33] (CR) Krinkle: logging: default to log any error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar) [13:31:38] (PS4) Fabfur: prometheus: add aggregate metrics for benthos [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) [13:31:55] (CR) Alexandros Kosiaris: shellbox: add PHP + Apache timeout settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková) [13:32:00] (CR) Alexandros Kosiaris: [C:-1] shellbox: add PHP + Apache timeout settings [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková) [13:32:09] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), and 2 others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9703819 (bking) Assigning to @RKemper /adding DPE SRE tags. 
[13:32:27] (CR) Fabfur: prometheus: add aggregate metrics for benthos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur) [13:33:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [13:35:16] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur) [13:36:19] (CR) Bking: [C:+1] search: Wait for young pool alert to fail for 5 minutes [alerts] - https://gerrit.wikimedia.org/r/1013575 (owner: Ebernhardson) [13:38:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [13:39:18] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [13:39:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P60243 and previous config saved to /var/cache/conftool/dbconfig/20240410-133955-arnaudb.json [13:41:08] (CR) Muehlenhoff: [V:+2] Remove now obsolete certificate [puppet] - https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff) [13:42:42] (CR) Muehlenhoff: [C:+2] configmaster: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1004132 (owner: Muehlenhoff) [13:43:33] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [13:46:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: 
host reimage [13:46:16] (03PS1) 10Clément Goubert: kubernetes: Move 7 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1018719 (https://phabricator.wikimedia.org/T351074) [13:47:44] !log installing unbound security updates [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:08] !log Delete unused Prometheus TLS certificates - T360414 [13:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:13] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [13:53:07] (03CR) 10Herron: [C:03+1] titan: trim 5m retention to 3y + 2w [puppet] - 10https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [13:54:54] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye [13:55:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T360332)', diff saved to https://phabricator.wikimedia.org/P60244 and previous config saved to /var/cache/conftool/dbconfig/20240410-135502-arnaudb.json [13:55:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [13:55:05] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9703921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp4052.ulsfo.wmnet with OS bulls... 
[13:55:09] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:55:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [13:55:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60245 and previous config saved to /var/cache/conftool/dbconfig/20240410-135525-arnaudb.json [13:58:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60246 and previous config saved to /var/cache/conftool/dbconfig/20240410-135814-arnaudb.json [13:58:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be) [13:59:10] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763) [13:59:42] (03PS1) 10Elukey: kask: allow to configure tls options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1400) [14:00:25] (03CR) 10CI reject: [V:04-1] kask: allow to configure tls options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:00:56] (03PS1) 10Clément Goubert: trafficserver: move 70% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1018723 (https://phabricator.wikimedia.org/T360763) [14:01:12] (03CR) 10Alexandros Kosiaris: Remove parsoid-php certificates from mw deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros 
Kosiaris) [14:03:07] (03CR) 10Fabfur: [C:03+2] prometheus: add aggregate metrics for benthos [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [14:03:33] (03PS1) 10Andrea Denisse: ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) [14:07:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS bullseye [14:08:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: 14Q1:Install cp11[00-15] and rotate into production - 14https://phabricator.wikimedia.org/T349244#9703948 (10ops-monitoring-bot) 14Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (... [14:13:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P60248 and previous config saved to /var/cache/conftool/dbconfig/20240410-141322-arnaudb.json [14:15:57] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:33] ugh [14:16:42] acked [14:16:48] !incidents [14:16:49] 4577 (ACKED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [14:16:49] o/ [14:16:49] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [14:17:00] So it's been exhausting workers more or less steadily since this morning [14:17:01] claime: related to any WIP? 
[14:17:09] https://grafana.wikimedia.org/goto/i70n34aSg?orgId=1 [14:17:13] Not that I know of [14:17:20] ok [14:17:27] let me check upload log on commons [14:17:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1112.eqiad.wmnet,service=(cdn|ats-be) [14:17:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T356166)', diff saved to https://phabricator.wikimedia.org/P60249 and previous config saved to /var/cache/conftool/dbconfig/20240410-141742-marostegui.json [14:17:46] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [14:17:58] claime: your link doesn't work [14:18:10] redirects to the rw-grafana home [14:18:15] volans: because i'm logged in probably, great [14:18:23] logging in [14:18:47] nope, same... [14:19:10] https://grafana.wikimedia.org/goto/3XnvqV-IR?orgId=1 [14:19:18] thx [14:20:05] I saw this, but it looks far from massive: https://commons.wikimedia.org/wiki/Special:Log?type=upload&user=Trade&page=&wpdate=&tagfilter=&wpfilters%5B%5D=newusers&wpFormIdentifier=logeventslist [14:20:53] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4052.ulsfo.wmnet [14:20:56] claime: are the docs still valid in the k8s world? 
[14:20:57] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:10] volans: This is not k8s [14:21:11] I see mnuch more hosts in codfw than equiad [14:21:13] I don't think we can get gameplays on commons, but not a current concern [14:21:24] This is the only remnants of bare metal for jobs, videoscalers [14:21:27] also different weights [14:21:36] for that matter [14:21:51] do you metrics of # of current enqueed or pending jobs? [14:21:55] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4052.ulsfo.wmnet [14:22:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:50] volans: Yes, because in codfw the hosts have different CPUs, so they are weighted differently [14:22:51] !incidents [14:22:51] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [14:22:51] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [14:22:55] Shouldn't be the case in eqiad [14:22:57] ok [14:23:13] In any case, all transcodes are done the primary DC [14:23:56] but as far as workers go, you're right, we have a big imbalance between codfw and eqiad [14:23:57] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:22] We may have gone a bit too fast in reimaging jobrunners in eqiad [14:24:32] see lso 
https://grafana.wikimedia.org/goto/RyCxq4aSg?orgId=1 [14:24:39] acked [14:25:09] the cluster is totally CPU-bound [14:25:16] (03CR) 10Jforrester: Implementing security.txt standard (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [14:25:21] I believe in the past there was some bad balancing on transcoding jobs, or at least I remember mentions of it when there were mass video uploads [14:25:25] those are ominous graphs [14:25:34] it doesn't correlate to an increase in jobs [14:25:37] That's what I don't like [14:26:01] https://grafana.wikimedia.org/goto/_rPL34-IR?orgId=1 [14:26:05] Even the prioritized ones [14:26:09] filesystem usage went from 10% to 40% and growing [14:26:17] for / [14:26:31] sigh [14:26:34] * akosiaris around [14:26:37] so something weird is happening [14:26:46] that uses a lot of disk on those hosts [14:26:56] very large videos to encode? [14:27:20] or shellbox not cleaning up after itself [14:27:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:43] w1437:~$ pgrep -f /usr/bin/ffmpeg |wc -l [14:27:44] 432 [14:27:45] wow [14:27:49] disk IOs are not crazy high [14:28:09] do you have a wiki? 
I see nothing on commons [14:28:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P60250 and previous config saved to /var/cache/conftool/dbconfig/20240410-142829-arnaudb.json [14:28:57] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:39] akosiaris: did you kill processes? [14:29:41] !incidents [14:29:42] 4578 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [14:29:42] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [14:29:42] It says 72 now [14:29:42] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [14:29:44] no [14:29:56] Ah no [14:29:58] -f [14:30:06] still 432 for me [14:30:23] yeo [14:30:24] $ pgrep -cf /usr/bin/ffmpeg [14:30:24] 414 [14:30:31] I'm on another host [14:31:07] (03CR) 10Btullis: [V:03+1 C:03+2] Update third-party/matomo repository definition [puppet] - 10https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [14:31:29] is there a simple log to tail to check what they are encoding? [14:31:33] the ffmpeg I mean [14:31:57] not that I can remember [14:32:03] sigh [14:32:35] is it possible is the same set of videos over and over that maybe fails and gets re-enqued? 
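A rough sketch of how the ffmpeg process count above could be broken down by input file, to test the guess that the same video keeps failing and being re-enqueued. It assumes ffmpeg is started with a `-i <input>` argument and that the lines come from `ps -eo pid,etimes,args`; the function name `count_ffmpeg_inputs` and the sample paths are hypothetical and do not come from the hosts.

```python
import shlex
from collections import Counter


def count_ffmpeg_inputs(ps_lines):
    """Count ffmpeg processes per input file.

    Each line is expected to look like `ps -eo pid,etimes,args` output:
    "<pid> <elapsed seconds> <command line>".
    Lines that are not ffmpeg, or that carry no `-i` argument, are skipped.
    """
    counts = Counter()
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        try:
            argv = shlex.split(parts[2])
        except ValueError:
            # Unbalanced quotes in a command line: ignore that process.
            continue
        if not argv or not argv[0].endswith("ffmpeg"):
            continue
        if "-i" in argv:
            idx = argv.index("-i")
            if idx + 1 < len(argv):
                counts[argv[idx + 1]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample, standing in for real `ps` output.
    sample = [
        "101 3600 /usr/bin/ffmpeg -y -i /tmp/a.webm -vcodec vp9 /tmp/out1.webm",
        "102 120 /usr/bin/ffmpeg -i /tmp/a.webm /tmp/out2.webm",
        "103 5 /usr/bin/ffmpeg -i /tmp/b.webm /tmp/out3.webm",
    ]
    for path, n in count_ffmpeg_inputs(sample).most_common():
        print(n, path)
```

If a handful of inputs account for most of the processes, that would support the re-enqueue theory; an even spread would point elsewhere.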
[14:32:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P60251 and previous config saved to /var/cache/conftool/dbconfig/20240410-143249-marostegui.json [14:33:02] this started around 8am [14:33:30] ah, then let me search earlier [14:33:40] at 7:30 there was a sync file [14:33:52] at 10 scp [14:33:54] *scap [14:34:06] yeah first one was me deploying a helm chart change for /docs/ [14:34:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:34:17] that's the one more aligned :D [14:34:21] and the next one was eff.ie for adding the mcrouter env var [14:34:22] although seems unrelated [14:34:28] https://logstash.wikimedia.org/goto/3d82a32f7c64af5adedc10122bc5c5f1 [14:34:46] (03PS1) 10Majavah: alertmanager: karma: Set group too [puppet] - 10https://gerrit.wikimedia.org/r/1018727 [14:34:57] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:58] It's not that many events [14:35:07] although it's only the errors [14:35:08] acked [14:35:45] we started having webVideoTranscode job execution errors around 3:00 in the morning actually [14:36:25] (03CR) 10Andrew Bogott: [C:03+1] alertmanager: karma: Set group too [puppet] - 10https://gerrit.wikimedia.org/r/1018727 (owner: 10Majavah) [14:36:50] logstash showing errors like "estimated file size 2784886 KiB over soft limit 2097152 KiB" [14:36:52] claime, akosiaris: do we need to create an incident and start calling people? 
I can do IC [14:37:22] estimated file size 8755934 KiB over hard limit 3145728 KiB [14:37:30] for a random failure [14:37:33] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:44] volans: yeah, sure, makes sense by now [14:37:54] ok opening doc [14:37:56] 06SRE, 10Observability-Alerting: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838#9704024 (10lmata) [14:38:01] * volans becomes IC [14:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:55] it's not even a lot of requests per apache logs [14:39:00] just some weird video? [14:39:17] that's my guess [14:39:22] but no hard evidence [14:39:36] if it is just 1 video, maybe jobs can be killed and later restarted for mitigation? [14:39:55] just in case should we reduce the concurrency of video transcoding jobs? [14:40:20] there isn't btw any serious impact to anything (aside from the one on oncallers getting pages). Jobs haven't been on the videoscalers for some time now [14:40:43] doc is https://docs.google.com/document/d/1k9eYWPpY8QsKfLpLgYXsLPaHTN4y1m5serPOv8H5Gd8/edit#heading=h.95p2g5d67t9q [14:41:01] hnowlan: I suppose it won't hurt [14:41:04] wanna do that? 
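For scale, the limits quoted in the logstash errors above work out to 2 GiB (soft) and 3 GiB (hard), and the two estimated sizes to about 2.66 GiB and 8.35 GiB. A small sketch of that arithmetic follows; the comparison logic is an assumption for illustration only, not TimedMediaHandler's actual implementation, and the constant and function names are made up here.

```python
# Limits as they appear in the log messages (values in KiB).
SOFT_LIMIT_KIB = 2097152  # 2 GiB
HARD_LIMIT_KIB = 3145728  # 3 GiB


def kib_to_gib(kib):
    """Convert KiB to GiB (1 GiB = 1024 * 1024 KiB)."""
    return kib / (1024 * 1024)


def classify(estimated_kib):
    """Hypothetical check mirroring the wording of the error messages."""
    if estimated_kib > HARD_LIMIT_KIB:
        return "over hard limit"
    if estimated_kib > SOFT_LIMIT_KIB:
        return "over soft limit"
    return "within limits"


if __name__ == "__main__":
    for size in (2784886, 8755934):
        print(f"{size} KiB = {kib_to_gib(size):.2f} GiB: {classify(size)}")
```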
[14:41:27] The errors are from TMH [14:41:30] this was a video uploaded after 3am and transcoding seems stuck since then: https://commons.wikimedia.org/w/index.php?title=File:Key_Bridge_Response_Photos_(240401-G-TL908-2303).webm&action=history [14:41:44] akosiaris: sure [14:42:08] afaict there are a few filenames that show up repeatedly fwiw, but not over long periods of time [14:42:31] it was renamed after being uploaded [14:42:34] akosiaris: do you think that throwing more hosts at the cluster would help? [14:43:27] (03PS1) 10Hnowlan: jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018729 [14:43:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60252 and previous config saved to /var/cache/conftool/dbconfig/20240410-144336-arnaudb.json [14:43:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [14:43:42] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:43:46] volans: sure, but in fact on the legacy infra we 've never done that IIRC [14:43:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [14:44:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T360332)', diff saved to https://phabricator.wikimedia.org/P60253 and previous config saved to /var/cache/conftool/dbconfig/20240410-144400-arnaudb.json [14:44:02] it's video transcoding, it can be delayed quite a bit, no harm done [14:44:15] ok [14:44:22] we used to separate the 2 clusters (jobrunners vs videoscalers) functionally when we had incidents like these [14:44:39] making sure that video transcoding wouldn't consume resources meant for jobs [14:44:44] Now the only resources we have for videoscaling are those hosts 
[14:44:52] but in this case, we no longer have jobs on that cluster [14:45:03] Or we repurpose some appservers quickly and throw them at it [14:45:04] ack [14:45:21] the more I think about it, the more I start to wonder whether ACKing it for say 20 hours is ok. [14:45:46] fwiw up until now, 4 was *plenty* for videoscaling [14:45:59] 06SRE, 10Observability-Logging, 10SRE Observability (FY2023/2024-Q4): Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998#9704059 (10fgiunchedi) [14:46:06] now, if I could isolate whatever file is causing the issue and just move it to the back of the queue, it would be better [14:46:07] I believe something like that was done last time, akosiaris, and someone told us "not to worry" [14:46:13] concurrency reduction: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018729 [14:46:27] jynus: yeah it's not the first time. It's like the nth occurrence [14:46:40] (03CR) 10Alexandros Kosiaris: [C:03+1] jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018729 (owner: 10Hnowlan) [14:46:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T360332)', diff saved to https://phabricator.wikimedia.org/P60254 and previous config saved to /var/cache/conftool/dbconfig/20240410-144644-arnaudb.json [14:46:59] the queue itself doesn't seem to have spiked in a notable fashion either [14:47:23] akosiaris: please do, my bet is on that boat video, but I have 0 proof [14:47:26] steady at .05 jobs/s or so [14:47:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:38] actually let's do something. 
Let's get that patch ^ deployed (I +1ed already) and I 'll kill ffmpegs on 1 host and increase their weight [14:47:44] so I think we need to identify if we got a bunch of weird videos that create problems [14:47:51] or the code started to have issues [14:47:55] (03CR) 10Hnowlan: [C:03+2] jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018729 (owner: 10Hnowlan) [14:47:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P60255 and previous config saved to /var/cache/conftool/dbconfig/20240410-144757-marostegui.json [14:47:58] or the infra started to have issues [14:48:08] volans: bunch? I am willing to bet it's one [14:48:17] jynus: how did you identify that one? [14:48:23] volans: my suspicion is on that upload + rename [14:48:32] akosiaris: how can one affect all hosts at the same time? [14:48:32] akosiaris: first video stuck after 3am [14:48:42] (03Merged) 10jenkins-bot: jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018729 (owner: 10Hnowlan) [14:49:01] checked commons on special new files and filtered by video [14:49:06] let me get you an url [14:49:17] akosiaris: https://commons.wikimedia.org/wiki/Special:NewFiles?user=&showbots=1&mediatype%5B%5D=VIDEO&start=&end=&wpFormIdentifier=specialnewimages&limit=500&offset= [14:49:55] the rename makes it specially suspicious (not the first time a rename breaks things due to mw bug) [14:50:01] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:50:29] but please check if you have a way to compare it with log or processes [14:50:42] at the moment it is just a guess [14:50:50] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:51:14] !log hnowlan@deploy1002 helmfile [codfw] START 
helmfile.d/services/changeprop-jobqueue: apply [14:51:15] it's not even a really big video. 435MB ? [14:51:24] it doesn't match the size logs, right [14:51:35] cgoubert@mw1437:/var/log$ ps aux | grep 37888 < oldest pid I could find [14:51:49] Wed Apr 10 06:41:55 [14:51:58] however you got a point that it hasn't managed to get transcoded [14:51:59] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:52:03] whereas the next one has [14:52:12] so it could be just a synthom, not a cause [14:52:19] url? [14:52:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:35] cwhite: you around? [14:53:03] jynus: harder to find x) [14:53:11] !incidents [14:53:12] 4579 (ACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [14:53:12] 4578 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [14:53:12] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [14:53:12] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [14:53:14] ok, hnowlan is done, I 'll kill ffmpegs in the 1st host [14:53:19] and adjust weight [14:53:26] ack [14:53:27] that should hopefully patch the bleeding [14:53:40] soft limited jobs can be manually retried via the mw UI btw [14:53:49] if that's an issue, but I don't think it is [14:54:02] but just a til https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/master/includes/WebVideoTranscode/WebVideoTranscodeJob.php#L652 [14:54:50] !log akosiaris@cumin1002 conftool action : set/weight=30; selector: name=mw1437.*.wmnet,dc=eqiad [14:54:57] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:47] volans: in a meeting, but yes [14:55:48] !log kill all ffmpegs on mw1437 and increase weight of mw1437 from 10 to 30 to direct most queries to it while the other 3 videoscalers serve the backlog [14:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:59] cwhite: page >>> meeting ;) [14:56:49] akosiaris: this is assuming the issue doesn't get re-enqueued, correct? [14:57:18] yes [14:57:25] we are at 63 ffmpeg right now [14:57:30] on mw1437 [14:57:36] so, already more than the CPUs [14:57:44] but it's not rising exponentially or anything [14:58:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [14:58:38] !log installing debian-archive-keyring updates on buster [14:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:10] can't the "resolved" and "firing" be the first thing in those messages ^ and in caps? [14:59:28] it would make my IRC life a tad easier [14:59:45] indeed [14:59:46] +1 [15:00:23] 10ops-eqiad, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9704116 (10VRiley-WMF) That works for me. I'll be there to assist with it. Thank you! 
[15:01:36] added a couple of action items to the doc [15:01:41] including the above [15:01:45] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqiad%20prometheus%2Fk8s&from=now-3d&to=now&var-job=webVideoTranscodePrioritized you can see the impact increasing from 00:00 on the 9th [15:01:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P60256 and previous config saved to /var/cache/conftool/dbconfig/20240410-150152-arnaudb.json [15:02:23] processing rates up, backlog becomes more consistent (even if the time remains similar) [15:02:36] we should reduce concurrency on prioritised in light of that I'd say [15:03:01] (03PS6) 10Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025) [15:03:01] (03PS1) 10Majavah: hieradata: Update striker container to add staging env warning [puppet] - 10https://gerrit.wikimedia.org/r/1018730 (https://phabricator.wikimedia.org/T254598) [15:03:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T356166)', diff saved to https://phabricator.wikimedia.org/P60257 and previous config saved to /var/cache/conftool/dbconfig/20240410-150304-marostegui.json [15:03:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [15:03:14] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [15:03:20] akosiaris: re: firing and resolved first thing in the message. That's a good idea, I'll discuss this with o11y to see if we can modify it. 
[15:03:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [15:03:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T356166)', diff saved to https://phabricator.wikimedia.org/P60258 and previous config saved to /var/cache/conftool/dbconfig/20240410-150327-marostegui.json [15:03:41] (03PS1) 10Hnowlan: jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018731 [15:03:53] (03CR) 10Majavah: [C:03+2] hieradata: Update striker container to add staging env warning [puppet] - 10https://gerrit.wikimedia.org/r/1018730 (https://phabricator.wikimedia.org/T254598) (owner: 10Majavah) [15:04:56] concurrency reductions for the other job https://gerrit.wikimedia.org/r/101873 [15:07:13] hnowlan: missing a digit ;) [15:07:23] that's from 2013 :D [15:07:47] this one?: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018731 [15:08:07] hehhh yes [15:08:38] (03CR) 10Volans: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018731 (owner: 10Hnowlan) [15:08:42] what do you mean, don't we need portugese wikibooks for this problem [15:08:49] lol [15:08:55] (03PS2) 10Hnowlan: jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018731 [15:09:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:10:51] (03CR) 10Hnowlan: [C:03+2] jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018731 (owner: 10Hnowlan) [15:11:07] !incide [15:11:09] !incidents [15:11:10] 4579 (RESOLVED) [2x] 
ProbeDown sre (ip4 probes/service eqiad) [15:11:10] 4578 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [15:11:10] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [15:11:10] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [15:11:39] I've tried to put a summary of the actions take, but please adjust it if I misrepresented anything [15:11:56] (03Merged) 10jenkins-bot: jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018731 (owner: 10Hnowlan) [15:12:40] (03PS2) 10Jcrespo: mariadb: Migrate db2097 backups to db2197 [puppet] - 10https://gerrit.wikimedia.org/r/1018247 (https://phabricator.wikimedia.org/T360751) [15:13:35] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:14:00] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:14:01] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:14:34] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:17:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P60259 and previous config saved to /var/cache/conftool/dbconfig/20240410-151659-arnaudb.json [15:23:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:23:32] might be okay to call this one resolved or at least mitigated? 
[15:24:04] no entries in mediawiki-errors since the kills [15:24:08] +1 from me [15:24:23] (03PS2) 10BCornwall: ssl_ciphersuite: Reorder suite preferences [puppet] - 10https://gerrit.wikimedia.org/r/1018356 (https://phabricator.wikimedia.org/T362197) [15:25:24] RPS and 200 rate aren't quite back to normal but they're recovering [15:25:35] hnowlan: it works for me, as you want, the CPU is still fairly close to 100% [15:25:40] but I'll let you decide [15:25:54] let's keep an eye on it for a while longer [15:26:01] it's ~90% on 1437 [15:26:06] and stuck at 100% on the others [15:26:40] but yes it looks promising [15:26:43] (03CR) 10BCornwall: [V:03+2 C:03+2] ssl_ciphersuite: Reorder suite preferences [puppet] - 10https://gerrit.wikimedia.org/r/1018356 (https://phabricator.wikimedia.org/T362197) (owner: 10BCornwall) [15:26:51] (03CR) 10BCornwall: [V:03+1 C:03+2] ncredir: Set ssl_ciphersuite to strong [puppet] - 10https://gerrit.wikimedia.org/r/1018355 (https://phabricator.wikimedia.org/T362197) (owner: 10BCornwall) [15:27:26] urandom, cwhite: could either of you take over IC? I'll be off call in ~3 minutes [15:27:32] !incide [15:27:34] !incidents [15:27:34] 4579 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [15:27:35] 4578 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [15:27:35] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [15:27:35] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [15:27:47] I can take it [15:27:49] * volans isn't expecting to work [15:28:15] thanks [15:29:15] Is there a runbook we followed to reduce the load on the jobrunners? 
[15:29:56] No, not that I know of [15:30:08] ad-hoc work from a.kosiaris and h.nowlan [15:30:45] also the runbook is outdated, adding it to the doc [15:31:07] Thanks :) [15:32:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T360332)', diff saved to https://phabricator.wikimedia.org/P60260 and previous config saved to /var/cache/conftool/dbconfig/20240410-153207-arnaudb.json [15:32:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [15:32:12] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:32:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [15:32:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T360332)', diff saved to https://phabricator.wikimedia.org/P60261 and previous config saved to /var/cache/conftool/dbconfig/20240410-153229-arnaudb.json [15:33:28] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:06] (03PS2) 10Elukey: kask: allow to configure tls options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) [15:35:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T360332)', diff saved to https://phabricator.wikimedia.org/P60262 and previous config saved to /var/cache/conftool/dbconfig/20240410-153516-arnaudb.json [15:37:50] (03PS3) 10Elukey: kask: allow to configure tls options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) [15:43:05] (03PS1) 10BCornwall: Revert "ssl_ciphersuite: Reorder suite
preferences" [puppet] - 10https://gerrit.wikimedia.org/r/1018690 [15:46:07] (03CR) 10CI reject: [V:04-1] Revert "ssl_ciphersuite: Reorder suite preferences" [puppet] - 10https://gerrit.wikimedia.org/r/1018690 (owner: 10BCornwall) [15:46:18] (03CR) 10BCornwall: [V:03+2 C:03+2] "08:44 IRC +1" [puppet] - 10https://gerrit.wikimedia.org/r/1018690 (owner: 10BCornwall) [15:46:54] (03PS2) 10BCornwall: Revert "ssl_ciphersuite: Reorder suite preferences" [puppet] - 10https://gerrit.wikimedia.org/r/1018690 [15:48:28] (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:50:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P60264 and previous config saved to /var/cache/conftool/dbconfig/20240410-155024-arnaudb.json [15:52:50] (03CR) 10CI reject: [V:04-1] Create a new and profile for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [15:52:55] (03PS1) 10Daniel Kinzler: LogStash: log HtmlOutputRendererHelper channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) [15:53:53] (03PS5) 10Elukey: kask: allow to configure tls options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) [15:55:36] (03PS1) 10Btullis: Add dummy data for the new matomo service. [labs/private] - 10https://gerrit.wikimedia.org/r/1018739 (https://phabricator.wikimedia.org/T351552) [15:56:13] (03CR) 10Btullis: [V:03+2 C:03+2] Add dummy data for the new matomo service. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1018739 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [15:58:32] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [16:01:13] (03PS2) 10Btullis: Create a new and profile for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) [16:03:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [16:04:12] (03CR) 10Mmartorana: Implementing security.txt standard (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: 10Mmartorana) [16:04:21] (03CR) 10CI reject: [V:04-1] Create a new and profile for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [16:05:28] jouncebot: now [16:05:28] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [16:05:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P60265 and previous config saved to /var/cache/conftool/dbconfig/20240410-160531-arnaudb.json [16:05:54] I’ll try to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1007643 then if that’s alright :) [16:07:01] hm, or maybe not, there are uncommitted changes in `/src/deployment-charts` o_O [16:10:01] (03PS3) 10Lucas Werkmeister (WMDE): termbox: update to 2024-03-14-121904-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) [16:10:05] (03PS1) 10Hashar: TitleLibrary: Don't register external titles as dependencies [extensions/Scribunto] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) [16:10:25] (03CR) 10Eevans: kask: allow to configure tls options (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:10:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "I’ll deploy this now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) (owner: 10Lucas Werkmeister (WMDE)) [16:10:34] (03CR) 10Hashar: "I'll deploy it tomorrow unless someone does it tonight :)" [extensions/Scribunto] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) (owner: 10Hashar) [16:11:23] (03Merged) 10jenkins-bot: termbox: update to 2024-03-14-121904-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) (owner: 10Lucas Werkmeister (WMDE)) [16:12:01] (03CR) 10Elukey: kask: allow to configure tls options (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:12:14] !log uploaded etcd-mirror 0.0.11-1 to apt.wikimedia.org (T358636) [16:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:26] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 [16:13:22] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [16:13:45] (03CR) 10Eevans: [C:03+1] kask: allow to configure tls options (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: 
10Elukey) [16:14:01] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [16:14:04] (03CR) 10FNegri: [C:03+1] alertmanager: karma: Set group too [puppet] - 10https://gerrit.wikimedia.org/r/1018727 (owner: 10Majavah) [16:14:47] test wikidata termbox seems to work [16:14:57] staging too [16:15:05] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply [16:15:29] (03CR) 10Elukey: [C:03+2] kask: allow to configure tls options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:15:54] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [16:16:00] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply [16:16:48] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [16:16:49] hm, eqiad’s being a bit slower than codfw [16:16:52] ah, there it goes :) [16:17:51] real wikidata termbox also looking good [16:17:56] * Lucas_WMDE done [16:19:30] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [16:19:41] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [16:19:50] (03CR) 10Eevans: [C:03+1] Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:20:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T360332)', diff saved to https://phabricator.wikimedia.org/P60267 and previous config saved to /var/cache/conftool/dbconfig/20240410-162039-arnaudb.json [16:20:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance [16:20:44] T360332: Make the cupe_actor column nullable on WMF wikis -
https://phabricator.wikimedia.org/T360332 [16:20:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance [16:21:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T360332)', diff saved to https://phabricator.wikimedia.org/P60268 and previous config saved to /var/cache/conftool/dbconfig/20240410-162101-arnaudb.json [16:23:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T360332)', diff saved to https://phabricator.wikimedia.org/P60269 and previous config saved to /var/cache/conftool/dbconfig/20240410-162344-arnaudb.json [16:23:54] (03CR) 10Jcrespo: [C:03+2] mariadb: Migrate db2097 backups to db2197 [puppet] - 10https://gerrit.wikimedia.org/r/1018247 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [16:26:53] (03PS1) 10Eevans: echostore: configure TLS verification in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018742 (https://phabricator.wikimedia.org/T352647) [16:34:26] (03CR) 10Hashar: logging: default to log any error (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [16:36:18] (03PS3) 10Hashar: logging: default to log any error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [16:37:06] (03CR) 10Hashar: "Ideally we would have a CI job that diff the effective configuration between the proposed change and its parent commit :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [16:38:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P60270 and previous config saved to /var/cache/conftool/dbconfig/20240410-163851-arnaudb.json [16:39:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [labs/private] - 
10https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:39:34] (03CR) 10Hashar: [C:03+2] TitleLibrary: Don't register external titles as dependencies [extensions/Scribunto] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) (owner: 10Hashar) [16:40:14] (03CR) 10Dzahn: [C:03+1] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:41:02] I am backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1018691 which causes some log spam as part of this week train [16:41:10] should be on time for the next deployment window which has a comment about the deploy happening in the second half of the scheduled window [16:42:00] swfrench-wmf: ^ :) [16:42:12] I should be done in 30 minutes [16:42:40] (03PS1) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1018743 [16:43:23] hashar: ack, thank you! 
yes, I won't be starting until 17:30 or so [16:43:30] (03CR) 10Dzahn: "seems like this broke puppet https://phabricator.wikimedia.org/P60271" [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [16:44:08] (03CR) 10Dzahn: "could not parse expression: 1:87: parse error: unexpected "{"" [puppet] - 10https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [16:46:36] (03PS2) 10Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751) [16:46:37] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - 10https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) [16:50:05] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:50:22] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:50:29] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:50:35] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:52:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [16:54:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P60272 and previous config saved to /var/cache/conftool/dbconfig/20240410-165359-arnaudb.json [16:54:09] (03PS9) 10CDobbins: purged: add PKI cert handling [puppet] - 
10https://gerrit.wikimedia.org/r/1017321 [16:55:27] (03PS1) 10Andrea Denisse: prometheus: Ensure the Benthos metrics are correctly parsed [puppet] - 10https://gerrit.wikimedia.org/r/1018745 (https://phabricator.wikimedia.org/T361845) [16:55:39] (03CR) 10Jcrespo: "Another 2 hosts setup now:" [puppet] - 10https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: 10Jcrespo) [16:56:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [16:56:38] (03PS1) 10Fabfur: prometheus: fix typo in aggregate rules [puppet] - 10https://gerrit.wikimedia.org/r/1018746 (https://phabricator.wikimedia.org/T361845) [16:56:54] jobrunners are still looking quite spicy [16:57:23] (03CR) 10CI reject: [V:04-1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 (owner: 10CDobbins) [16:57:25] (03CR) 10Fabfur: [C:03+1] "thanks for fixing this!" [puppet] - 10https://gerrit.wikimedia.org/r/1018745 (https://phabricator.wikimedia.org/T361845) (owner: 10Andrea Denisse) [16:58:01] (03Abandoned) 10Fabfur: prometheus: fix typo in aggregate rules [puppet] - 10https://gerrit.wikimedia.org/r/1018746 (https://phabricator.wikimedia.org/T361845) (owner: 10Fabfur) [16:59:07] (03Merged) 10jenkins-bot: TitleLibrary: Don't register external titles as dependencies [extensions/Scribunto] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) (owner: 10Hashar) [16:59:27] (03PS3) 10Btullis: Create a new and profile for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) [17:00:05] swfrench-wmf: #bothumor Q:How do functions break up? 
A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1700). [17:00:45] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [17:00:58] holding until 17:30 UTC [17:01:34] :) [17:01:59] I propose killing ffmpegs that have been running for more than say 7 hours on videoscalers to free things up [17:02:27] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1018691|TitleLibrary: Don't register external titles as dependencies (T362222)]] [17:02:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:02:43] T362222: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. 
[Called from MediaWiki\Extension\Scribunto\Engines\LuaCommon\TitleLibrary::getContentInternal] - https://phabricator.wikimedia.org/T362222 [17:02:54] !log killing long-running videoscaler ffmpegs [17:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:06] you are on your own hnowlan :) [17:03:16] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:03:41] I don't know anything about the long tail of expected time to do a video transcoding [17:04:09] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:04:16] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:04:26] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:04:29] !log depool cp1115 for firmware downgrade for PXE boot testing: T350179 [17:04:32] (03CR) 10Andrea Denisse: [C:03+2] prometheus: Ensure the Benthos metrics are correctly parsed [puppet] - 10https://gerrit.wikimedia.org/r/1018745 (https://phabricator.wikimedia.org/T361845) (owner: 10Andrea Denisse) [17:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:07] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [17:05:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1115.eqiad.wmnet,service=(cdn|ats-be) [17:05:42] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1115.eqiad.wmnet [17:06:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1115.eqiad.wmnet [17:06:36] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1115.eqiad.wmnet [17:07:29] !log hashar@deploy1002 hashar: Backport for [[gerrit:1018691|TitleLibrary: Don't
register external titles as dependencies (T362222)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:07:32] !log hashar@deploy1002 hashar: Continuing with sync [17:07:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:44] T362222: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. [Called from MediaWiki\Extension\Scribunto\Engines\LuaCommon\TitleLibrary::getContentInternal] - https://phabricator.wikimedia.org/T362222 [17:07:49] (03PS10) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1017321 [17:09:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T360332)', diff saved to https://phabricator.wikimedia.org/P60274 and previous config saved to /var/cache/conftool/dbconfig/20240410-170907-arnaudb.json [17:09:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance [17:09:21] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:09:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance [17:09:26] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:09:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T360332)', diff saved to https://phabricator.wikimedia.org/P60275 and previous config saved to /var/cache/conftool/dbconfig/20240410-170930-arnaudb.json [17:10:19] pff the canaries are failing [17:10:41] (03PS1) 
10Dzahn: create wikipedia-sysop-pl.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) [17:10:54] (Avg. errors per 10 seconds: Before: 0.10, After: 4.00, Threshold: 1.01) [17:11:05] (03CR) 10Dzahn: "Amir, how about this alternative?" [dns] - 10https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [17:11:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:12:02] and that is not related [17:12:07] * hashar retry [17:12:12] retries [17:12:13] err [17:12:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T360332)', diff saved to https://phabricator.wikimedia.org/P60276 and previous config saved to /var/cache/conftool/dbconfig/20240410-171229-arnaudb.json [17:14:29] (03PS4) 10Btullis: Create a new and profile for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) [17:14:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1115.eqiad.wmnet [17:15:59] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [17:16:31] so it passed this time [17:16:39] and I am now waiting for kubernetes [17:21:20] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1018691|TitleLibrary: Don't register external titles as dependencies (T362222)]] (duration: 18m 53s) [17:21:28] swfrench-wmf: done! 
:) [17:21:36] T362222: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. [Called from MediaWiki\Extension\Scribunto\Engines\LuaCommon\TitleLibrary::getContentInternal] - https://phabricator.wikimedia.org/T362222 [17:22:01] (03CR) 10Btullis: [V:03+1 C:03+2] Create a new and profile for the new matomo server [puppet] - 10https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [17:23:26] hashar: ack - thank you! [17:25:39] (03PS1) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [17:27:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P60277 and previous config saved to /var/cache/conftool/dbconfig/20240410-172736-arnaudb.json [17:29:18] urandom, hnowlan: I think we're seeing jobrunner beginning to overload again. 
[17:29:46] yeah I just did a few changes to no avail [17:32:33] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:33:57] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:34:06] and there's the page [17:34:32] ya [17:34:46] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1115.eqiad.wmnet [17:35:11] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1115.eqiad.wmnet [17:35:26] looks like the last action taken was to kill all ffmpegs on mw1437 [17:36:29] hnowlan: what else did you try? [17:36:32] (03CR) 10Muehlenhoff: prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:37:21] I'm preparing to kill ffmpegs on mw1437 - any objections? 
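The idea raised in the channel of targeting only long-running ffmpeg processes (the "more than say 7 hours" proposal) rather than all jobs on a host can be sketched as a selection step. This is a minimal illustration only, assuming input in the shape of procps `ps -eo pid,etimes,comm` output; the names `parse_ps` and `long_running` are hypothetical, the sketch only lists candidate PIDs and kills nothing, and it is not the command that was actually run on the videoscalers.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    elapsed_s: int  # elapsed run time in seconds (the `etimes` column)
    comm: str       # command name (the `comm` column)


def parse_ps(output: str) -> List[Proc]:
    """Parse text shaped like `ps -eo pid,etimes,comm` output (assumed format)."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and any malformed lines.
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        procs.append(Proc(int(parts[0]), int(parts[1]), parts[2].strip()))
    return procs


def long_running(procs: List[Proc], comm: str, min_elapsed_s: int) -> List[int]:
    """Return PIDs of processes named `comm` that have run at least `min_elapsed_s`."""
    return [p.pid for p in procs if p.comm == comm and p.elapsed_s >= min_elapsed_s]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real host.
    sample = (
        "  PID ELAPSED COMMAND\n"
        "  101   30000 ffmpeg\n"
        "  102     100 ffmpeg\n"
        "  103   90000 php-fpm7.4\n"
    )
    print(long_running(parse_ps(sample), "ffmpeg", 7 * 3600))
```

Keeping selection separate from any kill step lets the candidate list be reviewed first, which fits the concern voiced in the channel that killing all jobs causes errors and retries.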
[17:37:22] !log restarting etcd-mirror on conf2005.codfw.wmnet for T358636 [17:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:28] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 [17:37:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:37:35] (03CR) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:37:55] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [17:37:57] cwhite: maybe hold [17:38:04] * cwhite holds [17:38:06] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9704892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS b... [17:38:20] as it stands we're getting slow performance, but killing all jobs will actually cause errors [17:38:24] they'll retry in most cases [17:38:38] killing problematic jobs might be a better route although it hasn't won out yet [17:38:57] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:18] I dunno though, it's late here and I don't have any good options [17:41:16] hnowlan: you said earlier you tried a few things, anything that warrants noting in the doc? 
[17:41:18] we killed all jobs on mw1437 earlier and now loads are back up to around the same [17:41:26] urandom: changed the concurrency of the jobs [17:41:44] after the initial change? from 10 to 5, and 5 to 3? [17:41:59] or is that what you were referring to [17:42:09] tried killing the longer running of the jobs (7h+) [17:42:16] that's what I'm referring to [17:42:20] ok [17:42:25] the bad trend started around 8:00 https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&from=1712734511374&to=1712770575924 [17:42:35] Should we try killing longer jobs? [17:42:40] without better knowledge of what files are causing it I dunno what to do [17:42:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P60278 and previous config saved to /var/cache/conftool/dbconfig/20240410-174244-arnaudb.json [17:42:46] (03PS2) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [17:43:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:44:04] I suspect there's a good reason we can't but I wonder whether we could pool the codfw videoscalers also [17:45:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:46:40] !log finished updating A:conf hosts to etcd-mirror 0.0.11-1 (T358636) [17:46:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:45] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 [17:47:53] hnowlan: AFAICT, codfw videoscalers are pooled? [17:48:10] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [17:48:16] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9704944 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bulls... [17:48:27] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [17:48:32] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9704945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS b... [17:48:56] cwhite: what are you basing that off? 
I'm just looking at discovery but I could be wrong [17:49:12] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:27] `confctl select 'dc=codfw,cluster=videoscaler' get` [17:51:37] cwhite: discovery only points to eqiad [17:51:49] service is active/passive [17:52:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:52:36] that makes sense - naming is hard [17:52:48] !incidents [17:52:48] 4581 (ACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [17:52:48] 4580 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [17:52:49] 4579 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [17:52:49] 4578 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [17:52:49] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad) [17:52:49] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged) [17:53:24] (03PS1) 10Btullis: Add missing file to the matomo profile [puppet] - 10https://gerrit.wikimedia.org/r/1018756 (https://phabricator.wikimedia.org/T351552) [17:54:12] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:56:14] What would the effect be if we made codfw videoscalers active? Would new jobs go there and any old ones simply finish in eqiad? 
[17:56:46] by old, I mean long-running jobs [17:57:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T360332)', diff saved to https://phabricator.wikimedia.org/P60279 and previous config saved to /var/cache/conftool/dbconfig/20240410-175752-arnaudb.json [17:57:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [17:58:00] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:58:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [17:58:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T360332)', diff saved to https://phabricator.wikimedia.org/P60280 and previous config saved to /var/cache/conftool/dbconfig/20240410-175816-arnaudb.json [17:58:59] (03CR) 10Btullis: [C:03+2] Add missing file to the matomo profile [puppet] - 10https://gerrit.wikimedia.org/r/1018756 (https://phabricator.wikimedia.org/T351552) (owner: 10Btullis) [17:59:12] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:27] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:40] cwhite: that would be my hope but I don't really know what the risks would be [17:59:43] for now [17:59:49] let's just drop the concurrency in the jobqueue to 1 for both [17:59:57] I don't really know whether that will fix it [18:00:05] hashar and jnuche: Deploy window Train log triage with CPT 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1800) [18:00:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:00:16] ^ I have done it earlier today [18:00:49] (03CR) 10Muehlenhoff: prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:01:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T360332)', diff saved to https://phabricator.wikimedia.org/P60281 and previous config saved to /var/cache/conftool/dbconfig/20240410-180111-arnaudb.json [18:01:51] hnowlan: I guess that will result in scaling getting ((very ) far) behind, but otherwise preserve the cluster/stop the paging? [18:03:14] is there anything that shows the backlog? 
[18:03:33] (03Abandoned) 10Dwisehaupt: Enable https with apache for community civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:03:44] down the bottom here https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqiad%20prometheus%2Fk8s&var-job=webVideoTranscodePrioritized&from=now-12h&to=now [18:05:12] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [18:05:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:05:32] perfect. [18:08:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [18:08:26] shall I do the concurrency change then [18:08:46] I'm about to submit a gerrit [18:08:57] you can if you want, or I can add you to review! 
[18:09:03] (03CR) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:09:12] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:22] urandom: please do [18:10:06] (03PS1) 10Jforrester: Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies [core] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018692 (https://phabricator.wikimedia.org/T362221) [18:10:52] (03PS1) 10Eevans: changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018759 [18:11:38] done ^^^ [18:12:31] (03CR) 10Cwhite: [C:03+1] changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018759 (owner: 10Eevans) [18:12:51] (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018759 (owner: 10Eevans) [18:13:07] (03CR) 10Eevans: [C:03+2] changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018759 (owner: 10Eevans) [18:13:58] (03Merged) 10jenkins-bot: changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018759 (owner: 10Eevans) [18:14:03] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018409 [18:15:02] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [18:15:04] (03CR) 10Muehlenhoff: prometheus: Ensure TLS certificates are provided by 
CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:15:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:15:33] (03CR) 10Herron: prometheus: Ensure TLS certificates are provided by CFSSL (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:15:43] I guess staging wasn't applied earlier [18:16:05] I assume it is OK to do so though? [18:16:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:16:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P60282 and previous config saved to /var/cache/conftool/dbconfig/20240410-181618-arnaudb.json [18:16:24] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [18:16:31] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [18:17:11] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [18:18:41] (03PS3) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [18:19:25] (03CR) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (033 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:19:40] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:51] acked that. [18:21:31] I wonder if we'll need to kill some more ffmpeg processes to create headroom [18:21:40] (again) [18:22:47] I'd let it sit a bit [18:24:12] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3071.esams.wmnet,service=(cdn|ats-be) [18:24:40] (03CR) 10Ssingh: [C:03+2] cp3071: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015973 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [18:26:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye [18:26:49] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9705040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bulls... 
[18:28:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp3071.esams.wmnet with OS bullseye [18:28:41] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3071.esams.wmnet with OS bullseye [18:30:09] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1115.eqiad.wmnet,service=(cdn|ats-be) [18:31:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P60283 and previous config saved to /var/cache/conftool/dbconfig/20240410-183126-arnaudb.json [18:32:30] (03CR) 10Eevans: [C:03+2] echostore: configure TLS verification in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018742 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:32:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:27] (03Merged) 10jenkins-bot: echostore: configure TLS verification in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018742 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:34:24] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [18:34:53] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [18:36:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:37:33] 
(ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:46] (03PS4) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [18:40:29] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9705080 (10ssingh) For `cp1115` that we tried today, I downgraded the BIOS, NIC and iDRAC firmwares, to match what we have in esams, whe... [18:41:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:46:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:46:30] (ProbeDown) firing: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:46:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T360332)', diff saved to https://phabricator.wikimedia.org/P60284 and previous config saved to 
/var/cache/conftool/dbconfig/20240410-184633-arnaudb.json [18:46:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance [18:46:38] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:46:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance [18:46:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T360332)', diff saved to https://phabricator.wikimedia.org/P60285 and previous config saved to /var/cache/conftool/dbconfig/20240410-184656-arnaudb.json [18:51:30] (ProbeDown) resolved: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:32] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3071.esams.wmnet with reason: host reimage [18:54:41] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3071.esams.wmnet with reason: host reimage [18:57:41] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1018749/1858/" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:58:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:03:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - 
https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:03:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T360332)', diff saved to https://phabricator.wikimedia.org/P60287 and previous config saved to /var/cache/conftool/dbconfig/20240410-190347-arnaudb.json [19:04:02] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:06:20] (03PS17) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [19:06:39] (03PS18) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [19:08:36] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:09:53] cwhite: looks like it is recovering a bit [19:10:16] s/recovering a bit/death-spiraling less/ [19:10:49] definitely seeming a bit better, more stable responses, less active workers [19:10:52] https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1 [19:11:18] ya [19:14:24] cluster is still under heavy load, but at least monitoring isn't complaining :) [19:18:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P60288 and previous config saved to /var/cache/conftool/dbconfig/20240410-191854-arnaudb.json [19:20:29] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3071.esams.wmnet with OS bullseye [19:20:45] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - 
https://phabricator.wikimedia.org/T360430#9705132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3071.esams.wmnet with OS bullseye completed: - cp3071 (**PASS**)... [19:24:06] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705133 (10ssingh) [19:24:52] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3071.esams.wmnet,service=(cdn|ats-be) [19:28:19] (03CR) 10Herron: "NOOPs on the prometheus pop hosts e.g. prometheus6002 seems off to me" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [19:34:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P60289 and previous config saved to /var/cache/conftool/dbconfig/20240410-193402-arnaudb.json [19:42:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers [19:42:54] (03CR) 10Bartosz Dziewoński: [C:03+1] LogStash: log HtmlOutputRendererHelper channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: 10Daniel Kinzler) [19:43:20] (03CR) 10Bartosz Dziewoński: [C:03+1] "I scheduled it for deployment: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T2000" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: 10Daniel Kinzler) [19:46:29] (03PS19) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 
(https://phabricator.wikimedia.org/T359213) [19:47:35] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:49:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T360332)', diff saved to https://phabricator.wikimedia.org/P60290 and previous config saved to /var/cache/conftool/dbconfig/20240410-194909-arnaudb.json [19:49:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:49:19] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:49:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:50:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:50:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:50:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:51:00] (03PS1) 10Andrew Bogott: nova-fullstack: switch test image from Bullseye to Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1018777 [19:51:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:51:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [19:52:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [19:52:17] (03CR) 10Andrew Bogott: [C:03+2] nova-fullstack: switch test image from Bullseye to Bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1018777 (owner: 10Andrew Bogott) [19:53:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:53:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:54:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [19:54:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [19:54:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2112 (T360332)', diff saved to https://phabricator.wikimedia.org/P60291 and previous config saved to /var/cache/conftool/dbconfig/20240410-195430-arnaudb.json [19:54:35] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:56:38] (03PS1) 10DCausse: cirrus-streaming-updater: swith to "failure-rate" retry strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018778 [19:57:09] (03PS20) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [19:57:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T360332)', diff saved to https://phabricator.wikimedia.org/P60292 and previous config saved to /var/cache/conftool/dbconfig/20240410-195730-arnaudb.json [19:58:15] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T2000). 
[20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] hi [20:00:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [20:00:19] o/ [20:00:22] i can deploy [20:00:29] just a trivial patch today, no way to really test it, so it can go out directly [20:00:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: 10Daniel Kinzler) [20:00:55] sounds good - will sync [20:01:05] thanks [20:01:12] np! [20:01:18] (03Merged) 10jenkins-bot: LogStash: log HtmlOutputRendererHelper channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: 10Daniel Kinzler) [20:01:48] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1018738|LogStash: log HtmlOutputRendererHelper channel (T356157)]] [20:02:04] T356157: Unable to fetch Parsoid HTML - https://phabricator.wikimedia.org/T356157 [20:04:23] !log cjming@deploy1002 cjming and daniel: Backport for [[gerrit:1018738|LogStash: log HtmlOutputRendererHelper channel (T356157)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:04:31] !log cjming@deploy1002 cjming and daniel: Continuing with sync [20:09:40] (03PS21) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [20:10:47] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 
10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [20:12:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P60293 and previous config saved to /var/cache/conftool/dbconfig/20240410-201237-arnaudb.json [20:15:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [20:15:39] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1018738|LogStash: log HtmlOutputRendererHelper channel (T356157)]] (duration: 13m 51s) [20:15:48] T356157: Unable to fetch Parsoid HTML - https://phabricator.wikimedia.org/T356157 [20:15:53] MatmaRex: should be live! [20:16:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [20:16:18] thanks cjming [20:16:28] yw! [20:16:28] hopefully we'll see some logs in that channel [20:16:35] 🤞 [20:16:58] i'm going to close the backport window cuz i gotta run to a mtg [20:17:38] !log end of UTC late backport window [20:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:10] (03PS4) 10Hashar: logging: default to log any error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [20:19:41] (03CR) 10Hashar: [C:04-1] "I found out we have some tests, I will look at adding a test covering the behavior." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [20:26:02] (03PS22) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [20:27:08] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [20:27:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P60294 and previous config saved to /var/cache/conftool/dbconfig/20240410-202745-arnaudb.json [20:29:40] (03PS1) 10Volans: quotereviewer: support tables with Qty field [software] - 10https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) [20:31:02] (03PS23) 10Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [20:31:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [20:32:09] (03CR) 10CI reject: [V:04-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - 10https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [20:35:21] (03CR) 10RobH: [C:03+2] quotereviewer: support tables with Qty field [software] - 10https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) (owner: 10Volans) [20:35:57] (03Merged) 10jenkins-bot: quotereviewer: support tables with Qty field [software] - 10https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) 
(owner: Volans)
[20:37:32] (CR) RobH: [C:+2] "recheck" [software] - https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) (owner: Volans)
[20:39:15] (PS24) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:40:21] (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[20:42:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T360332)', diff saved to https://phabricator.wikimedia.org/P60295 and previous config saved to /var/cache/conftool/dbconfig/20240410-204253-arnaudb.json
[20:42:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:42:58] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[20:43:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:43:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T360332)', diff saved to https://phabricator.wikimedia.org/P60296 and previous config saved to /var/cache/conftool/dbconfig/20240410-204316-arnaudb.json
[20:44:33] !incidents
[20:44:34] 4583 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:34] 4582 (RESOLVED) ProbeDown sre (ip4 probes/service eqiad)
[20:44:34] 4581 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:34] 4580 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:35] 4579 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:35] 4578 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[20:44:35] 4577 (RESOLVED) ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[20:44:35] 4576 (RESOLVED) db1152 (paged)/MariaDB read only x2 (paged)
[20:46:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T360332)', diff saved to https://phabricator.wikimedia.org/P60297 and previous config saved to /var/cache/conftool/dbconfig/20240410-204617-arnaudb.json
[20:59:37] (PS25) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T2100)
[21:00:46] (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[21:01:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P60298 and previous config saved to /var/cache/conftool/dbconfig/20240410-210125-arnaudb.json
[21:16:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P60300 and previous config saved to /var/cache/conftool/dbconfig/20240410-211632-arnaudb.json
[21:31:15] (PS1) EoghanGaffney: gitlab: Unquote rsync path with glob [puppet] - https://gerrit.wikimedia.org/r/1018796
[21:31:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T360332)', diff saved to https://phabricator.wikimedia.org/P60301 and previous config saved to /var/cache/conftool/dbconfig/20240410-213140-arnaudb.json
[21:31:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[21:31:48] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[21:31:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[21:32:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T360332)', diff saved to https://phabricator.wikimedia.org/P60302 and previous config saved to /var/cache/conftool/dbconfig/20240410-213203-arnaudb.json
[21:35:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T360332)', diff saved to https://phabricator.wikimedia.org/P60303 and previous config saved to /var/cache/conftool/dbconfig/20240410-213506-arnaudb.json
[21:40:33] (CR) Dzahn: [C:+1] gitlab: Unquote rsync path with glob [puppet] - https://gerrit.wikimedia.org/r/1018796 (owner: EoghanGaffney)
[21:40:42] (CR) EoghanGaffney: [C:+2] gitlab: Unquote rsync path with glob [puppet] - https://gerrit.wikimedia.org/r/1018796 (owner: EoghanGaffney)
[21:50:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P60304 and previous config saved to /var/cache/conftool/dbconfig/20240410-215014-arnaudb.json
[21:56:52] !log prometheus - recreating deleted TLS certs/keys in private repo
[21:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P60305 and previous config saved to /var/cache/conftool/dbconfig/20240410-220521-arnaudb.json
[22:13:56] (PS5) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[22:20:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T360332)', diff saved to https://phabricator.wikimedia.org/P60306 and previous config saved to /var/cache/conftool/dbconfig/20240410-222028-arnaudb.json
[22:20:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[22:20:35] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[22:20:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[22:21:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[22:21:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[22:21:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T360332)', diff saved to https://phabricator.wikimedia.org/P60307 and previous config saved to /var/cache/conftool/dbconfig/20240410-222150-arnaudb.json
[22:24:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T360332)', diff saved to https://phabricator.wikimedia.org/P60308 and previous config saved to /var/cache/conftool/dbconfig/20240410-222445-arnaudb.json
[22:31:55] (PS6) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[22:36:37] (CR) Krinkle: [C:+1] static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:37:47] (CR) Krinkle: [C:+1] static.php: Handle mediawiki.org/ontology/ontology.owl (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:39:38] (PS3) Ahmon Dancy: static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807)
[22:39:46] (CR) Ahmon Dancy: static.php: Handle mediawiki.org/ontology/ontology.owl (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:39:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P60309 and previous config saved to /var/cache/conftool/dbconfig/20240410-223953-arnaudb.json
[22:41:57] (CR) Krinkle: [C:+1] static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:49:40] (CR) Dzahn: [C:-1] "I don't see an "include ::profile::tlsproxy::envoy" in the prometheus::pop role. Also there is no "ensure: present" for "tlsproxy::envoy" " [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[22:55:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P60310 and previous config saved to /var/cache/conftool/dbconfig/20240410-225500-arnaudb.json
[22:56:18] (PS1) Andrea Denisse: prometheus: Ensure the Prometheus PoP role uses TLSProxy [puppet] - https://gerrit.wikimedia.org/r/1018802 (https://phabricator.wikimedia.org/T360414)
[23:10:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T360332)', diff saved to https://phabricator.wikimedia.org/P60311 and previous config saved to /var/cache/conftool/dbconfig/20240410-231008-arnaudb.json
[23:10:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[23:10:13] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:10:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[23:10:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T360332)', diff saved to https://phabricator.wikimedia.org/P60312 and previous config saved to /var/cache/conftool/dbconfig/20240410-231032-arnaudb.json
[23:13:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T360332)', diff saved to https://phabricator.wikimedia.org/P60313 and previous config saved to /var/cache/conftool/dbconfig/20240410-231335-arnaudb.json
[23:28:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P60314 and previous config saved to /var/cache/conftool/dbconfig/20240410-232842-arnaudb.json
[23:37:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018410
[23:37:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018410 (owner: TrainBranchBot)
[23:43:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P60315 and previous config saved to /var/cache/conftool/dbconfig/20240410-234350-arnaudb.json
[23:58:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T360332)', diff saved to https://phabricator.wikimedia.org/P60316 and previous config saved to /var/cache/conftool/dbconfig/20240410-235857-arnaudb.json
[23:59:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[23:59:02] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:59:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[23:59:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T360332)', diff saved to https://phabricator.wikimedia.org/P60317 and previous config saved to /var/cache/conftool/dbconfig/20240410-235920-arnaudb.json
[23:59:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T356166)', diff saved to https://phabricator.wikimedia.org/P60318 and previous config saved to /var/cache/conftool/dbconfig/20240410-235950-marostegui.json
[23:59:55] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166