[01:16:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:28] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:23:28] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:30:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T356166)', diff saved to https://phabricator.wikimedia.org/P60169 and previous config saved to /var/cache/conftool/dbconfig/20240410-033019-marostegui.json
[03:30:23] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[03:45:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P60170 and previous config saved to /var/cache/conftool/dbconfig/20240410-034526-marostegui.json
[04:00:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P60171 and previous config saved to /var/cache/conftool/dbconfig/20240410-040033-marostegui.json
[04:15:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T356166)', diff saved to https://phabricator.wikimedia.org/P60172 and previous config saved to /var/cache/conftool/dbconfig/20240410-041541-marostegui.json
[04:15:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[04:15:45] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[04:15:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[04:16:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T356166)', diff saved to https://phabricator.wikimedia.org/P60173 and previous config saved to /var/cache/conftool/dbconfig/20240410-041604-marostegui.json
[04:46:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Kernel reboot
[04:46:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Kernel reboot
[04:49:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60174 and previous config saved to /var/cache/conftool/dbconfig/20240410-044928-root.json
[04:52:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 11.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:55:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad api_appserver POST/200: ...
[04:55:15] <jinxer-wm>	 0.4246627549440708s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:55:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223 T362134', diff saved to https://phabricator.wikimedia.org/P60175 and previous config saved to /var/cache/conftool/dbconfig/20240410-045534-marostegui.json
[04:55:40] <stashbot>	 T362134: Upgrade s3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362134
[04:56:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db1223', diff saved to https://phabricator.wikimedia.org/P60176 and previous config saved to /var/cache/conftool/dbconfig/20240410-045632-marostegui.json
[04:57:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166 T362134', diff saved to https://phabricator.wikimedia.org/P60177 and previous config saved to /var/cache/conftool/dbconfig/20240410-045710-marostegui.json
[04:57:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.01% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:58:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1166.eqiad.wmnet with OS bookworm
[05:00:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad api_appserver POST/200: ...
[05:00:15] <jinxer-wm>	 0.4246627549440708s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:04:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60178 and previous config saved to /var/cache/conftool/dbconfig/20240410-050434-root.json
[05:10:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage
[05:12:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage
[05:16:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60179 and previous config saved to /var/cache/conftool/dbconfig/20240410-051939-root.json
[05:28:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60180 and previous config saved to /var/cache/conftool/dbconfig/20240410-052854-root.json
[05:33:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1166.eqiad.wmnet with OS bookworm
[05:34:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60181 and previous config saved to /var/cache/conftool/dbconfig/20240410-053445-root.json
[05:44:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60182 and previous config saved to /var/cache/conftool/dbconfig/20240410-054400-root.json
[05:47:54] <marostegui>	 fixed
[05:49:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60183 and previous config saved to /var/cache/conftool/dbconfig/20240410-054952-root.json
[05:55:00] <hashar>	 marostegui: what have you fixed? :)
[05:55:38] <hashar>	 asking cause we had an unbreak now about VisualEditor not being able to save draft parsoid html ( https://phabricator.wikimedia.org/T362210 )
[05:56:07] <hashar>	 and that magically resolved ( https://grafana.wikimedia.org/d/t_x3DEu4k/parsoid-health?forceLogin=&from=1712706703415&orgId=1&to=1712728303415&refresh=15m&viewPanel=6 )
[05:56:07] <hashar>	 :)
[05:57:05] <marostegui>	 We had a p4ge
[05:59:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60184 and previous config saved to /var/cache/conftool/dbconfig/20240410-055906-root.json
[05:59:49] <hashar>	 marostegui: what page was it? Cause non-SREs don't get them, so I can't know what happened
[06:00:04] <hashar>	 looks like some DB went wild maybe?
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0600)
[06:00:30] <marostegui>	 It was a db from x2 yeah
[06:01:12] <hashar>	 ah I see you commented on the task :) thanks!
[06:01:39] <marostegui>	 I didn't close it yet but I'm sure it was the same thing 
[06:02:26] <hashar>	 subbu: so my guess is we can remove the train blocker
[06:03:04] <subbu>	 yes .. it also impacted dewiki, which the train hadn't rolled out to yet.
[06:03:05] <hashar>	 and maybe want to investigate why `HtmlOutputRendererHelper` errors are not logged anywhere (or at least I haven't found them)
[06:04:07] <hashar>	 or I misunderstood the MediaWiki code
[06:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:26] <hashar>	 anyway that seems solved, and I am going to have breakfast with kids
[06:04:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60185 and previous config saved to /var/cache/conftool/dbconfig/20240410-060457-root.json
[06:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:14:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60186 and previous config saved to /var/cache/conftool/dbconfig/20240410-061411-root.json
[06:20:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60187 and previous config saved to /var/cache/conftool/dbconfig/20240410-062003-root.json
[06:21:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 5%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60188 and previous config saved to /var/cache/conftool/dbconfig/20240410-062114-arnaudb.json
[06:29:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60189 and previous config saved to /var/cache/conftool/dbconfig/20240410-062917-root.json
[06:36:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 10%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60190 and previous config saved to /var/cache/conftool/dbconfig/20240410-063620-arnaudb.json
[06:37:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 1%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60191 and previous config saved to /var/cache/conftool/dbconfig/20240410-063734-arnaudb.json
[06:44:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60192 and previous config saved to /var/cache/conftool/dbconfig/20240410-064423-root.json
[06:51:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 20%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60193 and previous config saved to /var/cache/conftool/dbconfig/20240410-065125-arnaudb.json
[06:52:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 2%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60194 and previous config saved to /var/cache/conftool/dbconfig/20240410-065239-arnaudb.json
[06:59:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60195 and previous config saved to /var/cache/conftool/dbconfig/20240410-065929-root.json
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0700). nyaa~
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Steph Toyofuku - https://phabricator.wikimedia.org/T362113#9703016 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[07:06:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60196 and previous config saved to /var/cache/conftool/dbconfig/20240410-070631-arnaudb.json
[07:07:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 4%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60197 and previous config saved to /var/cache/conftool/dbconfig/20240410-070745-arnaudb.json
[07:21:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60198 and previous config saved to /var/cache/conftool/dbconfig/20240410-072137-arnaudb.json
[07:22:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 8%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60199 and previous config saved to /var/cache/conftool/dbconfig/20240410-072253-arnaudb.json
[07:25:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::server::spare
[07:29:42] <logmsgbot>	 !log akosiaris@deploy1002 Synchronized wmf-config/mc.php: Dummy sync for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018332 (duration: 14m 03s)
[07:33:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::server::spare
[07:36:36] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9703049 (MoritzMuehlenhoff)
[07:36:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60200 and previous config saved to /var/cache/conftool/dbconfig/20240410-073644-arnaudb.json
[07:37:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 16%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60201 and previous config saved to /var/cache/conftool/dbconfig/20240410-073759-arnaudb.json
[07:50:13] <wikibugs>	 SRE, Phabricator, Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#9703058 (Aklapper) For archaeology researchers: This functionality got broken/removed in February 2016 by https://gerri...
[07:50:23] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3070.esams.wmnet
[07:51:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P60202 and previous config saved to /var/cache/conftool/dbconfig/20240410-075150-arnaudb.json
[07:52:01] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3070.esams.wmnet with OS bullseye
[07:52:11] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9703059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3070.esams.wmnet with OS bullseye
[07:53:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60203 and previous config saved to /var/cache/conftool/dbconfig/20240410-075304-arnaudb.json
[07:54:44] <wikibugs>	 (CR) Majavah: [C:+2] Remove names for old cloudmetrics redirects [dns] - https://gerrit.wikimedia.org/r/1018312 (owner: Majavah)
[07:55:50] <wikibugs>	 (PS2) Muehlenhoff: Add stoyofuku to analytics-privatedata-access [puppet] - https://gerrit.wikimedia.org/r/1018634 (https://phabricator.wikimedia.org/T362113)
[07:56:30] <moritzm>	 !log installing glibc security updates on bullseye
[07:56:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:22] <wikibugs>	 (PS1) Slyngshede: Change ssh key validator from class to function. [software/bitu] - https://gerrit.wikimedia.org/r/1018635
[08:00:05] <jouncebot>	 hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0800)
[08:00:13] <wikibugs>	 (CR) Slyngshede: [C:+2] API: Username validation API. [software/bitu] - https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) (owner: Slyngshede)
[08:01:17] <wikibugs>	 (Merged) jenkins-bot: API: Username validation API. [software/bitu] - https://gerrit.wikimedia.org/r/1017244 (https://phabricator.wikimedia.org/T361066) (owner: Slyngshede)
[08:04:16] <wikibugs>	 (PS1) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[08:06:37] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1018638 (https://phabricator.wikimedia.org/T360158)
[08:06:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1018638 (https://phabricator.wikimedia.org/T360158) (owner: TrainBranchBot)
[08:07:14] <wikibugs>	 (CR) Hashar: "I have no idea of how many logs that would generate and what kind of pressure that can add to the logging stack." [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[08:07:22] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1018638 (https://phabricator.wikimedia.org/T360158) (owner: TrainBranchBot)
[08:08:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60204 and previous config saved to /var/cache/conftool/dbconfig/20240410-080810-arnaudb.json
[08:11:05] <wikibugs>	 (PS1) Slyngshede: C:idm::deployment Add Django REST Framework. [puppet] - https://gerrit.wikimedia.org/r/1018640
[08:11:07] <wikibugs>	 (CR) Muehlenhoff: [C:+2] puppetdb::microservice: Use firewall::service [puppet] - https://gerrit.wikimedia.org/r/1017769 (owner: Muehlenhoff)
[08:11:37] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: deploy mistral-7b-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1018633 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[08:15:03] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3070.esams.wmnet with reason: host reimage
[08:15:26] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018640 (owner: Slyngshede)
[08:15:42] <wikibugs>	 (CR) Slyngshede: [C:+2] C:idm::deployment Add Django REST Framework. [puppet] - https://gerrit.wikimedia.org/r/1018640 (owner: Slyngshede)
[08:18:38] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3070.esams.wmnet with reason: host reimage
[08:21:13] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: deploy mistral-7b-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1018633 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[08:21:27] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.26  refs T360158
[08:21:33] <stashbot>	 T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158
[08:21:47] <wikibugs>	 (PS1) Muehlenhoff: Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742)
[08:22:07] <wikibugs>	 (Merged) jenkins-bot: ml-services: deploy mistral-7b-instruct [deployment-charts] - https://gerrit.wikimedia.org/r/1018633 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[08:22:52] <wikibugs>	 (CR) CI reject: [V:-1] Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742) (owner: Muehlenhoff)
[08:23:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60205 and previous config saved to /var/cache/conftool/dbconfig/20240410-082316-arnaudb.json
[08:24:47] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:24:59] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:25:13] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:25:15] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:25:43] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:25:53] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:25:57] <wikibugs>	 (PS2) Muehlenhoff: Add andyrussg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1018641 (https://phabricator.wikimedia.org/T361742)
[08:34:20] <logmsgbot>	 !log gmodena@deploy1002 Started deploy [airflow-dags/analytics@46818a3]: Deploying cassandra_load_pageview_top_articles changes MR#648
[08:34:32] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.42.0-wmf.26  refs T360158 (duration: 13m 05s)
[08:34:38] <stashbot>	 T360158: 1.42.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T360158
[08:34:54] <logmsgbot>	 !log gmodena@deploy1002 Finished deploy [airflow-dags/analytics@46818a3]: Deploying cassandra_load_pageview_top_articles changes MR#648 (duration: 00m 33s)
[08:35:49] <hashar>	 looks like it is working
[08:35:53] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:36:21] <wikibugs>	 (PS1) EoghanGaffney: gitlab: Fix typo in systemctl timer command [puppet] - https://gerrit.wikimedia.org/r/1018642
[08:36:55] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9703148 (BTullis)
[08:38:00] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1018642 (owner: EoghanGaffney)
[08:38:11] <wikibugs>	 (CR) EoghanGaffney: [C:+2] gitlab: Fix typo in systemctl timer command [puppet] - https://gerrit.wikimedia.org/r/1018642 (owner: EoghanGaffney)
[08:38:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: Post clone repool (dst)', diff saved to https://phabricator.wikimedia.org/P60206 and previous config saved to /var/cache/conftool/dbconfig/20240410-083822-arnaudb.json
[08:39:13] <wikibugs>	 SRE, cloud-services-team, Data-Services, Infrastructure-Foundations: Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914#9703154 (taavi) Open→Invalid Closing as we've moved the NFS servers to Cloud VPS VMs and I'm pretty sure we did...
[08:39:17] <wikibugs>	 (PS1) Filippo Giunchedi: titan: trim 5m retention to 3y + 2w [puppet] - https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927)
[08:41:00] <wikibugs>	 (CR) Btullis: [C:+2] Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[08:41:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 27.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:42:03] <wikibugs>	 (Merged) jenkins-bot: Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[08:42:28] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3070.esams.wmnet with OS bullseye
[08:42:43] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9703163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3070.esams.wmnet with OS bullseye completed: - cp3070 (**PASS**)...
[08:44:03] <wikibugs>	 (PS2) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[08:46:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 27.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:49:01] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3070.esams.wmnet
[08:50:30] <Lucas_WMDE>	 jouncebot: nowandnext
[08:50:31] <jouncebot>	 For the next 1 hour(s) and 9 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0800)
[08:50:31] <jouncebot>	 In 1 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1000)
[08:53:06] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] Use oauth2-proxy for opensearch dashboards [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[08:56:20] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: fix indentation in mistral model resources [deployment-charts] - https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986)
[08:58:18] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9703181 (Fabfur)
[09:07:31] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818)
[09:07:49] <wikibugs>	 (CR) CI reject: [V:-1] hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[09:13:28] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:16:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:33] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[09:21:45] <effie>	 !jouncebot  now
[09:21:45] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[09:21:54] <effie>	 jouncebot: now
[09:21:54] <jouncebot>	 For the next 0 hour(s) and 38 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T0800)
[09:23:05] <wikibugs>	 (PS1) Arnaudb: mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149)
[09:25:40] <wikibugs>	 (PS2) Filippo Giunchedi: hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818)
[09:26:25] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9703207 (fgiunchedi)
[09:28:13] <wikibugs>	 (CR) Marostegui: mariadb: create new account and database on m5 for striker_toolsbeta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb)
[09:28:39] <wikibugs>	 (PS2) Fabfur: prometheus: add aggregate metrics for benthos [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845)
[09:29:17] <wikibugs>	 (PS2) Arnaudb: mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149)
[09:29:30] <wikibugs>	 (CR) Arnaudb: mariadb: create new account and database on m5 for striker_toolsbeta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb)
[09:32:46] <wikibugs>	 (CR) Vgutierrez: [C:+1] "thanks for submitting this" [puppet] - https://gerrit.wikimedia.org/r/1018355 (https://phabricator.wikimedia.org/T362197) (owner: BCornwall)
[09:40:28] <logmsgbot>	 !log jiji@deploy1002 Started scap: (no justification provided)
[09:41:52] <wikibugs>	 (CR) Marostegui: "Remember to drop those users with: drop user if exists 'USERNAME'@'IPS_REMOVED';" [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb)
[09:42:45] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb)
[09:42:53] <effie>	 !log running scap sync-world to rebuild mw image and pick up gerrit:1015338
[09:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:41] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: create new account and database on m5 for striker_toolsbeta [puppet] - https://gerrit.wikimedia.org/r/1018408 (https://phabricator.wikimedia.org/T360149) (owner: Arnaudb)
[09:45:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 966.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:50:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 961.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:51:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[09:52:02] <wikibugs>	 (PS1) Majavah: Update example Striker hiera [labs/private] - https://gerrit.wikimedia.org/r/1018652
[09:52:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[09:52:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T360332)', diff saved to https://phabricator.wikimedia.org/P60207 and previous config saved to /var/cache/conftool/dbconfig/20240410-095214-arnaudb.json
[09:52:23] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[09:54:33] <wikibugs>	 (CR) Majavah: [V:+2 C:+2] Update example Striker hiera [labs/private] - https://gerrit.wikimedia.org/r/1018652 (owner: Majavah)
[09:55:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T360332)', diff saved to https://phabricator.wikimedia.org/P60208 and previous config saved to /var/cache/conftool/dbconfig/20240410-095508-arnaudb.json
[09:55:50] <wikibugs>	 (PS4) Majavah: P:wmcs::striker: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025)
[09:57:30] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] hieradata: test sso for opensearch-dashboards in cloud vps [puppet] - https://gerrit.wikimedia.org/r/1018647 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[09:57:34] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1839/co" [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[09:58:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::sanitarium_master
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1000)
[10:01:48] <wikibugs>	 (PS1) Muehlenhoff: Switch mariadb::sanitarium_master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1018653 (https://phabricator.wikimedia.org/T349619)
[10:02:02] <wikibugs>	 (PS2) Muehlenhoff: Switch mariadb::sanitarium_master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1018653 (https://phabricator.wikimedia.org/T349619)
[10:02:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:02:38] <wikibugs>	 (PS1) Filippo Giunchedi: opensearch: fix sso support [puppet] - https://gerrit.wikimedia.org/r/1018654 (https://phabricator.wikimedia.org/T337818)
[10:03:23] <wikibugs>	 (PS1) Clément Goubert: kubernetes: move 6 eqiad api_appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018655 (https://phabricator.wikimedia.org/T351074)
[10:03:47] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch mariadb::sanitarium_master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1018653 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:04:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.11s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:06:03] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] opensearch: fix sso support [puppet] - https://gerrit.wikimedia.org/r/1018654 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[10:07:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 26.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:08:28] <logmsgbot>	 !log jiji@deploy1002 Finished scap: (no justification provided) (duration: 27m 59s)
[10:08:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[10:09:25] <wikibugs>	 (PS1) Majavah: Add toolsadmin-toolsbeta [dns] - https://gerrit.wikimedia.org/r/1018656 (https://phabricator.wikimedia.org/T360025)
[10:09:31] <wikibugs>	 (CR) Alexandros Kosiaris: "It doesn't impact scandium at all. The only user of this destination was RESTBase and now it uses the mw-parsoid destination." [puppet] - https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[10:10:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P60209 and previous config saved to /var/cache/conftool/dbconfig/20240410-101015-arnaudb.json
[10:11:31] <wikibugs>	 (CR) Clément Goubert: [C:+1] Clean up all the RESTBase hosts's parsoid uri changes [puppet] - https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[10:11:51] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:wmcs::striker: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/1011174 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[10:12:02] <wikibugs>	 (CR) Clément Goubert: [C:+1] services_proxy: Remove parsoid-php, parsoid-async [puppet] - https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[10:12:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::sanitarium_master
[10:13:28] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:14:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 872.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:14:22] <wikibugs>	 (PS1) Filippo Giunchedi: opensearch: use Sensitive[String] for sso secrets [puppet] - https://gerrit.wikimedia.org/r/1018657 (https://phabricator.wikimedia.org/T337818)
[10:14:45] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] opensearch: use Sensitive[String] for sso secrets [puppet] - https://gerrit.wikimedia.org/r/1018657 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[10:16:11] <claime>	 !log Disabling puppet on O:docker_registry_ha::registry - T360636
[10:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:22] <stashbot>	 T360636: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636
[10:16:54] <wikibugs>	 (CR) Clément Goubert: [V:+1 C:+2] docker_registry_ha: Migrate to cfssl [puppet] - https://gerrit.wikimedia.org/r/1018251 (https://phabricator.wikimedia.org/T360636) (owner: Clément Goubert)
[10:17:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 T362036', diff saved to https://phabricator.wikimedia.org/P60210 and previous config saved to /var/cache/conftool/dbconfig/20240410-101746-root.json
[10:17:50] <stashbot>	 T362036: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T362036
[10:18:31] <wikibugs>	 (PS1) Marostegui: db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018658
[10:18:40] <claime>	 !log Enabling and running puppet on registry1003.eqiad.wmnet - T360636
[10:18:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:45] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9703294 (MoritzMuehlenhoff)
[10:19:15] <wikibugs>	 (CR) Marostegui: [C:+2] db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1018658 (owner: Marostegui)
[10:19:21] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS bookworm
[10:21:12] <claime>	 !log Enabling and running puppet on O:docker_registry_ha::registry - T360636
[10:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:05] <wikibugs>	 (PS1) Filippo Giunchedi: opensearch: move apache-auth-sso.erb to the right location [puppet] - https://gerrit.wikimedia.org/r/1018659 (https://phabricator.wikimedia.org/T337818)
[10:22:26] <wikibugs>	 (PS1) Alexandros Kosiaris: Remove parsoid-php certificates from mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387)
[10:22:27] <wikibugs>	 (PS1) Alexandros Kosiaris: fixtures: Rename all parsoid-php references [deployment-charts] - https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387)
[10:22:28] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] opensearch: move apache-auth-sso.erb to the right location [puppet] - https://gerrit.wikimedia.org/r/1018659 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[10:25:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P60211 and previous config saved to /var/cache/conftool/dbconfig/20240410-102523-arnaudb.json
[10:26:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.052s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:26:19] <wikibugs>	 SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9703307 (Clement_Goubert)
[10:27:31] <wikibugs>	 SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9703310 (Clement_Goubert) chartmuseum and docker-registry done
[10:27:55] <wikibugs>	 (PS1) Muehlenhoff: puppetboard: Remove obsolete cert [puppet] - https://gerrit.wikimedia.org/r/1018662
[10:28:57] <wikibugs>	 (PS1) Muehlenhoff: puppetboard: Remove obsolete cert [labs/private] - https://gerrit.wikimedia.org/r/1018663
[10:31:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 994.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:32:24] <wikibugs>	 (PS1) Majavah: hieradata: Add Striker toolsbeta instance [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025)
[10:32:27] <wikibugs>	 (PS1) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025)
[10:32:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage
[10:33:28] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:33:39] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] puppetboard: Remove obsolete cert [labs/private] - https://gerrit.wikimedia.org/r/1018663 (owner: Muehlenhoff)
[10:34:33] <wikibugs>	 (PS2) Majavah: hieradata: Add Striker toolsbeta instance [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025)
[10:34:33] <wikibugs>	 (PS2) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025)
[10:34:39] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] puppetboard: Remove obsolete cert [puppet] - https://gerrit.wikimedia.org/r/1018662 (owner: Muehlenhoff)
[10:35:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage
[10:36:05] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1841/co" [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[10:38:21] <wikibugs>	 (PS1) Filippo Giunchedi: opensearch: set vhost and issuer url for dashboards sso test [puppet] - https://gerrit.wikimedia.org/r/1018667 (https://phabricator.wikimedia.org/T337818)
[10:40:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T360332)', diff saved to https://phabricator.wikimedia.org/P60212 and previous config saved to /var/cache/conftool/dbconfig/20240410-104030-arnaudb.json
[10:40:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[10:40:40] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[10:40:46] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[10:40:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T360332)', diff saved to https://phabricator.wikimedia.org/P60213 and previous config saved to /var/cache/conftool/dbconfig/20240410-104053-arnaudb.json
[10:43:38] <wikibugs>	 (CR) Clément Goubert: "Bunch of nitpicking to make ports match up with the actual services_proxy" [deployment-charts] - https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[10:43:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T360332)', diff saved to https://phabricator.wikimedia.org/P60214 and previous config saved to /var/cache/conftool/dbconfig/20240410-104345-arnaudb.json
[10:45:20] <wikibugs>	 (CR) Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:45:28] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:46:18] <wikibugs>	 (PS5) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690)
[10:46:57] <wikibugs>	 (Merged) jenkins-bot: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:46:58] <wikibugs>	 (PS6) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690)
[10:47:11] <wikibugs>	 (PS1) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018382
[10:47:23] <wikibugs>	 (PS2) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018382
[10:48:28] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:48:56] <wikibugs>	 (CR) Clément Goubert: Remove parsoid-php certificates from mw deployments (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[10:49:20] <wikibugs>	 (PS1) Mvolz: Revert "Revert "citoid: pipeline bot promote"" [deployment-charts] - https://gerrit.wikimedia.org/r/1018383
[10:50:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch testreduce to cfssl [puppet] - https://gerrit.wikimedia.org/r/1018199 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[10:52:51] <wikibugs>	 (CR) David Caro: [C:+1] "🎉 yay" [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[10:53:08] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[10:53:12] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[10:53:19] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] hieradata: Add Striker toolsbeta instance [puppet] - https://gerrit.wikimedia.org/r/1018664 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[10:53:50] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1018382 (owner: Marostegui)
[10:54:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60215 and previous config saved to /var/cache/conftool/dbconfig/20240410-105444-root.json
[10:55:03] <wikibugs>	 (PS1) Effie Mouzeli: mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670
[10:55:54] <wikibugs>	 (CR) Clément Goubert: [C:+1] "A nit on the commit message so we don't confuse ourselves, otherwise LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:56:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 850.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:56:19] <wikibugs>	 (CR) Clément Goubert: [C:+1] mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 (owner: Effie Mouzeli)
[10:56:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1175.eqiad.wmnet with OS bookworm
[10:57:38] <wikibugs>	 SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703351 (Mvolz) I notice that Zotero is not part of this dashboard: https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?orgId=1  Is there a re...
[10:58:39] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] kubernetes: move 6 eqiad api_appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018655 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[10:58:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P60216 and previous config saved to /var/cache/conftool/dbconfig/20240410-105852-arnaudb.json
[10:58:53] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] DNS-related cookbooks: adapt for conftool state [cookbooks] - https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: Volans)
[10:59:22] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 (owner: Effie Mouzeli)
[10:59:42] <claime>	 !log Depooling mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074
[10:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:46] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:00:05] <jouncebot>	 mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1100).
[11:00:56] <wikibugs>	 (Merged) jenkins-bot: mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1018670 (owner: Effie Mouzeli)
[11:01:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 850.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:01:26] <wikibugs>	 (CR) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:01:28] <wikibugs>	 (PS1) Muehlenhoff: Remove certs for docker-registry and testreduce [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636)
[11:01:29] <wikibugs>	 (PS7) Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690)
[11:01:52] <wikibugs>	 SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9703359 (MoritzMuehlenhoff)
[11:02:18] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[11:02:19] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:02:19] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes: move 6 eqiad api_appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018655 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[11:02:23] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:02:49] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:03:03] <wikibugs>	 (CR) Mvolz: [C:+2] Revert "Revert "citoid: pipeline bot promote"" [deployment-charts] - https://gerrit.wikimedia.org/r/1018383 (owner: Mvolz)
[11:03:26] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:03:58] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "citoid: pipeline bot promote"" [deployment-charts] - https://gerrit.wikimedia.org/r/1018383 (owner: Mvolz)
[11:05:09] <wikibugs>	 SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703363 (Clement_Goubert) I think it's because monitoring is disabled in the service's `values.yaml`
[11:07:05] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:07:07] <wikibugs>	 (PS1) Clément Goubert: zotero: Turn on monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689)
[11:07:36] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:07:59] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:08:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1421.eqiad.wmnet with OS bullseye
[11:08:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1422.eqiad.wmnet with OS bullseye
[11:08:37] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:09:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1491.eqiad.wmnet with OS bullseye
[11:09:37] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1492.eqiad.wmnet with OS bullseye
[11:09:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60217 and previous config saved to /var/cache/conftool/dbconfig/20240410-110949-root.json
[11:10:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1493.eqiad.wmnet with OS bullseye
[11:12:31] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[11:12:48] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:13:29] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:14:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P60218 and previous config saved to /var/cache/conftool/dbconfig/20240410-111400-arnaudb.json
[11:14:02] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:15:11] <wikibugs>	 (PS3) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025)
[11:15:11] <wikibugs>	 (PS1) Majavah: P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025)
[11:15:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[11:15:21] <wikibugs>	 (PS1) Slyngshede: SSH Keymanagement: Fix label on SSH public key field. [software/bitu] - https://gerrit.wikimedia.org/r/1018676 (https://phabricator.wikimedia.org/T362049)
[11:15:52] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[11:15:55] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:15:55] <wikibugs>	 (CR) Clément Goubert: [C:+1] Remove certs for docker-registry and testreduce [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[11:16:24] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:16:40] <wikibugs>	 (PS2) Majavah: P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025)
[11:16:40] <wikibugs>	 (PS4) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025)
[11:17:08] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:17:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove certs for docker-registry and testreduce [puppet] - https://gerrit.wikimedia.org/r/1018671 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[11:18:36] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: fix indentation in mistral model resources [deployment-charts] - https://gerrit.wikimedia.org/r/1018646 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[11:18:47] <claime>	 The appservers unreachable alert is a false positive due to reimaging
[11:18:55] <claime>	 looking at the httpbb issue
[11:19:24] <effie>	 jouncebot: now
[11:19:30] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1100)
[11:19:37] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete dummy certs for docker-registry and testreduce [labs/private] - https://gerrit.wikimedia.org/r/1018678 (https://phabricator.wikimedia.org/T360636)
[11:19:49] <logmsgbot>	 !log jiji@deploy1002 Started scap: Deploy chart changes in gerrit:1015342
[11:19:53] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::striker: Set the port to bind on [puppet] - https://gerrit.wikimedia.org/r/1018675 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[11:20:09] <claime>	 httpbb issue was transient
[11:20:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[11:20:41] <wikibugs>	 (CR) Clément Goubert: [C:+1] Remove obsolete dummy certs for docker-registry and testreduce [labs/private] - https://gerrit.wikimedia.org/r/1018678 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[11:21:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1421.eqiad.wmnet with reason: host reimage
[11:21:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:27] <wikibugs>	 (CR) Slyngshede: [C:+2] SSH Keymanagement: Fix label on SSH public key field. [software/bitu] - https://gerrit.wikimedia.org/r/1018676 (https://phabricator.wikimedia.org/T362049) (owner: Slyngshede)
[11:21:36] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1422.eqiad.wmnet with reason: host reimage
[11:21:39] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete dummy certs for docker-registry and testreduce [labs/private] - https://gerrit.wikimedia.org/r/1018678 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[11:22:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1491.eqiad.wmnet with reason: host reimage
[11:22:31] <wikibugs>	 (Merged) jenkins-bot: SSH Keymanagement: Fix label on SSH public key field. [software/bitu] - https://gerrit.wikimedia.org/r/1018676 (https://phabricator.wikimedia.org/T362049) (owner: Slyngshede)
[11:22:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage
[11:23:01] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1493.eqiad.wmnet with reason: host reimage
[11:24:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1421.eqiad.wmnet with reason: host reimage
[11:24:43] <wikibugs>	 (CR) Mvolz: [C:+1] "LGTM but I'm not sure in retrospect the spec.yaml for Zotero will work-" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert)
[11:24:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60219 and previous config saved to /var/cache/conftool/dbconfig/20240410-112455-root.json
[11:26:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (5) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1493.eqiad.wmnet with reason: host reimage
[11:28:07] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Deploy chart changes in gerrit:1015342 (duration: 08m 18s)
[11:28:54] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:29:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T360332)', diff saved to https://phabricator.wikimedia.org/P60220 and previous config saved to /var/cache/conftool/dbconfig/20240410-112907-arnaudb.json
[11:29:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[11:29:12] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[11:29:15] <wikibugs>	 (PS5) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025)
[11:29:15] <wikibugs>	 (PS1) Majavah: P:wmcs::striker::docker: bind on 0.0.0.0 instead [puppet] - https://gerrit.wikimedia.org/r/1018679
[11:29:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[11:29:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T360332)', diff saved to https://phabricator.wikimedia.org/P60221 and previous config saved to /var/cache/conftool/dbconfig/20240410-112929-arnaudb.json
[11:29:56] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9703432 (MoritzMuehlenhoff)
[11:30:54] <wikibugs>	 (CR) Clément Goubert: "Hey Filippo, can you weigh in on swagger monitoring for zotero please?" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert)
[11:31:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1422.eqiad.wmnet with reason: host reimage
[11:31:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "This would instruct prometheus to scrape zotero (or at least the sidecar statsd-exporter living next to zotero that exposes metrics from z" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert)
[11:32:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T360332)', diff saved to https://phabricator.wikimedia.org/P60222 and previous config saved to /var/cache/conftool/dbconfig/20240410-113220-arnaudb.json
[11:33:20] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::striker::docker: bind on 0.0.0.0 instead [puppet] - https://gerrit.wikimedia.org/r/1018679 (owner: Majavah)
[11:34:12] <wikibugs>	 (CR) Clément Goubert: "Yeah, that's what I was starting to piece together. I think we need to add the swagger probe type to the service definition, but I am unsu" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert)
[11:34:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1491.eqiad.wmnet with reason: host reimage
[11:36:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:36:55] <jinxer-wm>	 (SystemdUnitFailed) resolved: (10) httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:47] <wikibugs>	 SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703471 (Clement_Goubert) Summing up the discussion on the patch set, this is not what is wanted, turning monitoring on in the service would turn on prometheus met...
[11:38:21] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage
[11:38:52] <wikibugs>	 (PS1) Btullis: Update third-party/matomo repository definition [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552)
[11:38:54] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:39:18] <wikibugs>	 (PS1) Mvolz: Revert "Update zotero to node18" [deployment-charts] - https://gerrit.wikimedia.org/r/1018686 (https://phabricator.wikimedia.org/T361728)
[11:39:44] <wikibugs>	 (Abandoned) Mvolz: Revert "Update zotero to node18" [deployment-charts] - https://gerrit.wikimedia.org/r/1018686 (https://phabricator.wikimedia.org/T361728) (owner: Mvolz)
[11:40:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60223 and previous config saved to /var/cache/conftool/dbconfig/20240410-114001-root.json
[11:41:14] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1842/console" [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[11:42:01] <wikibugs>	 (Abandoned) Clément Goubert: zotero: Turn on monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert)
[11:42:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1421.eqiad.wmnet with OS bullseye
[11:42:21] <wikibugs>	 (CR) Muehlenhoff: Update third-party/matomo repository definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[11:42:42] <wikibugs>	 (PS2) Btullis: Update third-party/matomo repository definition [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552)
[11:44:15] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1843/console" [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[11:45:50] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1493.eqiad.wmnet with OS bullseye
[11:47:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P60224 and previous config saved to /var/cache/conftool/dbconfig/20240410-114728-arnaudb.json
[11:48:51] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9703513 (MoritzMuehlenhoff)
[11:49:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1422.eqiad.wmnet with OS bullseye
[11:51:20] <wikibugs>	 (CR) Hnowlan: [C:+1] shellbox: add PHP + Apache timeout settings [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[11:53:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1491.eqiad.wmnet with OS bullseye
[11:54:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1492.eqiad.wmnet with OS bullseye
[11:55:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60225 and previous config saved to /var/cache/conftool/dbconfig/20240410-115506-root.json
[12:01:31] <claime>	 !log Running homer 'cr*eqiad*' commit 'T351074' and homer 'lsw1-e3-eqiad*' commit 'T351074'
[12:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:39] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:02:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P60226 and previous config saved to /var/cache/conftool/dbconfig/20240410-120235-arnaudb.json
[12:02:59] <wikibugs>	 (PS3) Btullis: Update third-party/matomo repository definition [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552)
[12:04:22] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1002.wikimedia.org with OS bookworm
[12:04:40] <wikibugs>	 (CR) Btullis: Update third-party/matomo repository definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[12:05:52] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1844/console" [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[12:08:49] <wikibugs>	 (CR) JMeybohm: [C:+1] "I'd say this looks reasonable" [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[12:09:15] <wikibugs>	 SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703551 (Mvolz) >>! In T213689#9703471, @Clement_Goubert wrote: > Summing up the discussion on the patch set, this is not what is wanted, turning monitoring on...
[12:09:16] <Lucas_WMDE>	 jouncebot: now
[12:09:16] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 50 minute(s)
[12:10:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60227 and previous config saved to /var/cache/conftool/dbconfig/20240410-121012-root.json
[12:11:53] <claime>	 !log Pooling and uncordoning mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074
[12:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:57] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:12:09] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1421.eqiad.wmnet|mw1422.eqiad.wmnet|mw1491.eqiad.wmnet|mw1492.eqiad.wmnet|mw1493.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[12:14:58] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@deploy1002 ~ $ mwscript-k8s extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P4496 --new-data-type external-id --summary '[[phabricator:T359297|T359297]]' # failed, will retry with non-k8s mwscript
[12:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:07] <stashbot>	 T359297: Change Property datatypes from String to External Identifier for NACE code rev.2 (P4496)  - https://phabricator.wikimedia.org/T359297
[12:15:36] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P4496 --new-data-type external-id --summary '[[phabricator:T359297|T359297]]' # succeeded
[12:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T360332)', diff saved to https://phabricator.wikimedia.org/P60228 and previous config saved to /var/cache/conftool/dbconfig/20240410-121743-arnaudb.json
[12:17:47] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[12:17:49] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[12:18:01] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[12:18:02] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:18:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:18:12] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1002.wikimedia.org with reason: host reimage
[12:18:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T360332)', diff saved to https://phabricator.wikimedia.org/P60229 and previous config saved to /var/cache/conftool/dbconfig/20240410-121814-arnaudb.json
[12:18:52] <wikibugs>	 SRE, Citoid, serviceops, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689#9703583 (Clement_Goubert) >>! In T213689#9703551, @Mvolz wrote: > Thanks for linking the actual current Zotero probe - I see it checks the export endpoint? Where c...
[12:19:42] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1018710
[12:20:12] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1002.wikimedia.org with reason: host reimage
[12:20:16] <wikibugs>	 (CR) Peter Fischer: [C:+2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1018710 (owner: Peter Fischer)
[12:21:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T360332)', diff saved to https://phabricator.wikimedia.org/P60230 and previous config saved to /var/cache/conftool/dbconfig/20240410-122104-arnaudb.json
[12:21:09] <wikibugs>	 (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1018710 (owner: Peter Fischer)
[12:25:00] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[12:25:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60231 and previous config saved to /var/cache/conftool/dbconfig/20240410-122518-root.json
[12:25:43] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:26:34] <wikibugs>	 (CR) Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[12:27:11] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] opensearch: set vhost and issuer url for dashboards sso test [puppet] - https://gerrit.wikimedia.org/r/1018667 (https://phabricator.wikimedia.org/T337818) (owner: Filippo Giunchedi)
[12:30:09] <wikibugs>	 (CR) Filippo Giunchedi: "FWIW for the service-wide checks you can add a probe of type: swagger in service::catalog (see wikifeeds for example)" [deployment-charts] - https://gerrit.wikimedia.org/r/1018673 (https://phabricator.wikimedia.org/T213689) (owner: Clément Goubert)
[12:31:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T356166)', diff saved to https://phabricator.wikimedia.org/P60232 and previous config saved to /var/cache/conftool/dbconfig/20240410-123130-marostegui.json
[12:35:31] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1018635 (owner: Slyngshede)
[12:36:01] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[12:36:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P60233 and previous config saved to /var/cache/conftool/dbconfig/20240410-123612-arnaudb.json
[12:37:53] <wikibugs>	 (PS3) Fabfur: prometheus: add aggregate metrics for benthos [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845)
[12:38:26] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1002.wikimedia.org with OS bookworm
[12:38:35] <wikibugs>	 (CR) Fabfur: "Thanks for the comments!" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[12:44:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update for latest VMs - jmm@cumin2002"
[12:45:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update for latest VMs - jmm@cumin2002"
[12:46:03] <wikibugs>	 (PS5) Ayounsi: Spicerack module for gNMI [software/spicerack] - https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325)
[12:46:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P60234 and previous config saved to /var/cache/conftool/dbconfig/20240410-124638-marostegui.json
[12:48:21] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp-test2003.wikimedia.org
[12:49:16] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Uninstall eject on production VMs [puppet] - https://gerrit.wikimedia.org/r/1017275 (owner: Muehlenhoff)
[12:49:29] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] cassandra::instance: fix PKI keystore for each instance [puppet] - https://gerrit.wikimedia.org/r/1018311 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[12:50:18] <moritzm>	 elukey: ok to merge your patch along?
[12:50:23] <elukey>	 moritzm: +1 thanks!
[12:51:18] <moritzm>	 ack, merged now
[12:51:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P60235 and previous config saved to /var/cache/conftool/dbconfig/20240410-125119-arnaudb.json
[12:51:38] <wikibugs>	 (PS3) Elukey: Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647)
[12:52:59] <wikibugs>	 (CR) CI reject: [V:-1] Spicerack module for gNMI [software/spicerack] - https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: Ayounsi)
[12:53:17] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[12:56:20] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[12:56:58] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=authdns-update
[12:59:04] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[12:59:04] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:59:04] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2003.wikimedia.org
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:01:31] <wikibugs>	 (CR) Filippo Giunchedi: "There's a fix to make, rest LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[13:01:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P60236 and previous config saved to /var/cache/conftool/dbconfig/20240410-130145-marostegui.json
[13:02:15] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[13:04:15] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test removing dns entry - volans@cumin2002"
[13:05:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update
[13:05:07] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test removing dns entry - volans@cumin2002"
[13:05:08] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:06:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T360332)', diff saved to https://phabricator.wikimedia.org/P60237 and previous config saved to /var/cache/conftool/dbconfig/20240410-130626-arnaudb.json
[13:06:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[13:06:42] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[13:06:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[13:06:50] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be)
[13:06:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T360332)', diff saved to https://phabricator.wikimedia.org/P60238 and previous config saved to /var/cache/conftool/dbconfig/20240410-130650-arnaudb.json
[13:06:58] <sukhe>	 !log depool cp4052 for PXE boot issue testing
[13:07:00] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[13:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:55] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[13:08:03] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9703728 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp4052.ulsfo.wmnet with OS b...
[13:08:08] <wikibugs>	 (PS1) Elukey: services: update the rec-api's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1018717 (https://phabricator.wikimedia.org/T205870)
[13:09:06] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test restoring dns entry - volans@cumin2002"
[13:09:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T360332)', diff saved to https://phabricator.wikimedia.org/P60239 and previous config saved to /var/cache/conftool/dbconfig/20240410-130940-arnaudb.json
[13:09:55] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test restoring dns entry - volans@cumin2002"
[13:09:56] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:10:39] <wikibugs>	 (CR) Volans: [C:+2] DNS-related cookbooks: adapt for conftool state [cookbooks] - https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: Volans)
[13:14:55] <wikibugs>	 (Merged) jenkins-bot: DNS-related cookbooks: adapt for conftool state [cookbooks] - https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: Volans)
[13:16:11] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1112.eqiad.wmnet,service=(cdn|ats-be)
[13:16:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T356166)', diff saved to https://phabricator.wikimedia.org/P60240 and previous config saved to /var/cache/conftool/dbconfig/20240410-131653-marostegui.json
[13:16:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[13:16:58] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[13:17:00] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[13:17:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244#9703743 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye
[13:17:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[13:17:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T356166)', diff saved to https://phabricator.wikimedia.org/P60241 and previous config saved to /var/cache/conftool/dbconfig/20240410-131716-marostegui.json
[13:19:22] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9703751 (MoritzMuehlenhoff) >>! In T360414#9702570, @andrea.denisse wrote: > I've documented the migration process on Wikitech: https:/...
[13:19:41] <wikibugs>	 (PS4) Elukey: ml-services: force HTTP in revert-risk agnostic staging [deployment-charts] - https://gerrit.wikimedia.org/r/984215 (https://phabricator.wikimedia.org/T353622)
[13:24:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P60242 and previous config saved to /var/cache/conftool/dbconfig/20240410-132447-arnaudb.json
[13:24:57] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore configure TLS verification in staging for testing [deployment-charts] - https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) (owner: Eevans)
[13:26:07] <wikibugs>	 (Merged) jenkins-bot: sessionstore configure TLS verification in staging for testing [deployment-charts] - https://gerrit.wikimedia.org/r/1017935 (https://phabricator.wikimedia.org/T352647) (owner: Eevans)
[13:26:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[13:26:46] <logmsgbot>	 !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[13:26:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244#9703759 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors:...
[13:27:12] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[13:27:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244#9703762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye
[13:28:21] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9703774 (andrea.denisse) >>! In T360414#9703751, @MoritzMuehlenhoff wrote: >>>! In T360414#9702570, @andrea.denisse wrote: >> I've docu...
[13:28:34] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9703767 (Volans) a:Volans→None De-assigning it from me as B...
[13:28:40] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[13:28:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove now obsolete certificate [puppet] - https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff)
[13:29:00] <wikibugs>	 (PS2) Muehlenhoff: Remove now obsolete certificate [puppet] - https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412)
[13:29:41] <wikibugs>	 (PS1) Volans: sre.hosts.decommission: ask on failure [cookbooks] - https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306)
[13:30:16] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[13:30:40] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[13:30:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525
[13:30:57] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), and 2 others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9703798 (bking) a:RKemper
[13:30:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525
[13:30:58] <stashbot>	 T361525: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525
[13:31:17] <wikibugs>	 ops-codfw, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9703808 (bking) Sorry for the noise, I've just downtimed this host.
[13:31:33] <wikibugs>	 (CR) Krinkle: logging: default to log any error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[13:31:38] <wikibugs>	 (PS4) Fabfur: prometheus: add aggregate metrics for benthos [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845)
[13:31:55] <wikibugs>	 (CR) Alexandros Kosiaris: shellbox: add PHP + Apache timeout settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[13:32:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] shellbox: add PHP + Apache timeout settings [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[13:32:09] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), and 2 others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9703819 (bking) Assigning to @RKemper /adding DPE SRE tags.
[13:32:27] <wikibugs>	 (CR) Fabfur: prometheus: add aggregate metrics for benthos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[13:33:06] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[13:35:16] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[13:36:19] <wikibugs>	 (CR) Bking: [C:+1] search: Wait for young pool alert to fail for 5 minutes [alerts] - https://gerrit.wikimedia.org/r/1013575 (owner: Ebernhardson)
[13:38:15] <jinxer-wm>	 (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[13:39:18] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[13:39:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P60243 and previous config saved to /var/cache/conftool/dbconfig/20240410-133955-arnaudb.json
[13:41:08] <wikibugs>	 (CR) Muehlenhoff: [V:+2] Remove now obsolete certificate [puppet] - https://gerrit.wikimedia.org/r/1016313 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff)
[13:42:42] <wikibugs>	 (CR) Muehlenhoff: [C:+2] configmaster: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1004132 (owner: Muehlenhoff)
[13:43:33] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[13:46:06] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[13:46:16] <wikibugs>	 (PS1) Clément Goubert: kubernetes: Move 7 codfw appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1018719 (https://phabricator.wikimedia.org/T351074)
[13:47:44] <moritzm>	 !log installing unbound security updates
[13:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:08] <denisse>	 !log Delete unused Prometheus TLS certificates - T360414
[13:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:13] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[13:53:07] <wikibugs>	 (CR) Herron: [C:+1] titan: trim 5m retention to 3y + 2w [puppet] - https://gerrit.wikimedia.org/r/1018644 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[13:54:54] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye
[13:55:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T360332)', diff saved to https://phabricator.wikimedia.org/P60244 and previous config saved to /var/cache/conftool/dbconfig/20240410-135502-arnaudb.json
[13:55:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[13:55:05] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9703921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp4052.ulsfo.wmnet with OS bulls...
[13:55:09] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[13:55:18] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[13:55:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60245 and previous config saved to /var/cache/conftool/dbconfig/20240410-135525-arnaudb.json
[13:58:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60246 and previous config saved to /var/cache/conftool/dbconfig/20240410-135814-arnaudb.json
[13:58:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be)
[13:59:10] <wikibugs>	 (PS1) Clément Goubert: mw-web, mw-api-ext: Raise replicas for 70% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1018721 (https://phabricator.wikimedia.org/T360763)
[13:59:42] <wikibugs>	 (PS1) Elukey: kask: allow to configure tls options [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647)
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1400)
[14:00:25] <wikibugs>	 (CR) CI reject: [V:-1] kask: allow to configure tls options [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[14:00:56] <wikibugs>	 (PS1) Clément Goubert: trafficserver: move 70% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1018723 (https://phabricator.wikimedia.org/T360763)
[14:01:12] <wikibugs>	 (CR) Alexandros Kosiaris: Remove parsoid-php certificates from mw deployments (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[14:03:07] <wikibugs>	 (CR) Fabfur: [C:+2] prometheus: add aggregate metrics for benthos [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[14:03:33] <wikibugs>	 (PS1) Andrea Denisse: ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414)
[14:07:55] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS bullseye
[14:08:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244#9703948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (...
[14:13:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P60248 and previous config saved to /var/cache/conftool/dbconfig/20240410-141322-arnaudb.json
[14:15:57] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:33] <claime>	 ugh
[14:16:42] <volans>	 acked
[14:16:48] <volans>	 !incidents
[14:16:49] <sirenbot>	 4577 (ACKED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[14:16:49] <urandom>	 o/
[14:16:49] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[14:17:00] <claime>	 So it's been exhausting workers more or less steadily since this morning
[14:17:01] <volans>	 claime: related to any WIP?
[14:17:09] <claime>	 https://grafana.wikimedia.org/goto/i70n34aSg?orgId=1
[14:17:13] <claime>	 Not that I know of
[14:17:20] <volans>	 ok
[14:17:27] <jynus>	 let me check upload log on commons
[14:17:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1112.eqiad.wmnet,service=(cdn|ats-be)
[14:17:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T356166)', diff saved to https://phabricator.wikimedia.org/P60249 and previous config saved to /var/cache/conftool/dbconfig/20240410-141742-marostegui.json
[14:17:46] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[14:17:58] <volans>	 claime: your link doesn't work
[14:18:10] <volans>	 redirects to the rw-grafana home
[14:18:15] <claime>	 volans: because i'm logged in probably, great
[14:18:23] <volans>	 logging in
[14:18:47] <volans>	 nope, same...
[14:19:10] <claime>	 https://grafana.wikimedia.org/goto/3XnvqV-IR?orgId=1
[14:19:18] <volans>	 thx
[14:20:05] <jynus>	 I saw this, but it looks far from massive: https://commons.wikimedia.org/wiki/Special:Log?type=upload&user=Trade&page=&wpdate=&tagfilter=&wpfilters%5B%5D=newusers&wpFormIdentifier=logeventslist
[14:20:53] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4052.ulsfo.wmnet
[14:20:56] <volans>	 claime: are the docs still valid in the k8s world?
[14:20:57] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:21:10] <claime>	 volans: This is not k8s
[14:21:11] <volans>	 I see many more hosts in codfw than eqiad
[14:21:13] <jynus>	 I don't think we can get gameplays on commons, but not a current concern
[14:21:24] <claime>	 These are the only remnants of bare metal for jobs: the videoscalers
[14:21:27] <volans>	 also different weights
[14:21:36] <volans>	 for that matter
[14:21:51] <jynus>	 do you have metrics of # of currently enqueued or pending jobs?
[14:21:55] <logmsgbot>	 !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4052.ulsfo.wmnet
[14:22:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:22:50] <claime>	 volans: Yes, because in codfw the hosts have different CPUs, so they are weighted differently
[14:22:51] <volans>	 !incidents
[14:22:51] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[14:22:51] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[14:22:55] <claime>	 Shouldn't be the case in eqiad
[14:22:57] <volans>	 ok
[14:23:13] <claime>	 In any case, all transcodes are done in the primary DC
[14:23:56] <claime>	 but as far as workers go, you're right, we have a big imbalance between codfw and eqiad
[14:23:57] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:22] <claime>	 We may have gone a bit too fast in reimaging jobrunners in eqiad
[14:24:32] <volans>	 see also https://grafana.wikimedia.org/goto/RyCxq4aSg?orgId=1
[14:24:39] <volans>	 acked
[14:25:09] <volans>	 the cluster is totally CPU-bound
[14:25:16] <wikibugs>	 (CR) Jforrester: Implementing security.txt standard (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: Mmartorana)
[14:25:21] <jynus>	 I believe in the past there was some bad balancing on transcoding jobs, or at least I remember mentions of it when there were mass video uploads
[14:25:25] <urandom>	 those are ominous graphs
[14:25:34] <claime>	 it doesn't correlate to an increase in jobs 
[14:25:37] <claime>	 That's what I don't like
[14:26:01] <claime>	 https://grafana.wikimedia.org/goto/_rPL34-IR?orgId=1
[14:26:05] <claime>	 Even the prioritized ones
[14:26:09] <volans>	 filesystem usage went from 10% to 40% and growing
[14:26:17] <volans>	 for /
[14:26:31] <akosiaris>	 sigh
[14:26:34] * akosiaris around
[14:26:37] <volans>	 so something weird is happening
[14:26:46] <volans>	 that uses a lot of disk on those hosts
[14:26:56] <volans>	 very large videos to encode?
[14:27:20] <claime>	 or shellbox not cleaning up after itself
[14:27:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:43] <akosiaris>	 w1437:~$ pgrep -f /usr/bin/ffmpeg  |wc -l
[14:27:44] <akosiaris>	 432
[14:27:45] <akosiaris>	 wow
[14:27:49] <volans>	 disk IOs are not crazy high
[14:28:09] <jynus>	 do you have a wiki? I see nothing on commons
[14:28:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P60250 and previous config saved to /var/cache/conftool/dbconfig/20240410-142829-arnaudb.json
[14:28:57] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:39] <claime>	 akosiaris: did you kill processes?
[14:29:41] <volans>	 !incidents
[14:29:42] <sirenbot>	 4578 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[14:29:42] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[14:29:42] <claime>	 It says 72 now
[14:29:42] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[14:29:44] <akosiaris>	 no
[14:29:56] <claime>	 Ah no
[14:29:58] <claime>	 -f
[14:30:06] <akosiaris>	 still 432 for me 
[14:30:23] <volans>	 yep
[14:30:24] <volans>	 $ pgrep -cf /usr/bin/ffmpeg
[14:30:24] <volans>	 414
[14:30:31] <volans>	 I'm on another host
[14:31:07] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Update third-party/matomo repository definition [puppet] - https://gerrit.wikimedia.org/r/1018680 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[14:31:29] <volans>	 is there a simple log to tail to check what they are encoding?
[14:31:33] <volans>	 the ffmpeg I mean
[14:31:57] <akosiaris>	 not that I can remember
[14:32:03] <volans>	 sigh
[14:32:35] <volans>	 is it possible it's the same set of videos over and over that maybe fail and get re-enqueued?
[14:32:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P60251 and previous config saved to /var/cache/conftool/dbconfig/20240410-143249-marostegui.json
[14:33:02] <akosiaris>	 this started around 8am 
[14:33:30] <jynus>	 ah, then let me search earlier
[14:33:40] <volans>	 at 7:30 there was a sync file
[14:33:52] <volans>	 at 10 scp
[14:33:54] <volans>	 *scap
[14:34:06] <akosiaris>	 yeah first one was me deploying a helm chart change for /docs/
[14:34:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:34:17] <volans>	 that's the one more aligned :D
[14:34:21] <akosiaris>	 and the next one was eff.ie for adding the mcrouter env var
[14:34:22] <volans>	 although seems unrelated
[14:34:28] <claime>	 https://logstash.wikimedia.org/goto/3d82a32f7c64af5adedc10122bc5c5f1
[14:34:46] <wikibugs>	 (PS1) Majavah: alertmanager: karma: Set group too [puppet] - https://gerrit.wikimedia.org/r/1018727
[14:34:57] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:58] <claime>	 It's not that many events
[14:35:07] <claime>	 although it's only the errors
[14:35:08] <volans>	 acked
[14:35:45] <claime>	 we started having webVideoTranscode job execution errors around 3:00 in the morning actually
[14:36:25] <wikibugs>	 (CR) Andrew Bogott: [C:+1] alertmanager: karma: Set group too [puppet] - https://gerrit.wikimedia.org/r/1018727 (owner: Majavah)
[14:36:50] <hnowlan>	 logstash showing errors like "estimated file size 2784886 KiB over soft limit 2097152 KiB"
[14:36:52] <volans>	 claime, akosiaris: do we need to create an incident and start calling people? I can do IC
[14:37:22] <akosiaris>	 estimated file size 8755934 KiB over hard limit 3145728 KiB
[14:37:30] <akosiaris>	 for a random failure
[14:37:33] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:37:44] <akosiaris>	 volans: yeah, sure, makes sense by now
[14:37:54] <volans>	 ok opening doc
[14:37:56] <wikibugs>	 SRE, Observability-Alerting: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838#9704024 (lmata)
[14:38:01] * volans becomes IC
[14:38:28] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:55] <akosiaris>	 it's not even a lot of requests per apache logs
[14:39:00] <akosiaris>	 just some weird video ?
[14:39:17] <volans>	 that's my guess
[14:39:22] <volans>	 but no hard evidence
[14:39:36] <jynus>	 if it is just 1 video, maybe jobs can be killed and later restarted for mitigation?
[14:39:55] <hnowlan>	 just in case should we reduce the concurrency of video transcoding jobs? 
[14:40:20] <akosiaris>	 there isn't btw any serious impact to anything (aside from the one on oncallers getting pages). Jobs aren't on the videoscalers for some time now
[14:40:43] <volans>	 doc is https://docs.google.com/document/d/1k9eYWPpY8QsKfLpLgYXsLPaHTN4y1m5serPOv8H5Gd8/edit#heading=h.95p2g5d67t9q
[14:41:01] <akosiaris>	 hnowlan: I suppose it won't hurt
[14:41:04] <akosiaris>	 wanna do that ?
[14:41:27] <claime>	 The errors are from TMH
[14:41:30] <jynus>	 this was a video uploaded after 3am and transcoding seems stuck since then: https://commons.wikimedia.org/w/index.php?title=File:Key_Bridge_Response_Photos_(240401-G-TL908-2303).webm&action=history
[14:41:44] <hnowlan>	 akosiaris: sure
[14:42:08] <hnowlan>	 afaict there are a few filenames that show up repeatedly fwiw, but not over long periods of time 
[14:42:31] <jynus>	 it was renamed after being uploaded
[14:42:34] <volans>	 akosiaris: do you think that throwing more hosts at the cluster would help?
[14:43:27] <wikibugs>	 (PS1) Hnowlan: jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1018729
[14:43:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T360332)', diff saved to https://phabricator.wikimedia.org/P60252 and previous config saved to /var/cache/conftool/dbconfig/20240410-144336-arnaudb.json
[14:43:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[14:43:42] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[14:43:46] <akosiaris>	 volans: sure, but in fact on the legacy infra we've never done that IIRC
[14:43:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[14:44:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T360332)', diff saved to https://phabricator.wikimedia.org/P60253 and previous config saved to /var/cache/conftool/dbconfig/20240410-144400-arnaudb.json
[14:44:02] <akosiaris>	 it's video transcoding, it can be delayed quite a bit, no harm done
[14:44:15] <volans>	 ok
[14:44:22] <akosiaris>	 we used to separate the 2 clusters (jobrunners vs videoscalers) functionally when we had incidents like these
[14:44:39] <akosiaris>	 making sure that video transcoding wouldn't consume resources meant for jobs
[14:44:44] <claime>	 Now the only resources we have for videoscaling are those hosts
[14:44:52] <akosiaris>	 but in this case, we no longer have jobs on that cluster
[14:45:03] <claime>	 Or we repurpose some appservers quickly and throw them at it
[14:45:04] <volans>	 ack
[14:45:21] <akosiaris>	 the more I think about it, the more I start to wonder whether ACKing it for say 20 hours is ok. 
[14:45:46] <hnowlan>	 fwiw up until now, 4 was *plenty* for videoscaling 
[14:45:59] <wikibugs>	 SRE, Observability-Logging, SRE Observability (FY2023/2024-Q4): Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998#9704059 (fgiunchedi)
[14:46:06] <akosiaris>	 now, if I could isolate whatever file is causing the issue and just move it to the back of the queue, it would be better
[14:46:07] <jynus>	 I believe something like that was done last time, akosiaris, and someone told us "not to worry"
[14:46:13] <hnowlan>	 concurrency reduction: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018729  
[14:46:27] <akosiaris>	 jynus: yeah it's not the first time. It's like the nth occurrence
[14:46:40] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1018729 (owner: Hnowlan)
[14:46:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T360332)', diff saved to https://phabricator.wikimedia.org/P60254 and previous config saved to /var/cache/conftool/dbconfig/20240410-144644-arnaudb.json
[14:46:59] <hnowlan>	 the queue itself doesn't seem to have spiked in a notable fashion either 
[14:47:23] <jynus>	 akosiaris: please do, my bet is on that boat video, but I have 0 proof
[14:47:26] <hnowlan>	 steady at .05 jobs/s  or so 
[14:47:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:47:38] <akosiaris>	 actually let's do something. Let's get that patch ^ deployed (I +1ed already) and I'll kill ffmpegs on 1 host and increase their weight 
[14:47:44] <volans>	 so I think we need to identify if we got a bunch of weird videos that create problems
[14:47:51] <volans>	 or the code started to have issues
[14:47:55] <wikibugs>	 (CR) Hnowlan: [C:+2] jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1018729 (owner: Hnowlan)
[14:47:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P60255 and previous config saved to /var/cache/conftool/dbconfig/20240410-144757-marostegui.json
[14:47:58] <volans>	 or the infra started to have issues
[14:48:08] <akosiaris>	 volans: bunch? I am willing to bet it's one
[14:48:17] <akosiaris>	 jynus: how did you identify that one?
[14:48:23] <jynus>	 volans: my suspicion is on that upload + rename
[14:48:32] <volans>	 akosiaris: how can one affect all hosts at the same time?
[14:48:32] <jynus>	 akosiaris: first video stuck after 3am
[14:48:42] <wikibugs>	 (Merged) jenkins-bot: jobqueue: reduce webvideotranscode concurrency temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1018729 (owner: Hnowlan)
[14:49:01] <jynus>	 checked commons on special new files and filtered by video
[14:49:06] <jynus>	 let me get you an url
[14:49:17] <jynus>	 akosiaris: https://commons.wikimedia.org/wiki/Special:NewFiles?user=&showbots=1&mediatype%5B%5D=VIDEO&start=&end=&wpFormIdentifier=specialnewimages&limit=500&offset=
[14:49:55] <jynus>	 the rename makes it especially suspicious (not the first time a rename breaks things due to a mw bug)
[14:50:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:50:29] <jynus>	 but please check if you have a way to compare it with log or processes
[14:50:42] <jynus>	 at the moment it is just a guess
[14:50:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:51:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:51:15] <akosiaris>	 it's not even a really big video. 435MB ? 
[14:51:24] <jynus>	 it doesn't match the size logs, right
[14:51:35] <claime>	 cgoubert@mw1437:/var/log$ ps aux | grep 37888 < oldest pid I could find
[14:51:49] <claime>	 Wed Apr 10 06:41:55
[14:51:58] <akosiaris>	 however you got a point that it hasn't managed to get transcoded
[14:51:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:52:03] <akosiaris>	 whereas the next one has
[14:52:12] <jynus>	 so it could be just a symptom, not a cause
[14:52:19] <jynus>	 url?
[14:52:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:52:35] <volans>	 cwhite: you around?
[14:53:03] <claime>	 jynus: harder to find x)
[14:53:11] <volans>	 !incidents
[14:53:12] <sirenbot>	 4579 (ACKED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[14:53:12] <sirenbot>	 4578 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[14:53:12] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[14:53:12] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[14:53:14] <akosiaris>	 ok, hnowlan is done, I 'll kill ffmpegs in the 1st host
[14:53:19] <akosiaris>	 and adjust weight 
[14:53:26] <volans>	 ack
[14:53:27] <akosiaris>	 that should hopefully patch the bleeding
[14:53:40] <hnowlan>	 soft limited jobs can be manually retried via the mw UI btw 
[14:53:49] <hnowlan>	 if that's an issue, but I don't think it is 
[14:54:02] <hnowlan>	 but just a til https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/master/includes/WebVideoTranscode/WebVideoTranscodeJob.php#L652 
[14:54:50] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=30; selector: name=mw1437.*.wmnet,dc=eqiad
[14:54:57] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:55:47] <cwhite>	 volans: in a meeting, but yes
[14:55:48] <akosiaris>	 !log kill all ffmpegs on mw1437 and increase weight of mw1437 from 10 to 30 to direct most queries to it while the other 3 videoscalers serve the backlog
[14:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
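Editor's note: the effect of the weight change logged above can be estimated with a little arithmetic. The sketch below assumes the load balancer splits requests in proportion to weight and that the other three videoscalers stay at weight 10. Neither assumption is confirmed in this log, and the host names other than mw1437 are placeholders.

```python
def traffic_share(weights, host):
    """Return the fraction of requests a host receives under weight-proportional balancing."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return weights[host] / total


# Hypothetical pool: only mw1437 and its weights (10, then 30) come from the log.
before = {"mw1437": 10, "scaler-b": 10, "scaler-c": 10, "scaler-d": 10}
after = dict(before, mw1437=30)

print(f"before: {traffic_share(before, 'mw1437'):.0%}")
print(f"after:  {traffic_share(after, 'mw1437'):.0%}")
```

Under these assumptions mw1437 goes from a quarter of new requests to half of them, leaving the other three hosts more room to work through their backlog.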
[14:55:59] <volans>	 cwhite: page >>> meeting ;)
[14:56:49] <volans>	 akosiaris: this is assuming the issue doesn't get re-enqueued, correct?
[14:57:18] <akosiaris>	 yes
[14:57:25] <akosiaris>	 we are at 63 ffmpeg right now 
[14:57:30] <akosiaris>	 on mw1437
[14:57:36] <akosiaris>	 so, already more than the CPUs
[14:57:44] <akosiaris>	 but it's not rising exponentially or anything
[14:58:15] <jinxer-wm>	 (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[14:58:38] <moritzm>	 !log installing debian-archive-keyring updates on buster
[14:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:10] <akosiaris>	 can't the "resolved" and firing be the first thing in those messages ^ and in caps ? 
[14:59:28] <akosiaris>	 it would make my IRC life a tad easier
[14:59:45] <volans>	 indeed
[14:59:46] <jynus>	 +1
[15:00:23] <wikibugs>	 ops-eqiad, SRE, observability, SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9704116 (VRiley-WMF) That works for me. I'll be there to assist with it. Thank you!
[15:01:36] <volans>	 added a couple of action items to the doc
[15:01:41] <volans>	 including the above
[15:01:45] <hnowlan>	 https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqiad%20prometheus%2Fk8s&from=now-3d&to=now&var-job=webVideoTranscodePrioritized you can see the impact increasing from 00:00 on the 9th
[15:01:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P60256 and previous config saved to /var/cache/conftool/dbconfig/20240410-150152-arnaudb.json
[15:02:23] <hnowlan>	 processing rates up, backlog becomes more consistent (even if the time remains similar) 
[15:02:36] <hnowlan>	 we should reduce concurrency on prioritised in light of that I'd say 
[15:03:01] <wikibugs>	 (PS6) Majavah: hieradata: Add CDN config for toolsadmin-toolsbeta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1018665 (https://phabricator.wikimedia.org/T360025)
[15:03:01] <wikibugs>	 (PS1) Majavah: hieradata: Update striker container to add staging env warning [puppet] - https://gerrit.wikimedia.org/r/1018730 (https://phabricator.wikimedia.org/T254598)
[15:03:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T356166)', diff saved to https://phabricator.wikimedia.org/P60257 and previous config saved to /var/cache/conftool/dbconfig/20240410-150304-marostegui.json
[15:03:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[15:03:14] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[15:03:20] <denisse>	 akosiaris: re: firing and resolved first thing in the message. That's a good idea, I'll discuss this with o11y to see if we can modify it.
[15:03:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[15:03:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T356166)', diff saved to https://phabricator.wikimedia.org/P60258 and previous config saved to /var/cache/conftool/dbconfig/20240410-150327-marostegui.json
[15:03:41] <wikibugs>	 (PS1) Hnowlan: jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - https://gerrit.wikimedia.org/r/1018731
[15:03:53] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Update striker container to add staging env warning [puppet] - https://gerrit.wikimedia.org/r/1018730 (https://phabricator.wikimedia.org/T254598) (owner: Majavah)
[15:04:56] <hnowlan>	 concurrency reductions for the other job https://gerrit.wikimedia.org/r/101873
[15:07:13] <volans>	 hnowlan: missing a digit ;)
[15:07:23] <volans>	 that's from 2013 :D
[15:07:47] <cwhite>	 this one?: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018731
[15:08:07] <hnowlan>	 hehhh yes
[15:08:38] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1018731 (owner: Hnowlan)
[15:08:42] <hnowlan>	 what do you mean, don't we need Portuguese wikibooks for this problem
[15:08:49] <volans>	 lol
[15:08:55] <wikibugs>	 (PS2) Hnowlan: jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - https://gerrit.wikimedia.org/r/1018731
[15:09:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:10:51] <wikibugs>	 (CR) Hnowlan: [C:+2] jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - https://gerrit.wikimedia.org/r/1018731 (owner: Hnowlan)
[15:11:07] <volans>	 !incide
[15:11:09] <volans>	 !incidents
[15:11:10] <sirenbot>	 4579 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[15:11:10] <sirenbot>	 4578 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[15:11:10] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[15:11:10] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[15:11:39] <volans>	 I've tried to put a summary of the actions taken, but please adjust it if I misrepresented anything
[15:11:56] <wikibugs>	 (Merged) jenkins-bot: jobqueue: temporarily reduce prioritised video transcodes [deployment-charts] - https://gerrit.wikimedia.org/r/1018731 (owner: Hnowlan)
[15:12:40] <wikibugs>	 (PS2) Jcrespo: mariadb: Migrate db2097 backups to db2197 [puppet] - https://gerrit.wikimedia.org/r/1018247 (https://phabricator.wikimedia.org/T360751)
[15:13:35] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:14:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:14:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:14:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:17:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P60259 and previous config saved to /var/cache/conftool/dbconfig/20240410-151659-arnaudb.json
[15:23:28] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:23:32] <hnowlan>	 might be okay to call this one resolved or at least mitigated? 
[15:24:04] <hnowlan>	 no entries in mediawiki-errors since the kills
[15:24:08] <cwhite>	 +1 from me
[15:24:23] <wikibugs>	 (PS2) BCornwall: ssl_ciphersuite: Reorder suite preferences [puppet] - https://gerrit.wikimedia.org/r/1018356 (https://phabricator.wikimedia.org/T362197)
[15:25:24] <hnowlan>	 RPS and 200 rate isn't quite back to normal but it's recovering 
[15:25:35] <volans>	 hnowlan: it works for me, as you want, the CPU is still fairly close to 100%
[15:25:40] <volans>	 but I'll let you decide
[15:25:54] <hnowlan>	 let's keep an eye for another while 
[15:26:01] <volans>	 it's ~90% on 1437
[15:26:06] <volans>	 and stuck at 100% on the others
[15:26:40] <volans>	 but yes it looks promising
[15:26:43] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] ssl_ciphersuite: Reorder suite preferences [puppet] - https://gerrit.wikimedia.org/r/1018356 (https://phabricator.wikimedia.org/T362197) (owner: BCornwall)
[15:26:51] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] ncredir: Set ssl_ciphersuite to strong [puppet] - https://gerrit.wikimedia.org/r/1018355 (https://phabricator.wikimedia.org/T362197) (owner: BCornwall)
[15:27:26] <volans>	 urandom, cwhite: any of you that could take over IC? I'll be off-call in ~3 minutes
[15:27:32] <volans>	 !incide
[15:27:34] <volans>	 !incidents
[15:27:34] <sirenbot>	 4579 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[15:27:35] <sirenbot>	 4578 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[15:27:35] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[15:27:35] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[15:27:47] <cwhite>	 I can take it
[15:27:49] * volans insists on expecting <tab> to work
[15:28:15] <volans>	 thanks
[15:29:15] <cwhite>	 Is there a runbook we followed to reduce the load on the jobrunners?
[15:29:56] <claime>	 No, not that I know of
[15:30:08] <claime>	 ad-hoc work from a.kosiaris and h.nowlan
[15:30:45] <volans>	 also the runbook is outdated, adding it to the doc
[15:31:07] <cwhite>	 Thanks :)
[15:32:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T360332)', diff saved to https://phabricator.wikimedia.org/P60260 and previous config saved to /var/cache/conftool/dbconfig/20240410-153207-arnaudb.json
[15:32:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[15:32:12] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[15:32:22] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[15:32:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T360332)', diff saved to https://phabricator.wikimedia.org/P60261 and previous config saved to /var/cache/conftool/dbconfig/20240410-153229-arnaudb.json
[15:33:28] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:06] <wikibugs>	 (PS2) Elukey: kask: allow to configure tls options [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647)
[15:35:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T360332)', diff saved to https://phabricator.wikimedia.org/P60262 and previous config saved to /var/cache/conftool/dbconfig/20240410-153516-arnaudb.json
[15:37:50] <wikibugs>	 (PS3) Elukey: kask: allow to configure tls options [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647)
[15:43:05] <wikibugs>	 (PS1) BCornwall: Revert "ssl_ciphersuite: Reorder suite preferences" [puppet] - https://gerrit.wikimedia.org/r/1018690
[15:46:07] <wikibugs>	 (CR) CI reject: [V:-1] Revert "ssl_ciphersuite: Reorder suite preferences" [puppet] - https://gerrit.wikimedia.org/r/1018690 (owner: BCornwall)
[15:46:18] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] "08:44 <vgutierrez> IRC +1" [puppet] - https://gerrit.wikimedia.org/r/1018690 (owner: BCornwall)
[15:46:54] <wikibugs>	 (PS2) BCornwall: Revert "ssl_ciphersuite: Reorder suite preferences" [puppet] - https://gerrit.wikimedia.org/r/1018690
[15:48:28] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:50:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P60264 and previous config saved to /var/cache/conftool/dbconfig/20240410-155024-arnaudb.json
[15:52:50] <wikibugs>	 (CR) CI reject: [V:-1] Create a new and profile for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[15:52:55] <wikibugs>	 (PS1) Daniel Kinzler: LogStash: log HtmlOutputRendererHelper channel [mediawiki-config] - https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157)
[15:53:53] <wikibugs>	 (PS5) Elukey: kask: allow to configure tls options [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647)
[15:55:36] <wikibugs>	 (PS1) Btullis: Add dummy data for the new matomo service. [labs/private] - https://gerrit.wikimedia.org/r/1018739 (https://phabricator.wikimedia.org/T351552)
[15:56:13] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] Add dummy data for the new matomo service. [labs/private] - https://gerrit.wikimedia.org/r/1018739 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[15:58:32] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[16:01:13] <wikibugs>	 (PS2) Btullis: Create a new and profile for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552)
[16:03:33] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[16:04:12] <wikibugs>	 (CR) Mmartorana: Implementing security.txt standard (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1010971 (https://phabricator.wikimedia.org/T337949) (owner: Mmartorana)
[16:04:21] <wikibugs>	 (CR) CI reject: [V:-1] Create a new and profile for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[16:05:28] <Lucas_WMDE>	 jouncebot: now
[16:05:28] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 54 minute(s)
[16:05:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P60265 and previous config saved to /var/cache/conftool/dbconfig/20240410-160531-arnaudb.json
[16:05:54] <Lucas_WMDE>	 I’ll try to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1007643 then if that’s alright :)
[16:07:01] <Lucas_WMDE>	 hm, or maybe not, there are uncommitted changes in `/src/deployment-charts` o_O
[16:10:01] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): termbox: update to 2024-03-14-121904-production [deployment-charts] - https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239)
[16:10:05] <wikibugs>	 (PS1) Hashar: TitleLibrary: Don't register external titles as dependencies [extensions/Scribunto] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222)
[16:10:25] <wikibugs>	 (CR) Eevans: kask: allow to configure tls options (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[16:10:29] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+2] "I’ll deploy this now." [deployment-charts] - https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) (owner: Lucas Werkmeister (WMDE))
[16:10:34] <wikibugs>	 (CR) Hashar: "I'll deploy it tomorrow unless someone does it tonight :)" [extensions/Scribunto] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) (owner: Hashar)
[16:11:23] <wikibugs>	 (Merged) jenkins-bot: termbox: update to 2024-03-14-121904-production [deployment-charts] - https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T343239) (owner: Lucas Werkmeister (WMDE))
[16:12:01] <wikibugs>	 (CR) Elukey: kask: allow to configure tls options (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[16:12:14] <swfrench-wmf>	 !log uploaded etcd-mirror 0.0.11-1 to apt.wikimedia.org (T358636)
[16:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:26] <stashbot>	 T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636
[16:13:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply
[16:13:45] <wikibugs>	 (CR) Eevans: [C:+1] kask: allow to configure tls options (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[16:14:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[16:14:04] <wikibugs>	 (CR) FNegri: [C:+1] alertmanager: karma: Set group too [puppet] - https://gerrit.wikimedia.org/r/1018727 (owner: Majavah)
[16:14:47] <Lucas_WMDE>	 test wikidata termbox seems to work
[16:14:57] <Lucas_WMDE>	 staging too
[16:15:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply
[16:15:29] <wikibugs>	 (CR) Elukey: [C:+2] kask: allow to configure tls options [deployment-charts] - https://gerrit.wikimedia.org/r/1018722 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[16:15:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply
[16:16:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[16:16:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[16:16:49] <Lucas_WMDE>	 hm, eqiad’s being a bit slower than codfw
[16:16:52] <Lucas_WMDE>	 ah, there it goes :)
[16:17:51] <Lucas_WMDE>	 real wikidata termbox also looking good
[16:17:56] * Lucas_WMDE done
[16:19:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[16:19:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[16:19:50] <wikibugs>	 (CR) Eevans: [C:+1] Force PKI TLS certs for cassandra instances on aqs1010 [puppet] - https://gerrit.wikimedia.org/r/1018309 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[16:20:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T360332)', diff saved to https://phabricator.wikimedia.org/P60267 and previous config saved to /var/cache/conftool/dbconfig/20240410-162039-arnaudb.json
[16:20:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[16:20:44] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[16:20:54] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[16:21:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T360332)', diff saved to https://phabricator.wikimedia.org/P60268 and previous config saved to /var/cache/conftool/dbconfig/20240410-162101-arnaudb.json
[16:23:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T360332)', diff saved to https://phabricator.wikimedia.org/P60269 and previous config saved to /var/cache/conftool/dbconfig/20240410-162344-arnaudb.json
[16:23:54] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Migrate db2097 backups to db2197 [puppet] - https://gerrit.wikimedia.org/r/1018247 (https://phabricator.wikimedia.org/T360751) (owner: Jcrespo)
[16:26:53] <wikibugs>	 (PS1) Eevans: echostore: configure TLS verification in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1018742 (https://phabricator.wikimedia.org/T352647)
[16:34:26] <wikibugs>	 (CR) Hashar: logging: default to log any error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[16:36:18] <wikibugs>	 (PS3) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[16:37:06] <wikibugs>	 (CR) Hashar: "Ideally we would have a CI job that diff the effective configuration between the proposed change and its parent commit :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[16:38:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P60270 and previous config saved to /var/cache/conftool/dbconfig/20240410-163851-arnaudb.json
[16:39:26] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[16:39:34] <wikibugs>	 (CR) Hashar: [C:+2] TitleLibrary: Don't register external titles as dependencies [extensions/Scribunto] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) (owner: Hashar)
[16:40:14] <wikibugs>	 (CR) Dzahn: [C:+1] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[16:41:02] <hashar>	 I am backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1018691 which causes some log spam as part of this week train
[16:41:10] <hashar>	 should be on time for the next deployment window which has a comment about the deploy happening in the second half of the scheduled window
[16:42:00] <hashar>	 swfrench-wmf: ^ :)
[16:42:12] <hashar>	 I should be done in 30 minutes
[16:42:40] <wikibugs>	 (PS1) CDobbins: purged: add PKI cert handling [puppet] - https://gerrit.wikimedia.org/r/1018743
[16:43:23] <swfrench-wmf>	 hashar: ack, thank you! yes, I won't be starting until 17:30 or so
[16:43:30] <wikibugs>	 (CR) Dzahn: "seems like this broke puppet https://phabricator.wikimedia.org/P60271" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[16:44:08] <wikibugs>	 (CR) Dzahn: "could not parse expression: 1:87: parse error: unexpected "{"" [puppet] - https://gerrit.wikimedia.org/r/1018255 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[16:46:36] <wikibugs>	 (PS2) Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751)
[16:46:37] <wikibugs>	 (PS1) Jcrespo: mariadb: Reenable notifications for db2199, db2200 after setup [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422)
[16:50:05] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:50:22] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:50:29] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:50:35] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:52:15] <jinxer-wm>	 (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[16:54:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P60272 and previous config saved to /var/cache/conftool/dbconfig/20240410-165359-arnaudb.json
[16:54:09] <wikibugs>	 (PS9) CDobbins: purged: add PKI cert handling [puppet] - https://gerrit.wikimedia.org/r/1017321
[16:55:27] <wikibugs>	 (PS1) Andrea Denisse: prometheus: Ensure the Benthos metrics are correctly parsed [puppet] - https://gerrit.wikimedia.org/r/1018745 (https://phabricator.wikimedia.org/T361845)
[16:55:39] <wikibugs>	 (CR) Jcrespo: "Another 2 hosts setup now:" [puppet] - https://gerrit.wikimedia.org/r/1018744 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[16:56:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[16:56:38] <wikibugs>	 (PS1) Fabfur: prometheus: fix typo in aggregate rules [puppet] - https://gerrit.wikimedia.org/r/1018746 (https://phabricator.wikimedia.org/T361845)
[16:56:54] <hnowlan>	 jobrunners are still looking quite spicy 
[16:57:23] <wikibugs>	 (CR) CI reject: [V:-1] purged: add PKI cert handling [puppet] - https://gerrit.wikimedia.org/r/1017321 (owner: CDobbins)
[16:57:25] <wikibugs>	 (CR) Fabfur: [C:+1] "thanks for fixing this!" [puppet] - https://gerrit.wikimedia.org/r/1018745 (https://phabricator.wikimedia.org/T361845) (owner: Andrea Denisse)
[16:58:01] <wikibugs>	 (Abandoned) Fabfur: prometheus: fix typo in aggregate rules [puppet] - https://gerrit.wikimedia.org/r/1018746 (https://phabricator.wikimedia.org/T361845) (owner: Fabfur)
[16:59:07] <wikibugs>	 (Merged) jenkins-bot: TitleLibrary: Don't register external titles as dependencies [extensions/Scribunto] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018691 (https://phabricator.wikimedia.org/T362222) (owner: Hashar)
[16:59:27] <wikibugs>	 (PS3) Btullis: Create a new and profile for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552)
[17:00:05] <jouncebot>	 swfrench-wmf: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1700).
[17:00:45] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[17:00:58] <swfrench-wmf>	 holding until 17:30 UTC
[17:01:34] <hashar>	 :)
[17:01:59] <hnowlan>	 I propose killing ffmpegs that have been running for more than say 7 hours on videoscalers to free things up 
[17:02:27] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:1018691|TitleLibrary: Don't register external titles as dependencies (T362222)]]
[17:02:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:02:43] <stashbot>	 T362222: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. [Called from MediaWiki\Extension\Scribunto\Engines\LuaCommon\TitleLibrary::getContentInternal] - https://phabricator.wikimedia.org/T362222
[17:02:54] <hnowlan>	 !log killing long-running videoscaler ffmpegs 
[17:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:06] <hashar>	 you are on your own hnowlan :)
[17:03:16] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[17:03:41] <hashar>	 I don't know anything about the long tail of expected time to do a video transcoding
[17:04:09] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:04:16] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[17:04:26] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:04:29] <sukhe>	 !log depool cp1115 for firmware downgrade for PXE boot testing: T350179
[17:04:32] <wikibugs>	 (CR) Andrea Denisse: [C:+2] prometheus: Ensure the Benthos metrics are correctly parsed [puppet] - https://gerrit.wikimedia.org/r/1018745 (https://phabricator.wikimedia.org/T361845) (owner: Andrea Denisse)
[17:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:07] <stashbot>	 T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179
[17:05:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1115.eqiad.wmnet,service=(cdn|ats-be)
[17:05:42] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1115.eqiad.wmnet
[17:06:28] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1115.eqiad.wmnet
[17:06:36] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1115.eqiad.wmnet
[17:07:29] <logmsgbot>	 !log hashar@deploy1002 hashar: Backport for [[gerrit:1018691|TitleLibrary: Don't register external titles as dependencies (T362222)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:07:32] <logmsgbot>	 !log hashar@deploy1002 hashar: Continuing with sync
[17:07:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:07:44] <stashbot>	 T362222: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. [Called from MediaWiki\Extension\Scribunto\Engines\LuaCommon\TitleLibrary::getContentInternal] - https://phabricator.wikimedia.org/T362222
[17:07:49] <wikibugs>	 (PS10) CDobbins: purged: add PKI cert handling [puppet] - https://gerrit.wikimedia.org/r/1017321
[17:09:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T360332)', diff saved to https://phabricator.wikimedia.org/P60274 and previous config saved to /var/cache/conftool/dbconfig/20240410-170907-arnaudb.json
[17:09:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[17:09:21] <wikibugs>	 (CR) Andrea Denisse: [V:+2 C:+2] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - https://gerrit.wikimedia.org/r/1018724 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[17:09:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[17:09:26] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[17:09:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T360332)', diff saved to https://phabricator.wikimedia.org/P60275 and previous config saved to /var/cache/conftool/dbconfig/20240410-170930-arnaudb.json
[17:10:19] <hashar>	 pff the canaries are failing
[17:10:41] <wikibugs>	 (PS1) Dzahn: create wikipedia-sysop-pl.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041)
[17:10:54] <hashar>	 (Avg. errors per 10 seconds: Before: 0.10, After: 4.00, Threshold: 1.01)
[17:11:05] <wikibugs>	 (CR) Dzahn: "Amir, how about this alternative?" [dns] - https://gerrit.wikimedia.org/r/1018747 (https://phabricator.wikimedia.org/T361041) (owner: Dzahn)
[17:11:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:12:02] <hashar>	 and that is not related
[17:12:07] * hashar retry
[17:12:12] <hashar>	 retries
[17:12:13] <hashar>	 err
[17:12:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T360332)', diff saved to https://phabricator.wikimedia.org/P60276 and previous config saved to /var/cache/conftool/dbconfig/20240410-171229-arnaudb.json
[17:14:29] <wikibugs>	 (PS4) Btullis: Create a new and profile for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552)
[17:14:48] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1115.eqiad.wmnet
[17:15:59] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[17:16:31] <hashar>	 so it passed this time
[17:16:39] <hashar>	 and I am now waiting for kubernetes
[17:21:20] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1018691|TitleLibrary: Don't register external titles as dependencies (T362222)]] (duration: 18m 53s)
[17:21:28] <hashar>	 swfrench-wmf: done! :)
[17:21:36] <stashbot>	 T362222: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. [Called from MediaWiki\Extension\Scribunto\Engines\LuaCommon\TitleLibrary::getContentInternal] - https://phabricator.wikimedia.org/T362222
[17:22:01] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Create a new and profile for the new matomo server [puppet] - https://gerrit.wikimedia.org/r/1018737 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[17:23:26] <swfrench-wmf>	 hashar: ack - thank you!
[17:25:39] <wikibugs>	 (PS1) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[17:27:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:27:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P60277 and previous config saved to /var/cache/conftool/dbconfig/20240410-172736-arnaudb.json
[17:29:18] <cwhite>	 urandom, hnowlan: I think we're seeing jobrunner beginning to overload again.
[17:29:46] <hnowlan>	 yeah I just did a few changes to no avail 
[17:32:33] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:33:57] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:34:06] <cwhite>	 and there's the page
[17:34:32] <urandom>	 ya
[17:34:46] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1115.eqiad.wmnet
[17:35:11] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp1115.eqiad.wmnet
[17:35:26] <cwhite>	 looks like the last action taken was to kill all ffmpegs on mw1437
[17:36:29] <urandom>	 hnowlan: what else did you try?
[17:36:32] <wikibugs>	 (CR) Muehlenhoff: prometheus: Ensure TLS certificates are provided by CFSSL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[17:37:21] <cwhite>	 I'm preparing to kill ffmpegs on mw1437 - any objections?
[17:37:22] <swfrench-wmf>	 !log restarting etcd-mirror on conf2005.codfw.wmnet for T358636
[17:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:28] <stashbot>	 T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636
[17:37:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:37:35] <wikibugs>	 (CR) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[17:37:55] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[17:37:57] <hnowlan>	 cwhite: maybe hold 
[17:38:04] * cwhite holds
[17:38:06] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9704892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS b...
[17:38:20] <hnowlan>	 as it stands we're getting slow performance, but killing all jobs will actually cause errors 
[17:38:24] <hnowlan>	 they'll retry in most cases 
[17:38:38] <hnowlan>	 killing problematic jobs might be a better route although it hasn't won out yet
[17:38:57] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:39:18] <hnowlan>	 I dunno though, it's late here and I don't have any good options 
[17:41:16] <urandom>	 hnowlan: you said earlier you tried a few things, anything that warrants noting in the doc?
[17:41:18] <hnowlan>	 we killed all jobs on mw1437 earlier and now loads are back up to around the same 
[17:41:26] <hnowlan>	 urandom: changed the concurrency of the jobs 
[17:41:44] <urandom>	 after the initial change?  from 10 to 5, and 5 to 3?
[17:41:59] <urandom>	 or is that what you were referring to
[17:42:09] <hnowlan>	 tried killing the longer running of the jobs (7h+) 
[17:42:16] <hnowlan>	 that's what I'm referring to 
[17:42:20] <urandom>	 ok
[17:42:25] <hnowlan>	 the bad trend started around 8:00 https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&from=1712734511374&to=1712770575924
[17:42:35] <cwhite>	 Should we try killing longer jobs?
[17:42:40] <hnowlan>	 without better knowledge of what files are causing it I dunno what to do 
[17:42:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P60278 and previous config saved to /var/cache/conftool/dbconfig/20240410-174244-arnaudb.json
[17:42:46] <wikibugs>	 (PS2) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[17:43:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:44:04] <hnowlan>	 I suspect there's a good reason we can't but I wonder whether we could pool the codfw videoscalers also 
[17:45:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:46:40] <swfrench-wmf>	 !log finished updating A:conf hosts to etcd-mirror 0.0.11-1 (T358636)
[17:46:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:45] <stashbot>	 T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636
[17:47:53] <cwhite>	 hnowlan: AFAICT, codfw videoscalers are pooled?
[17:48:10] <logmsgbot>	 !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[17:48:16] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9704944 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bulls...
[17:48:27] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[17:48:32] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9704945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS b...
[17:48:56] <hnowlan>	 cwhite: what are you basing that off? I'm just looking at discovery but I could be wrong
[17:49:12] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:49:27] <cwhite>	 `confctl select 'dc=codfw,cluster=videoscaler' get`
[17:51:37] <urandom>	 cwhite: discovery only points to eqiad
[17:51:49] <sukhe>	 service is active/passive
[17:52:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:52:36] <cwhite>	 that makes sense - naming is hard
[17:52:48] <urandom>	 !incidents
[17:52:48] <sirenbot>	 4581 (ACKED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[17:52:48] <sirenbot>	 4580 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[17:52:49] <sirenbot>	 4579 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[17:52:49] <sirenbot>	 4578 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[17:52:49] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[17:52:49] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[17:53:24] <wikibugs>	 (PS1) Btullis: Add missing file to the matomo profile [puppet] - https://gerrit.wikimedia.org/r/1018756 (https://phabricator.wikimedia.org/T351552)
[17:54:12] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:56:14] <cwhite>	 What would the effect be if we made codfw videoscalers active?  Would new jobs go there and any old ones simply finish in eqiad?
[17:56:46] <cwhite>	 by old, I mean long-running jobs
[17:57:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T360332)', diff saved to https://phabricator.wikimedia.org/P60279 and previous config saved to /var/cache/conftool/dbconfig/20240410-175752-arnaudb.json
[17:57:56] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[17:58:00] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[17:58:09] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[17:58:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T360332)', diff saved to https://phabricator.wikimedia.org/P60280 and previous config saved to /var/cache/conftool/dbconfig/20240410-175816-arnaudb.json
[17:58:59] <wikibugs>	 (CR) Btullis: [C:+2] Add missing file to the matomo profile [puppet] - https://gerrit.wikimedia.org/r/1018756 (https://phabricator.wikimedia.org/T351552) (owner: Btullis)
[17:59:12] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:59:27] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:59:40] <hnowlan>	 cwhite: that would be my hope but I don't really know what the risks would be 
[17:59:43] <hnowlan>	 for now
[17:59:49] <hnowlan>	 let's just drop the concurrency in the jobqueue to 1 for both
[17:59:57] <hnowlan>	 I don't really know whether that will fix it 
[18:00:05] <jouncebot>	 hashar and jnuche: Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T1800)
[18:00:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:00:16] <hashar>	 ^ I have done it earlier today
[18:00:49] <wikibugs>	 (CR) Muehlenhoff: prometheus: Ensure TLS certificates are provided by CFSSL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[18:01:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T360332)', diff saved to https://phabricator.wikimedia.org/P60281 and previous config saved to /var/cache/conftool/dbconfig/20240410-180111-arnaudb.json
[18:01:51] <urandom>	 hnowlan: I guess that will result in scaling getting ((very ) far) behind, but otherwise preserve the cluster/stop the paging?
[18:03:14] <urandom>	 is there anything that shows the backlog?
[18:03:33] <wikibugs>	 (Abandoned) Dwisehaupt: Enable https with apache for community civicrm [puppet] - https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[18:03:44] <hnowlan>	 down the bottom here https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqiad%20prometheus%2Fk8s&var-job=webVideoTranscodePrioritized&from=now-12h&to=now 
[18:05:12] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage
[18:05:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:05:32] <urandom>	 perfect.
[18:08:16] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage
[18:08:26] <hnowlan>	 shall I do the concurrency change then
[18:08:46] <urandom>	 I'm about to submit a gerrit
[18:08:57] <urandom>	 you can if you want, or I can add you to review!
[18:09:03] <wikibugs>	 (CR) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[18:09:12] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:09:22] <hnowlan>	 urandom: please do
[18:10:06] <wikibugs>	 (PS1) Jforrester: Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies [core] (wmf/1.42.0-wmf.26) - https://gerrit.wikimedia.org/r/1018692 (https://phabricator.wikimedia.org/T362221)
[18:10:52] <wikibugs>	 (PS1) Eevans: changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1018759
[18:11:38] <urandom>	 done ^^^
[18:12:31] <wikibugs>	 (CR) Cwhite: [C:+1] changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1018759 (owner: Eevans)
[18:12:51] <wikibugs>	 (CR) Hnowlan: [C:+1] changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1018759 (owner: Eevans)
[18:13:07] <wikibugs>	 (CR) Eevans: [C:+2] changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1018759 (owner: Eevans)
[18:13:58] <wikibugs>	 (Merged) jenkins-bot: changeprop-jobqueue: temporarily reduce video transcode concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1018759 (owner: Eevans)
[18:14:03] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1018409
[18:15:02] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[18:15:04] <wikibugs>	 (CR) Muehlenhoff: prometheus: Ensure TLS certificates are provided by CFSSL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[18:15:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:15:33] <wikibugs>	 (CR) Herron: prometheus: Ensure TLS certificates are provided by CFSSL (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[18:15:43] <urandom>	 I guess staging wasn't applied earlier
[18:16:05] <urandom>	 I assume it is OK to do so though?
[18:16:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:16:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P60282 and previous config saved to /var/cache/conftool/dbconfig/20240410-181618-arnaudb.json
[18:16:24] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[18:16:31] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[18:17:11] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[18:18:41] <wikibugs>	 (PS3) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[18:19:25] <wikibugs>	 (CR) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[18:19:40] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:19:51] <urandom>	 acked that.
[18:21:31] <urandom>	 I wonder if we'll need to kill some more ffmpeg processes to create headroom
[18:21:40] <urandom>	 (again)
[18:22:47] <hnowlan>	 I'd let it sit a bit 
[18:24:12] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:24:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3071.esams.wmnet,service=(cdn|ats-be)
[18:24:40] <wikibugs>	 (CR) Ssingh: [C:+2] cp3071: update hieradata for dual NVMe disks configuration [puppet] - https://gerrit.wikimedia.org/r/1015973 (https://phabricator.wikimedia.org/T360430) (owner: Ssingh)
[18:26:40] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye
[18:26:49] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9705040 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bulls...
[18:28:28] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp3071.esams.wmnet with OS bullseye
[18:28:41] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705041 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3071.esams.wmnet with OS bullseye
[18:30:09] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1115.eqiad.wmnet,service=(cdn|ats-be)
[18:31:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P60283 and previous config saved to /var/cache/conftool/dbconfig/20240410-183126-arnaudb.json
[18:32:30] <wikibugs>	 (CR) Eevans: [C:+2] echostore: configure TLS verification in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1018742 (https://phabricator.wikimedia.org/T352647) (owner: Eevans)
[18:32:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:33:27] <wikibugs>	 (Merged) jenkins-bot: echostore: configure TLS verification in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1018742 (https://phabricator.wikimedia.org/T352647) (owner: Eevans)
[18:34:24] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply
[18:34:53] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[18:36:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:37:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:39:46] <wikibugs>	 (PS4) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[18:40:29] <wikibugs>	 ops-codfw, ops-eqiad, SRE-swift-storage, DC-Ops, Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9705080 (ssingh) For `cp1115` that we tried today, I downgraded the BIOS, NIC and iDRAC firmwares, to match what we have in esams, whe...
[18:41:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:46:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:46:30] <jinxer-wm>	 (ProbeDown) firing: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:46:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T360332)', diff saved to https://phabricator.wikimedia.org/P60284 and previous config saved to /var/cache/conftool/dbconfig/20240410-184633-arnaudb.json
[18:46:36] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[18:46:38] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[18:46:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[18:46:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T360332)', diff saved to https://phabricator.wikimedia.org/P60285 and previous config saved to /var/cache/conftool/dbconfig/20240410-184656-arnaudb.json
[18:51:30] <jinxer-wm>	 (ProbeDown) resolved: (2) Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:51:32] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3071.esams.wmnet with reason: host reimage
[18:54:41] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3071.esams.wmnet with reason: host reimage
[18:57:41] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1018749/1858/" [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[18:58:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:03:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:03:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T360332)', diff saved to https://phabricator.wikimedia.org/P60287 and previous config saved to /var/cache/conftool/dbconfig/20240410-190347-arnaudb.json
[19:04:02] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[19:06:20] <wikibugs>	 (PS17) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[19:06:39] <wikibugs>	 (PS18) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[19:08:36] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[19:09:53] <urandom>	 cwhite: looks like it is recovering a bit
[19:10:16] <urandom>	 s/recovering a bit/death-spiraling less/
[19:10:49] <hnowlan>	 definitely seeming a bit better, more stable responses, less active workers
[19:10:52] <hnowlan>	 https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1
[19:11:18] <urandom>	 ya
[19:14:24] <cwhite>	 cluster is still under heavy load, but at least monitoring isn't complaining :)
[19:18:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P60288 and previous config saved to /var/cache/conftool/dbconfig/20240410-191854-arnaudb.json
[19:20:29] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3071.esams.wmnet with OS bullseye
[19:20:45] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705132 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3071.esams.wmnet with OS bullseye completed: - cp3071 (**PASS**)...
[19:24:06] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9705133 (ssingh)
[19:24:52] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3071.esams.wmnet,service=(cdn|ats-be)
[19:28:19] <wikibugs>	 (CR) Herron: "NOOPs on the prometheus pop hosts e.g. prometheus6002 seems off to me" [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[19:34:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P60289 and previous config saved to /var/cache/conftool/dbconfig/20240410-193402-arnaudb.json
[19:42:15] <jinxer-wm>	 (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[19:42:54] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] LogStash: log HtmlOutputRendererHelper channel [mediawiki-config] - https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: Daniel Kinzler)
[19:43:20] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] "I scheduled it for deployment: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T2000" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: Daniel Kinzler)
[19:46:29] <wikibugs>	 (PS19) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[19:47:35] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[19:49:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T360332)', diff saved to https://phabricator.wikimedia.org/P60290 and previous config saved to /var/cache/conftool/dbconfig/20240410-194909-arnaudb.json
[19:49:12] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:49:19] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[19:49:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:50:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[19:50:18] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[19:50:58] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:51:00] <wikibugs>	 (PS1) Andrew Bogott: nova-fullstack: switch test image from Bullseye to Bookworm [puppet] - https://gerrit.wikimedia.org/r/1018777
[19:51:12] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:51:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance
[19:52:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance
[19:52:17] <wikibugs>	 (CR) Andrew Bogott: [C:+2] nova-fullstack: switch test image from Bullseye to Bookworm [puppet] - https://gerrit.wikimedia.org/r/1018777 (owner: Andrew Bogott)
[19:53:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance
[19:53:31] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance
[19:54:21] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[19:54:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[19:54:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2112 (T360332)', diff saved to https://phabricator.wikimedia.org/P60291 and previous config saved to /var/cache/conftool/dbconfig/20240410-195430-arnaudb.json
[19:54:35] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[19:56:38] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: swith to "failure-rate" retry strategy [deployment-charts] - https://gerrit.wikimedia.org/r/1018778
[19:57:09] <wikibugs>	 (PS20) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[19:57:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T360332)', diff saved to https://phabricator.wikimedia.org/P60292 and previous config saved to /var/cache/conftool/dbconfig/20240410-195730-arnaudb.json
[19:58:15] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T2000).
[20:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:13] <MatmaRex>	 hi
[20:00:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[20:00:19] <cjming>	 o/
[20:00:22] <cjming>	 i can deploy
[20:00:29] <MatmaRex>	 just a trivial patch today, no way to really test it, so it can go out directly
[20:00:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: Daniel Kinzler)
[20:00:55] <cjming>	 sounds good - will sync
[20:01:05] <MatmaRex>	 thanks
[20:01:12] <cjming>	 np!
[20:01:18] <wikibugs>	 (Merged) jenkins-bot: LogStash: log HtmlOutputRendererHelper channel [mediawiki-config] - https://gerrit.wikimedia.org/r/1018738 (https://phabricator.wikimedia.org/T356157) (owner: Daniel Kinzler)
[20:01:48] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1018738|LogStash: log HtmlOutputRendererHelper channel (T356157)]]
[20:02:04] <stashbot>	 T356157: Unable to fetch Parsoid HTML - https://phabricator.wikimedia.org/T356157
[20:04:23] <logmsgbot>	 !log cjming@deploy1002 cjming and daniel: Backport for [[gerrit:1018738|LogStash: log HtmlOutputRendererHelper channel (T356157)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:04:31] <logmsgbot>	 !log cjming@deploy1002 cjming and daniel: Continuing with sync
[20:09:40] <wikibugs>	 (PS21) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:10:47] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[20:12:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P60293 and previous config saved to /var/cache/conftool/dbconfig/20240410-201237-arnaudb.json
[20:15:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[20:15:39] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1018738|LogStash: log HtmlOutputRendererHelper channel (T356157)]] (duration: 13m 51s)
[20:15:48] <stashbot>	 T356157: Unable to fetch Parsoid HTML - https://phabricator.wikimedia.org/T356157
[20:15:53] <cjming>	 MatmaRex: should be live!
[20:16:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[20:16:18] <MatmaRex>	 thanks cjming
[20:16:28] <cjming>	 yw!
[20:16:28] <MatmaRex>	 hopefully we'll see some logs in that channel
[20:16:35] <cjming>	 🤞
[20:16:58] <cjming>	 i'm going to close the backport window cuz i gotta run to a mtg
[20:17:38] <cjming>	 !log end of UTC late backport window
[20:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:10] <wikibugs>	 (PS4) Hashar: logging: default to log any error [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[20:19:41] <wikibugs>	 (CR) Hashar: [C:-1] "I found out we have some tests, I will look at adding a test covering the behavior." [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[20:26:02] <wikibugs>	 (PS22) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:27:08] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[20:27:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P60294 and previous config saved to /var/cache/conftool/dbconfig/20240410-202745-arnaudb.json
[20:29:40] <wikibugs>	 (PS1) Volans: quotereviewer: support tables with Qty field [software] - https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260)
[20:31:02] <wikibugs>	 (PS23) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:31:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[20:32:09] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[20:35:21] <wikibugs>	 (CR) RobH: [C:+2] quotereviewer: support tables with Qty field [software] - https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) (owner: Volans)
[20:35:57] <wikibugs>	 (Merged) jenkins-bot: quotereviewer: support tables with Qty field [software] - https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) (owner: Volans)
[20:37:32] <wikibugs>	 (CR) RobH: [C:+2] "recheck" [software] - https://gerrit.wikimedia.org/r/1018783 (https://phabricator.wikimedia.org/T362260) (owner: Volans)
[20:39:15] <wikibugs>	 (PS24) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[20:40:21] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[20:42:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T360332)', diff saved to https://phabricator.wikimedia.org/P60295 and previous config saved to /var/cache/conftool/dbconfig/20240410-204253-arnaudb.json
[20:42:55] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:42:58] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[20:43:09] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:43:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T360332)', diff saved to https://phabricator.wikimedia.org/P60296 and previous config saved to /var/cache/conftool/dbconfig/20240410-204316-arnaudb.json
[20:44:33] <urandom>	 !incidents
[20:44:34] <sirenbot>	 4583 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:34] <sirenbot>	 4582 (RESOLVED)  ProbeDown sre (ip4 probes/service eqiad)
[20:44:34] <sirenbot>	 4581 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:34] <sirenbot>	 4580 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:35] <sirenbot>	 4579 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[20:44:35] <sirenbot>	 4578 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[20:44:35] <sirenbot>	 4577 (RESOLVED)  ProbeDown sre (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 eqiad)
[20:44:35] <sirenbot>	 4576 (RESOLVED)  db1152 (paged)/MariaDB read only x2 (paged)
[20:46:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T360332)', diff saved to https://phabricator.wikimedia.org/P60297 and previous config saved to /var/cache/conftool/dbconfig/20240410-204617-arnaudb.json
[20:59:37] <wikibugs>	 (PS25) Bking: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240410T2100)
[21:00:46] <wikibugs>	 (CR) CI reject: [V:-1] Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) (owner: Bking)
[21:01:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P60298 and previous config saved to /var/cache/conftool/dbconfig/20240410-210125-arnaudb.json
[21:16:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P60300 and previous config saved to /var/cache/conftool/dbconfig/20240410-211632-arnaudb.json
[21:31:15] <wikibugs>	 (PS1) EoghanGaffney: gitlab: Unquote rsync path with glob [puppet] - https://gerrit.wikimedia.org/r/1018796
[21:31:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T360332)', diff saved to https://phabricator.wikimedia.org/P60301 and previous config saved to /var/cache/conftool/dbconfig/20240410-213140-arnaudb.json
[21:31:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[21:31:48] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[21:31:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[21:32:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T360332)', diff saved to https://phabricator.wikimedia.org/P60302 and previous config saved to /var/cache/conftool/dbconfig/20240410-213203-arnaudb.json
[21:35:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T360332)', diff saved to https://phabricator.wikimedia.org/P60303 and previous config saved to /var/cache/conftool/dbconfig/20240410-213506-arnaudb.json
[21:40:33] <wikibugs>	 (CR) Dzahn: [C:+1] gitlab: Unquote rsync path with glob [puppet] - https://gerrit.wikimedia.org/r/1018796 (owner: EoghanGaffney)
[21:40:42] <wikibugs>	 (CR) EoghanGaffney: [C:+2] gitlab: Unquote rsync path with glob [puppet] - https://gerrit.wikimedia.org/r/1018796 (owner: EoghanGaffney)
[21:50:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P60304 and previous config saved to /var/cache/conftool/dbconfig/20240410-215014-arnaudb.json
[21:56:52] <mutante>	 !log prometheus - recreating deleted TLS certs/keys in private repo
[21:56:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P60305 and previous config saved to /var/cache/conftool/dbconfig/20240410-220521-arnaudb.json
[22:13:56] <wikibugs>	 (PS5) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[22:20:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T360332)', diff saved to https://phabricator.wikimedia.org/P60306 and previous config saved to /var/cache/conftool/dbconfig/20240410-222028-arnaudb.json
[22:20:31] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[22:20:35] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[22:20:44] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[22:21:30] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[22:21:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[22:21:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T360332)', diff saved to https://phabricator.wikimedia.org/P60307 and previous config saved to /var/cache/conftool/dbconfig/20240410-222150-arnaudb.json
[22:24:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T360332)', diff saved to https://phabricator.wikimedia.org/P60308 and previous config saved to /var/cache/conftool/dbconfig/20240410-222445-arnaudb.json
[22:31:55] <wikibugs>	 (PS6) Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414)
[22:36:37] <wikibugs>	 (CR) Krinkle: [C:+1] static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:37:47] <wikibugs>	 (CR) Krinkle: [C:+1] static.php: Handle mediawiki.org/ontology/ontology.owl (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:39:38] <wikibugs>	 (PS3) Ahmon Dancy: static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807)
[22:39:46] <wikibugs>	 (CR) Ahmon Dancy: static.php: Handle mediawiki.org/ontology/ontology.owl (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:39:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P60309 and previous config saved to /var/cache/conftool/dbconfig/20240410-223953-arnaudb.json
[22:41:57] <wikibugs>	 (CR) Krinkle: [C:+1] static.php: Handle mediawiki.org/ontology/ontology.owl [mediawiki-config] - https://gerrit.wikimedia.org/r/1018354 (https://phabricator.wikimedia.org/T171807) (owner: Ahmon Dancy)
[22:49:40] <wikibugs>	 (CR) Dzahn: [C:-1] "I don't see an "include ::profile::tlsproxy::envoy" in the prometheus::pop role. Also there is no "ensure: present" for "tlsproxy::envoy" " [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[22:55:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P60310 and previous config saved to /var/cache/conftool/dbconfig/20240410-225500-arnaudb.json
[22:56:18] <wikibugs>	 (PS1) Andrea Denisse: prometheus: Ensure the Prometheus PoP role uses TLSProxy [puppet] - https://gerrit.wikimedia.org/r/1018802 (https://phabricator.wikimedia.org/T360414)
[23:10:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T360332)', diff saved to https://phabricator.wikimedia.org/P60311 and previous config saved to /var/cache/conftool/dbconfig/20240410-231008-arnaudb.json
[23:10:11] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[23:10:13] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:10:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[23:10:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T360332)', diff saved to https://phabricator.wikimedia.org/P60312 and previous config saved to /var/cache/conftool/dbconfig/20240410-231032-arnaudb.json
[23:13:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T360332)', diff saved to https://phabricator.wikimedia.org/P60313 and previous config saved to /var/cache/conftool/dbconfig/20240410-231335-arnaudb.json
[23:28:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P60314 and previous config saved to /var/cache/conftool/dbconfig/20240410-232842-arnaudb.json
[23:37:47] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018410
[23:37:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018410 (owner: TrainBranchBot)
[23:43:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P60315 and previous config saved to /var/cache/conftool/dbconfig/20240410-234350-arnaudb.json
[23:58:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T360332)', diff saved to https://phabricator.wikimedia.org/P60316 and previous config saved to /var/cache/conftool/dbconfig/20240410-235857-arnaudb.json
[23:59:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[23:59:02] <stashbot>	 T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[23:59:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[23:59:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T360332)', diff saved to https://phabricator.wikimedia.org/P60317 and previous config saved to /var/cache/conftool/dbconfig/20240410-235920-arnaudb.json
[23:59:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T356166)', diff saved to https://phabricator.wikimedia.org/P60318 and previous config saved to /var/cache/conftool/dbconfig/20240410-235950-marostegui.json
[23:59:55] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166