[00:03:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1018421 (owner: TrainBranchBot)
[01:05:25] (SystemdUnitFailed) firing: php7.4-fpm_check_restart.service on mw1445:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:05:50] !log Manually deleting /srv/syslog/.linux.dhcp.DictModel/syslog.log from November 30 on centrallog1002 and centrallog2002 after the prune_old_srv_syslog_directories.service failed to delete the non-empty directory - T362376
[01:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:55] T362376: The prune_old_srv_syslog_directories.service can't delete non-empty directories on centrallog instances - https://phabricator.wikimedia.org/T362376
[02:00:25] (SystemdUnitFailed) firing: prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:26:39] (PS1) Andrea Denisse: syslog: Update log cleanup command in syslog central server [puppet] - https://gerrit.wikimedia.org/r/1019139 (https://phabricator.wikimedia.org/T362376)
[02:31:03] (CR) Cwhite: "Test was successful."
[puppet] - https://gerrit.wikimedia.org/r/1018417 (https://phabricator.wikimedia.org/T348508) (owner: Cwhite)
[02:31:36] (CR) Andrea Denisse: "I tested this on pontoon-log-03.monitoring.eqiad1.wikimedia.cloud by creating sample logs older than the mtime required and sample logs mo" [puppet] - https://gerrit.wikimedia.org/r/1019139 (https://phabricator.wikimedia.org/T362376) (owner: Andrea Denisse)
[02:39:15] (JobrunnerPHPBusyWorkers) firing: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[02:46:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[02:47:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T356166)', diff saved to https://phabricator.wikimedia.org/P60449 and previous config saved to /var/cache/conftool/dbconfig/20240412-024729-marostegui.json
[02:47:35] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[02:53:28] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers -
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[03:02:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P60450 and previous config saved to /var/cache/conftool/dbconfig/20240412-030237-marostegui.json
[03:07:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[03:09:48] (CR) Andrew Bogott: [C:+2] Prepare cloudbackup200[12] for decom [puppet] - https://gerrit.wikimedia.org/r/1017916 (https://phabricator.wikimedia.org/T356216) (owner: Andrew Bogott)
[03:17:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P60451 and previous config saved to /var/cache/conftool/dbconfig/20240412-031744-marostegui.json
[03:32:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T356166)', diff saved to https://phabricator.wikimedia.org/P60452 and previous config saved to /var/cache/conftool/dbconfig/20240412-033254-marostegui.json
[03:32:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[03:32:59] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[03:33:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[03:33:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T356166)', diff saved to
https://phabricator.wikimedia.org/P60453 and previous config saved to /var/cache/conftool/dbconfig/20240412-033317-marostegui.json
[03:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:44:15] (JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
[04:27:49] (ProbeDown) firing: (3) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:28:11] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:32:49] (ProbeDown) resolved: (3) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:33:11] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:05:25] (SystemdUnitFailed) firing: php7.4-fpm_check_restart.service on mw1445:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:57] (PS1) Marostegui: db2109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1019146
[05:16:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2109', diff saved to https://phabricator.wikimedia.org/P60454 and previous config saved to /var/cache/conftool/dbconfig/20240412-051606-root.json
[05:16:52] (CR) Marostegui: [C:+2] db2109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1019146 (owner: Marostegui)
[05:17:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2109.codfw.wmnet with OS bookworm
[05:18:03] (CR) Muehlenhoff: [C:+2] apt-staging: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/1012346 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[05:18:56] (PS1) Marostegui: Revert "db2109: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1019166
[05:23:43] !log prune obsolete nginx debs on apt-staging after switch to new nginx provider scheme T329529
[05:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:48] T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529
[05:25:54] SRE, Infrastructure-Foundations, Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529#9709060 (MoritzMuehlenhoff)
[05:32:05] (CR) Muehlenhoff: [C:+1] "Looks good, a few nits inline"
[puppet] - https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[05:33:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2109.codfw.wmnet with reason: host reimage
[05:33:28] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[05:35:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2109.codfw.wmnet with reason: host reimage
[05:39:50] (CR) Muehlenhoff: [C:+1] "Looks good, two nits inline." [puppet] - https://gerrit.wikimedia.org/r/1019115 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[05:54:00] (CR) Marostegui: [C:+2] Revert "db2109: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1019166 (owner: Marostegui)
[05:54:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60455 and previous config saved to /var/cache/conftool/dbconfig/20240412-055401-root.json
[05:56:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 928ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:56:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2109.codfw.wmnet with OS bookworm
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240412T0600)
[06:00:25] (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 998.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60456 and previous config saved to /var/cache/conftool/dbconfig/20240412-060907-root.json
[06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:21:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:24:13] !log marostegui@cumin1002 dbctl commit (dc=all):
'db2109 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60457 and previous config saved to /var/cache/conftool/dbconfig/20240412-062412-root.json
[06:26:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:26:35] (PS1) Muehlenhoff: Switch moss nodes away from insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1019147 (https://phabricator.wikimedia.org/T349619)
[06:34:00] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[06:39:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60458 and previous config saved to /var/cache/conftool/dbconfig/20240412-063918-root.json
[06:45:49] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9709114 (MoritzMuehlenhoff)
[06:45:56] (CR) Arnaudb: "2002 was because of a puppet error yep: https://sal.toolforge.org/log/Lg7Moo4BGiVuUzOdsEuh" [puppet] - https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[06:54:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60459 and previous config saved to /var/cache/conftool/dbconfig/20240412-065424-root.json
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240412T0700)
[07:09:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60460 and previous config saved to /var/cache/conftool/dbconfig/20240412-070930-root.json
[07:24:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60461 and previous config saved to /var/cache/conftool/dbconfig/20240412-072435-root.json
[07:25:31] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9709152 (jcrespo) Resolved→Open hi, we cannot ssh into dbprov1006.eqiad.wmnet
[07:29:23] (CR) JMeybohm: [C:+2] kubernetes::master: Forward audit logs to kafka [puppet] - https://gerrit.wikimedia.org/r/1019049 (https://phabricator.wikimedia.org/T290020) (owner: JMeybohm)
[07:30:58] (PS1) Muehlenhoff: Remove remaining diamond leftovers [puppet] - https://gerrit.wikimedia.org/r/1019153
[07:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:38:01] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9709163 (MoritzMuehlenhoff)
[07:39:29] (CR) Hashar: logging: default to log any error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[07:41:37] (CR) Majavah: [C:+1] "thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1019153 (owner: Muehlenhoff)
[07:42:02] (CR) MVernon: [C:+1] Switch moss nodes away from insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1019147 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:44:09] (CR) Muehlenhoff: [C:+2] Switch moss nodes away from insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1019147 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:59:18] (CR) Muehlenhoff: [C:+2] Remove remaining diamond leftovers [puppet] - https://gerrit.wikimedia.org/r/1019153 (owner: Muehlenhoff)
[08:09:40] (PS2) Majavah: Add toolsadmin-toolsbeta [dns] - https://gerrit.wikimedia.org/r/1018656 (https://phabricator.wikimedia.org/T360025)
[08:12:38] (CR) Jcrespo: [C:+1] mariadb: Reenable notifications for db2201 & db2202 [puppet] - https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[08:13:07] (CR) Majavah: [C:+2] Add toolsadmin-toolsbeta [dns] - https://gerrit.wikimedia.org/r/1018656 (https://phabricator.wikimedia.org/T360025) (owner: Majavah)
[08:13:59] (CR) Jcrespo: [C:+1] "DBAs: On the previous comment you will find the date of source of the data recovery for this last batch of 2 hosts."
[puppet] - https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[08:36:14] (PS1) Majavah: hieradata: Update Striker to 2024-04-12-081232-production [puppet] - https://gerrit.wikimedia.org/r/1019233
[08:37:13] (CR) Majavah: [C:+2] hieradata: Update Striker to 2024-04-12-081232-production [puppet] - https://gerrit.wikimedia.org/r/1019233 (owner: Majavah)
[08:52:06] (PS6) Winston Sung: zhwikivoyage: Make RelatedArticles extension usable on zhwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1015551 (https://phabricator.wikimedia.org/T361427) (owner: S8321414)
[08:56:45] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[09:05:25] (SystemdUnitFailed) firing: php7.4-fpm_check_restart.service on mw1445:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:38] ^checking
[09:11:00] ah yeah obviously it's a videoscaler
[09:13:10] SRE, Cloud-VPS, DNS, Traffic: DNS name resolution failure with www.spacecom.mil from Cloud VPS - https://phabricator.wikimedia.org/T346471#9709332 (taavi) Open→Resolved a:taavi
[09:13:53] all good, transient failure, probably due to it being a bit overloaded
[09:15:25] (SystemdUnitFailed) resolved: php7.4-fpm_check_restart.service on mw1445:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:25:07] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on matomo1003.eqiad.wmnet with reason: Still in setup
[09:25:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on matomo1003.eqiad.wmnet
with reason: Still in setup
[09:26:57] !log installing debootstrap bugfix updates from Bullseye point release
[09:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:41] SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9709387 (MoritzMuehlenhoff)
[09:36:01] !log installing postgresql-common bugfix updates from Bullseye point release
[09:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:23] (CR) Filippo Giunchedi: [C:+1] opensearch: bump curator version to wmf4 [puppet] - https://gerrit.wikimedia.org/r/1018417 (https://phabricator.wikimedia.org/T348508) (owner: Cwhite)
[09:45:03] SRE, Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9709405 (MoritzMuehlenhoff)
[09:46:22] (CR) Jcrespo: [C:+2] "Ack" [puppet] - https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[09:46:38] (PS2) Jcrespo: mariadb: Reenable notifications for db2201 & db2202 [puppet] - https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422)
[09:46:42] (CR) Jcrespo: [V:+2 C:+2] mariadb: Reenable notifications for db2201 & db2202 [puppet] - https://gerrit.wikimedia.org/r/1019077 (https://phabricator.wikimedia.org/T355422) (owner: Jcrespo)
[09:50:20] (PS1) Hashar: logging: always register udp2log handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838)
[09:57:59] (CR) Hashar: logging: default to log any error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[09:58:10] (CR) Filippo Giunchedi: [C:-1] "Thank you for working on this!
I don't think we should go ahead with this, see below" [puppet] - https://gerrit.wikimedia.org/r/1019139 (https://phabricator.wikimedia.org/T362376) (owner: Andrea Denisse)
[09:58:12] !log mwmaint1002: mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=eswiki --search-index (T362367)
[09:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:16] T362367: [wmf.26 - eswiki] Homepage: task counter issues - "No suggestions found" incorrectly displayed - https://phabricator.wikimedia.org/T362367
[09:58:37] (PS1) Muehlenhoff: Deprecate system::role for IF services (batch three) [puppet] - https://gerrit.wikimedia.org/r/1019255
[10:00:25] (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:15] (PS1) Muehlenhoff: icinga: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1019261
[10:20:53] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1019261 (owner: Muehlenhoff)
[10:31:38] (CR) Filippo Giunchedi: [C:+1] icinga: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1019261 (owner: Muehlenhoff)
[10:40:12] (CR) Dreamy Jazz: "Done" [puppet] - https://gerrit.wikimedia.org/r/1013130 (https://phabricator.wikimedia.org/T360516) (owner: Tchanders)
[10:41:53] (CR) Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: Dreamy Jazz)
[10:42:44] (PS12) Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506)
[10:47:40] (PS4) Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751)
[10:47:40] (PS1) Jcrespo: mariadb: Move services db2101->db2201,db2099->db2199, upgrade dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/1019263 (https://phabricator.wikimedia.org/T358741)
[10:47:58] Dreamy_Jazz: I kinda want to !bash “the year will start with 2 for the foreseeable future” :D
[10:48:08] (from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1014526)
[10:50:18] (CR) Jcrespo: "I am going to prefer this to be merged before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018276, so that db2201 can be decommis" [puppet] - https://gerrit.wikimedia.org/r/1019263 (https://phabricator.wikimedia.org/T358741) (owner: Jcrespo)
[10:51:03] (CR) Hnowlan: [C:+1] "I am fine with disabling paging for videoscalers and just having this a quasi-business hours response. There might be better alerts we can" [puppet] - https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: Cwhite)
[10:52:02] (PS1) Btullis: Swap accidental spaces for tabs in matomo partman recipe [puppet] - https://gerrit.wikimedia.org/r/1019264 (https://phabricator.wikimedia.org/T349397)
[10:53:05] Lucas_WMDE: :)
[10:53:57] (PS2) Jcrespo: mariadb: Move services db2101->db2201,db2099->db2199, upgrade dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/1019263 (https://phabricator.wikimedia.org/T358741)
[10:54:55] Lucas_WMDE: Dreamy_Jazz: year 3k problem is for the next developers to solve? :))
[10:55:10] Yeah.
[10:55:25] That was a quote from the meeting I was in deciding to use ~2 as the format :)
[10:55:54] !log mwmaint1002: mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=frwiki --search-index (T362367)
[10:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:59] T362367: [wmf.26 - eswiki] Homepage: task counter issues - "No suggestions found" incorrectly displayed - https://phabricator.wikimedia.org/T362367
[10:56:10] (PS2) Aklapper: Migrate from 100-scale to unit-scale SLO recording rules [grafana-grizzly] - https://gerrit.wikimedia.org/r/914946 (https://phabricator.wikimedia.org/T289615) (owner: RLazarus)
[10:56:20] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:57:23] SRE, SRE-OnFire, observability: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569#9709555 (Aklapper) @CDanis: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via {nav name=Add Action... >...
[11:06:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[11:06:56] (ProbeDown) firing: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:11:56] (ProbeDown) resolved: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:27] SRE, Traffic-Icebox, HTTPS, Patch-Needs-Improvement: Provide acme-chief/TLS SNI list support in compile_redirects() - https://phabricator.wikimedia.org/T225096#9709629 (Aklapper)
[11:14:37] (CR) Alexandros Kosiaris: [C:+1] service catalog: disable paging on jobrunner and videoscaler services [puppet] - https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: Cwhite)
[11:15:17] SRE, Infrastructure-Foundations, Puppet CI, Patch-Needs-Improvement: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954#9709634 (Aklapper)
[11:22:25] (CR) Btullis: [C:+2] Swap accidental spaces for tabs in matomo partman recipe [puppet] - https://gerrit.wikimedia.org/r/1019264 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[11:33:23] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm
[11:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:36:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 832.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:37:51] (PS2) Hashar: logging: always register udp2log handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838)
[11:37:51] (PS1) Hashar: logging: pluralize $wmgDefaultMonologHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/1019267 (https://phabricator.wikimedia.org/T238838)
[11:45:57] (PS6) Hnowlan: shellbox: add PHP + Apache timeout settings [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[11:46:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 843.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:46:29] (CR) Hnowlan: [C:+1] shellbox: add PHP + Apache timeout settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[11:50:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T356166)', diff saved to https://phabricator.wikimedia.org/P60463 and previous config saved to /var/cache/conftool/dbconfig/20240412-115029-marostegui.json
[11:50:36] T356166: Drop cl_collation_ext index from categorylinks
in production - https://phabricator.wikimedia.org/T356166
[12:02:41] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm
[12:05:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P60464 and previous config saved to /var/cache/conftool/dbconfig/20240412-120537-marostegui.json
[12:10:40] (PS1) Slyngshede: New SSH key validator - Block duplicate keys. [software/bitu] - https://gerrit.wikimedia.org/r/1019271 (https://phabricator.wikimedia.org/T359532)
[12:19:29] (PS1) Elukey: role::cassandra_dev: move to PKI [puppet] - https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647)
[12:20:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P60466 and previous config saved to /var/cache/conftool/dbconfig/20240412-122045-marostegui.json
[12:21:33] (PS1) Elukey: role::cassandra_dev: add fake truststore password for PKI [labs/private] - https://gerrit.wikimedia.org/r/1019274 (https://phabricator.wikimedia.org/T352647)
[12:23:31] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1889/co" [puppet] - https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[12:25:28] (PS1) David Caro: kubeadm::worker: increase the limit of inotify user instances [puppet] - https://gerrit.wikimedia.org/r/1019277 (https://phabricator.wikimedia.org/T361519)
[12:27:16] (CR) David Caro: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1890/co" [puppet] - https://gerrit.wikimedia.org/r/1019277 (https://phabricator.wikimedia.org/T361519) (owner: David Caro)
[12:27:52] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1019277 (https://phabricator.wikimedia.org/T361519) (owner: 10David Caro) [12:30:31] (03CR) 10David Caro: [V:03+1 C:03+2] kubeadm::worker: increase the limit of inotify user instances [puppet] - 10https://gerrit.wikimedia.org/r/1019277 (https://phabricator.wikimedia.org/T361519) (owner: 10David Caro) [12:31:00] !log updated rsyslog to 8.2404.0-1~bpo11+1 on staging-codfw and staging-eqiad k8s clusters - T357616 [12:34:23] (03PS1) 10JMeybohm: kubernetes::node: Log a line on rsyslog fd leak restarts [puppet] - 10https://gerrit.wikimedia.org/r/1019278 (https://phabricator.wikimedia.org/T357616) [12:35:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T356166)', diff saved to https://phabricator.wikimedia.org/P60467 and previous config saved to /var/cache/conftool/dbconfig/20240412-123552-marostegui.json [12:35:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:36:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:40:14] (03CR) 10Elukey: [C:03+2] ml-serve: Add istio config for mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019061 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:40:26] (03PS2) 10Elukey: ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:43:03] (03CR) 10Filippo Giunchedi: [C:03+1] kubernetes::node: Log a line on rsyslog fd leak restarts [puppet] - 10https://gerrit.wikimedia.org/r/1019278 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [12:43:15] (03CR) 10JMeybohm: [C:03+2] kubernetes::node: Log a line on rsyslog fd leak restarts 
[puppet] - 10https://gerrit.wikimedia.org/r/1019278 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [12:44:32] (03CR) 10Elukey: "The CI diff seems a little messed up, I think the VS gets modified but the other ones are dropped (probably we don't merge dicts etc..). F" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:53:11] !log updated rsyslog to 8.2404.0-1~bpo11+1 on staging-codfw and staging-eqiad k8s clusters - T357616 [12:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:16] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [13:03:35] (03CR) 10Gergő Tisza: logging: always register udp2log handlers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [13:09:26] (03CR) 10Hashar: logging: always register udp2log handlers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [13:42:17] (03CR) 10Hashar: [C:03+2] Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies [core] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018692 (https://phabricator.wikimedia.org/T362221) (owner: 10Jforrester) [13:51:10] (03CR) 10Gergő Tisza: logging: always register udp2log handlers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019253 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [13:54:34] (03CR) 10Cwhite: [C:03+2] opensearch: bump curator version to wmf4 [puppet] - 10https://gerrit.wikimedia.org/r/1018417 (https://phabricator.wikimedia.org/T348508) (owner: 10Cwhite) [14:00:25] (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9600.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:19] (03PS3) 10Elukey: ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:01:21] (03Merged) 10jenkins-bot: Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies [core] (wmf/1.42.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1018692 (https://phabricator.wikimedia.org/T362221) (owner: 10Jforrester) [14:01:49] (03CR) 10Cwhite: [C:03+2] service catalog: disable paging on jobrunner and videoscaler services [puppet] - 10https://gerrit.wikimedia.org/r/1018420 (https://phabricator.wikimedia.org/T349796) (owner: 10Cwhite) [14:03:36] (03CR) 10Elukey: "The diff doesn't look right, we need to copy the whole set of environment variables :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:05:43] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1018692|Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies (T362221)]] [14:05:50] T362221: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. 
[Called from MediaWiki\Parser\Parser::fetchTemplateAndTitle] - https://phabricator.wikimedia.org/T362221 [14:06:16] (03PS4) 10Elukey: ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:07:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1115.eqiad.wmnet,service=(cdn|ats-be) [14:08:10] !log depool cp1115 for PXE boot issue testing: T350179 [14:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:15] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [14:08:16] !log hashar@deploy1002 hashar and jforrester: Backport for [[gerrit:1018692|Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies (T362221)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:26] of course, having a repro makes it easier [14:09:28] !log hashar@deploy1002 hashar and jforrester: Continuing with sync [14:09:40] (03PS1) 10JMeybohm: kubernetes::master: Add support for configuring feature gates [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) [14:10:17] (03PS5) 10Elukey: ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:11:27] (03PS2) 10JMeybohm: kubernetes::master: Add support for configuring feature gates [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) [14:12:30] (03PS1) 10Btullis: More work on the matomo partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019283 (https://phabricator.wikimedia.org/T349397) [14:13:25] (03CR) 10Btullis: [C:03+2] More work on the matomo partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1019283 
(https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [14:13:47] (03CR) 10Elukey: [C:03+2] ml-staging-codfw: Override mediawiki-app-vs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019074 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:14:40] (03CR) 10Elukey: [C:03+2] revscoring-editquality-damaging: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018996 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:17:57] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [14:18:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:18:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:19:17] (03PS3) 10JMeybohm: kubernetes::master: Add support for configuring feature gates [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) [14:19:42] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:19:50] (03PS4) 10JMeybohm: kubernetes::master: Add support for configuring feature gates [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) [14:21:42] (03PS2) 10Msz2001: Remove 'obsolete-tag' from $wgSignatureAllowedLintErrors on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019176 (https://phabricator.wikimedia.org/T362414) [14:21:48] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [14:22:13] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1018692|Parser::statelessFetchTemplate: don't add interwiki redirects to dependencies (T362221)]] (duration: 16m 29s) [14:22:19] T362221: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::addTemplate with interwiki link was deprecated in MediaWiki 1.42. 
[Called from MediaWiki\Parser\Parser::fetchTemplateAndTitle] - https://phabricator.wikimedia.org/T362221 [14:38:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:54] (03PS1) 10Elukey: admin_ng: override port 80 with 4680 in istio configs for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019288 (https://phabricator.wikimedia.org/T362316) [14:50:25] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:15] (03PS2) 10Elukey: admin_ng: add port 4680 in istio configs for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019288 (https://phabricator.wikimedia.org/T362316) [14:52:35] (03PS4) 10Elukey: article-description: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:53:37] (03PS5) 10JMeybohm: kubernetes::master: Add support for configuring feature gates [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) [14:53:41] (03CR) 10Clément Goubert: [C:03+1] admin_ng: add port 4680 in istio configs for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019288 (https://phabricator.wikimedia.org/T362316) (owner: 10Elukey) [14:55:29] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: 
10JMeybohm) [14:57:34] (03CR) 10Elukey: [C:03+2] article-description: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018959 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:57:44] (03CR) 10Elukey: [C:03+2] admin_ng: add port 4680 in istio configs for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019288 (https://phabricator.wikimedia.org/T362316) (owner: 10Elukey) [14:58:08] (03PS1) 10Btullis: Tweak the matomo partman recipe again [puppet] - 10https://gerrit.wikimedia.org/r/1019289 (https://phabricator.wikimedia.org/T349397) [14:58:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:55] (03CR) 10Btullis: [C:03+2] Tweak the matomo partman recipe again [puppet] - 10https://gerrit.wikimedia.org/r/1019289 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [14:59:18] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host matomo1003.eqiad.wmnet with OS bookworm [15:01:28] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "magru - ayounsi@cumin1002" [15:02:24] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [15:03:00] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1019261 (owner: 10Muehlenhoff) [15:03:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:03:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "magru - ayounsi@cumin1002" [15:03:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:05:11] (03CR) 10Andrea Denisse: "Thanks for the explanation Filippo, it makes a lot of sense." [puppet] - 10https://gerrit.wikimedia.org/r/1019139 (https://phabricator.wikimedia.org/T362376) (owner: 10Andrea Denisse) [15:05:30] (03CR) 10Eevans: [C:03+1] role::cassandra_dev: add fake truststore password for PKI [labs/private] - 10https://gerrit.wikimedia.org/r/1019274 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:05:35] (03Abandoned) 10Andrea Denisse: syslog: Update log cleanup command in syslog central server [puppet] - 10https://gerrit.wikimedia.org/r/1019139 (https://phabricator.wikimedia.org/T362376) (owner: 10Andrea Denisse) [15:07:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:08:11] (03CR) 10Eevans: [C:03+1] role::cassandra_dev: move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:11:17] (03PS3) 10Elukey: articletopic-outlink: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018961 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:11:35] (03CR) 10Elukey: [V:03+2 C:03+2] role::cassandra_dev: add fake truststore password for PKI [labs/private] - 10https://gerrit.wikimedia.org/r/1019274 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:15:25] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:23] (03CR) 10Elukey: [C:03+1] "Very elegant, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1019282 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [15:17:26] (03PS1) 10Hnowlan: restbase: migrate to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1019290 (https://phabricator.wikimedia.org/T360636) [15:18:07] (03CR) 10Elukey: [C:03+2] articletopic-outlink: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018961 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:20:02] (03CR) 10Elukey: [V:03+1] "On a second thought, lemme split the change in two: truststore and keystore" [puppet] - 10https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:21:52] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "magru - ayounsi@cumin1002" [15:22:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[15:22:08] (03PS2) 10Elukey: role::cassandra_dev: move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) [15:22:08] (03PS1) 10Elukey: role::cassandra_dev: add new truststore to support PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019291 (https://phabricator.wikimedia.org/T352647) [15:23:01] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "magru - ayounsi@cumin1002" [15:23:45] (03PS3) 10Elukey: experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:23:47] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1895/co" [puppet] - 10https://gerrit.wikimedia.org/r/1019290 (https://phabricator.wikimedia.org/T360636) (owner: 10Hnowlan) [15:23:51] (03PS3) 10Elukey: readability: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:23:56] (03PS3) 10Elukey: revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:24:00] (03PS4) 10Elukey: revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:24:04] (03PS3) 10Elukey: revscoring-articletopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018990 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:24:09] (03PS3) 10Elukey: revscoring-draftquality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018992 
(https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:24:13] (03PS3) 10Elukey: revscoring-drafttopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018994 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:24:16] (03PS3) 10Elukey: revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018998 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:24:19] (03PS3) 10Elukey: revscoring-editquality-reverted: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019000 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:25:36] 06SRE, 06Infrastructure-Foundations, 10netops: magru network setup - https://phabricator.wikimedia.org/T362421 (10ayounsi) 03NEW [15:25:58] 06SRE, 06Infrastructure-Foundations, 10netops: magru network setup - https://phabricator.wikimedia.org/T362421#9710307 (10ayounsi) [15:26:01] (03PS1) 10Ayounsi: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) [15:26:35] (03CR) 10CI reject: [V:04-1] Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:27:01] (03PS2) 10Ayounsi: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) [15:27:34] (03CR) 10CI reject: [V:04-1] Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:28:57] (03CR) 10Eevans: [C:03+1] role::cassandra_dev: move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:31:24] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for 
host matomo1003.eqiad.wmnet with OS bookworm [15:32:39] (03CR) 10Eevans: [C:03+1] role::cassandra_dev: add new truststore to support PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019291 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:32:56] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host matomo1003.eqiad.wmnet with OS bookworm [15:33:12] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9710346 (10ayounsi) Prefixes assigned in Netbox: https://netbox.wikimedia.org/ipam/prefixes/?site_id=11 Next step is to create the devices in Netbox and assign the IPs to the... [15:33:24] (03PS1) 10Elukey: ml-services: remove article-descriptions from experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019294 [15:33:32] (03CR) 10Elukey: [C:03+2] role::cassandra_dev: add new truststore to support PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019291 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:34:11] (03PS3) 10Ayounsi: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) [15:34:43] (03CR) 10CI reject: [V:04-1] Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:34:48] (03PS1) 10MVernon: comments: correct typos of "top" for "to" [puppet] - 10https://gerrit.wikimedia.org/r/1019295 [15:35:18] (03PS2) 10MVernon: comments: correct typos of "top" for "to" [puppet] - 10https://gerrit.wikimedia.org/r/1019295 [15:35:27] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:35:43] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1019295 
(owner: 10MVernon) [15:37:02] (03CR) 10Kevin Bazira: [C:03+1] ml-services: remove article-descriptions from experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019294 (owner: 10Elukey) [15:37:58] (03CR) 10Elukey: [C:03+2] ml-services: remove article-descriptions from experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019294 (owner: 10Elukey) [15:39:23] (03PS4) 10Elukey: readability: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:40:00] (03CR) 10Elukey: [C:03+2] revscoring-articlequality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018988 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:40:18] (03CR) 10Elukey: [C:03+2] revscoring-articletopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018990 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:40:41] (03CR) 10Elukey: [C:03+2] revscoring-draftquality: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018992 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:40:56] (03CR) 10Bking: [C:03+1] comments: correct typos of "top" for "to" [puppet] - 10https://gerrit.wikimedia.org/r/1019295 (owner: 10MVernon) [15:40:59] (03CR) 10Elukey: [C:03+2] revscoring-drafttopic: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018994 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:41:17] (03CR) 10Elukey: [C:03+2] revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018998 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:41:35] (03CR) 10Elukey: [C:03+2] revscoring-editquality-reverted: Switch staging to mw-api-int-ro 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1019000 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:42:14] (03CR) 10Btullis: [C:03+1] comments: correct typos of "top" for "to" [puppet] - 10https://gerrit.wikimedia.org/r/1019295 (owner: 10MVernon) [15:42:28] (03PS4) 10Ayounsi: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) [15:43:26] (03PS3) 10Elukey: role::cassandra_dev: move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1019272 (https://phabricator.wikimedia.org/T352647) [15:44:52] (03PS4) 10Ilias Sarantopoulos: experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:45:40] (03PS5) 10Ilias Sarantopoulos: experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:46:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:46:32] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [15:46:49] (03CR) 10Elukey: [C:03+1] experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:47:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:48:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp1115.eqiad.wmnet [15:49:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[15:49:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:50:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on matomo1003.eqiad.wmnet with reason: host reimage [15:50:03] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878 [15:50:04] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878 [15:50:10] T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 [15:50:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:51:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[15:51:44] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878 [15:51:50] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1896/co" [puppet] - 10https://gerrit.wikimedia.org/r/1019297 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:51:54] (03CR) 10Ilias Sarantopoulos: [C:03+2] experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:51:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878 [15:52:02] (03CR) 10Elukey: [C:03+2] readability: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018964 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:52:26] (03Merged) 10jenkins-bot: experimental: Switch to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018963 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:53:44] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:54:19] (03PS4) 10Elukey: revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:55:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:55:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [15:56:14] (03CR) 10Elukey: [C:03+2] revertrisk: Switch staging to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018986 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [15:56:14] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on cp1115.eqiad.wmnet with reason: testing PXE boot issues [15:56:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on cp1115.eqiad.wmnet with reason: testing PXE boot issues [15:58:56] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9710449 (10elukey) Current status: * all services deployed in ml-staging, need to double check that all the pods are running but so far I didn't no... [15:59:09] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9710450 (10elukey) a:05Clement_Goubert→03None [15:59:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[16:01:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad api_appserver GET/200: 0.2377126359175149s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE [16:03:33] (03CR) 10Elukey: [V:03+1 C:03+2] role::cassandra_dev: force Cassandra instances to use the new truststore [puppet] - 10https://gerrit.wikimedia.org/r/1019297 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:04:33] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:04:53] this is due to the deployments that I did --^ [16:04:59] should autoresolve soon [16:05:43] (03CR) 10TChin: Add datasets-config helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:06:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad api_appserver GET/200: ... 
[16:06:15] 0.2423256480532805s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:09:33] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 30.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:16:21] (03PS7) 10Eevans: (WIP) cassandra-dev: surrogate user for cqlsh dev access [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [16:16:35] !log move cassandra instances on cassandra-dev to the new truststore (allowing PKI certs) - T352647 [16:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:50] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [16:19:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host matomo1003.eqiad.wmnet with OS bookworm [16:21:41] (03CR) 10Eevans: [C:03+1] comments: correct typos of "top" for "to" [puppet] - 10https://gerrit.wikimedia.org/r/1019295 (owner: 10MVernon) [16:40:34] (03CR) 10Dzahn: [C:03+2] community-crm: enable envoy [puppet] - 
10https://gerrit.wikimedia.org/r/1018362 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:52:24] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9710654 (10Jhancock.wm) @bking I got the HBA card replaced and it booted without any issues that I can find in the iDRAC. Can you check CLI to see if the ra... [16:52:25] (SystemdUnitFailed) firing: envoyproxy.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:13] ^ we are debugging this. service isn't in use yet. [17:00:17] !log crm2001 - on initial puppet run adding envoy build-envoy-config failed building config and service failed due to dependency issue. manual run of "sudo /usr/local/sbin/build-envoy-config -c /etc/envoy/" and restarted envoyproxy.service [17:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:55] (SystemdUnitFailed) resolved: envoyproxy.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:15] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcontrol2006-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9710739 (10Jhancock.wm) [18:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240412T0700) [18:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240412T1800). 
[18:04:18] no deploys in window [18:35:53] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudbackup2001.codfw.wmnet [18:40:43] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [18:41:04] (03PS1) 10Andrew Bogott: Remove refs to decom'd cloudbackup200[12] [puppet] - 10https://gerrit.wikimedia.org/r/1019326 (https://phabricator.wikimedia.org/T362438) [18:44:45] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudbackup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:46:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudbackup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:46:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudbackup2001.codfw.wmnet [18:47:04] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudbackup2002.codfw.wmnet [18:47:18] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to decom'd cloudbackup200[12] [puppet] - 10https://gerrit.wikimedia.org/r/1019326 (https://phabricator.wikimedia.org/T362438) (owner: 10Andrew Bogott) [18:52:59] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [18:55:09] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudbackup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:56:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudbackup2002.codfw.wmnet decommissioned, removing all IPs except the 
asset tag one - andrew@cumin1002" [18:56:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:24] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudbackup2002.codfw.wmnet [18:58:26] 10ops-codfw, 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9710846 (10Andrew) [19:36:31] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [19:36:32] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [19:37:28] (SystemdUnitCrashLoop) firing: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on elastic2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:40:35] 10ops-codfw, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Degraded RAID on elastic2088 - https://phabricator.wikimedia.org/T361525#9710866 (10bking) @Jhancock.wm looks good, thanks for your help! I'm taking off the DC Ops tags and putting this back in our queue to finish off. 
[20:00:01] (03CR) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:08:22] (03PS4) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) [20:09:36] (03PS8) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:14:34] (03PS4) 10JHathaway: vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) [20:14:34] (03PS5) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) [20:16:01] (03CR) 10JHathaway: vrts: create a profile for alias generation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:16:22] (03CR) 10CI reject: [V:04-1] elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:22:20] (03PS9) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:29:06] (03PS2) 10JHathaway: postfix: prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1019115 (https://phabricator.wikimedia.org/T325395) [20:29:06] (03PS2) 10JHathaway: postfix: prometheus ops config [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) [20:30:05] (03CR) 10CI reject: [V:04-1] elasticsearch: remove elasticsearch-curator dep 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:31:26] (03CR) 10JHathaway: [C:03+1] Deprecate system::role for IF services (batch three) [puppet] - 10https://gerrit.wikimedia.org/r/1019255 (owner: 10Muehlenhoff) [20:32:59] (03CR) 10JHathaway: postfix: prometheus exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019115 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:34:13] (03CR) 10JHathaway: [C:03+2] vrts: create a profile for alias generation [puppet] - 10https://gerrit.wikimedia.org/r/1019110 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:34:32] (03CR) 10JHathaway: [C:03+2] otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:35:12] (03PS6) 10JHathaway: otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) [20:35:36] (03CR) 10JHathaway: [V:03+2 C:03+2] otrs.conf: rename and tighten up epp type definitions [puppet] - 10https://gerrit.wikimedia.org/r/1019111 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:36:02] (03CR) 10JHathaway: [C:03+2] postfix: prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1019115 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:36:47] (03PS10) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:43:12] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:43:47] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:47:36] (03PS11) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:51:08] (03CR) 10Bking: [C:03+1] elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [20:55:57] (03PS12) 10Ryan Kemper: elasticsearch: remove elasticsearch-curator dep [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) (owner: 10Bking) [21:03:26] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:03:58] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:13:33] (03CR) 10Dzahn: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [21:15:36] (03CR) 10Dzahn: scap: add option to selectivlely disable bootstrapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [21:41:18] (03PS15) 10Bking: WIP: elasticsearch: prevent cross cluster seed config drift [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) [21:41:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1018360 (https://phabricator.wikimedia.org/T358389) (owner: 10Bking) [22:27:14] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9711212 (10Papaul) @ssingh one thing that I found between the server NiC and the switch interface is the vendor . In Eqiad, I checked 3... 
[23:23:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.181s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:28:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.181s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:37:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1019366 [23:37:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1019366 (owner: 10TrainBranchBot)