[00:06:59] (CR) Scott French: [C:+1] "Spot checking how this works in practice, I _think_ this is coming from the `full-monitoring-metrics-access-${proto}` resources instantiat" [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[00:09:34] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1140262
[00:10:38] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140263
[00:11:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140263 (owner: TrainBranchBot)
[00:18:43] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:20:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[00:31:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1140263 (owner: TrainBranchBot)
[00:51:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/45956c7530d69111089a6daa45ee64e2fdeefc7e9c413a2eea64a5c47464ba0f/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:11:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:15:39] (CR) Tim Starling: [C:+2] testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) (owner: MusikAnimal)
[01:16:26] (Merged) jenkins-bot: testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) (owner: MusikAnimal)
[01:18:19] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1136466|testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks (T377121)]]
[01:18:23] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[01:20:49] (PS1) Scott French: P:mediawiki::maintenance::pageassessments: migrate to k8s [puppet] - https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536)
[01:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:25:05] !log tstarling@deploy1003 tstarling, musikanimal: Backport for [[gerrit:1136466|testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:25:08] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[01:25:29] !log tstarling@deploy1003 tstarling, musikanimal: Continuing with sync
[01:28:46] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10782607 (ArthurPSmith) So https://www.wikidata.org/wiki/Property:P13551 has now bee...
[01:32:12] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136466|testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks (T377121)]] (duration: 13m 52s)
[01:32:14] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[01:46:42] (PS2) Scott French: P:mediawiki::maintenance::pageassessments: migrate to k8s [puppet] - https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536)
[01:46:42] (CR) Scott French: "I was briefly tempted to initially put the script in dry-run mode [0] for a first initial run, but given how simple it actually is, I don'" [puppet] - https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536) (owner: Scott French)
[01:53:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:09:38] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:44:00] (CR) C. Scott Ananian: "Ok, safe to deploy now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1137443 (owner: C. Scott Ananian)
[04:18:43] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:52:27] FIRING: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:36] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:22:22] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:38:22] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:53:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T0600)
[06:00:04] marostegui, Amir1, and federico3: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T0600).
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:23:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:17] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:52:16] SRE, serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10782948 (Ladsgroup) Yeah. Can you file a ticket for better monitoring?
[06:53:24] (CR) Slyngshede: [C:+1] admin: temporarily remove ssh key for aborrero [puppet] - https://gerrit.wikimedia.org/r/1140200 (owner: Arturo Borrero Gonzalez)
[07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:50:54] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10782973 (Ollie.Shotton_WMDE) > Is the problem currently resolved by a process that...
[08:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T0800)
[08:10:19] (PS1) Slyngshede: data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474
[08:10:33] (CR) CI reject: [V:-1] data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474 (owner: Slyngshede)
[08:11:05] (Abandoned) Slyngshede: data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474 (owner: Slyngshede)
[08:11:08] (Restored) Slyngshede: data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474 (owner: Slyngshede)
[08:11:14] (PS2) Slyngshede: data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474
[08:12:12] (Abandoned) Slyngshede: mgmt module [software/bitu] - https://gerrit.wikimedia.org/r/918245 (owner: Slyngshede)
[08:12:18] (Abandoned) Slyngshede: Offboarding: Allow managers to offboard users. [software/bitu] - https://gerrit.wikimedia.org/r/920665 (https://phabricator.wikimedia.org/T335476) (owner: Slyngshede)
[08:12:32] I am not running the MediaWiki train since it is a holiday here :)
[08:18:43] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:52:27] FIRING: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:14:21] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10783059 (A_smart_kitten) It might be that the fix hasn't actually been deployed yet...
[09:19:06] (PS1) Btullis: Add role and partman config for new an-test hosts [puppet] - https://gerrit.wikimedia.org/r/1140478 (https://phabricator.wikimedia.org/T393030)
[09:20:16] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1140478 (https://phabricator.wikimedia.org/T393030) (owner: Btullis)
[09:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:23:43] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[09:31:20] (CR) Hnowlan: [C:+1] P:mediawiki::maintenance::pageassessments: migrate to k8s [puppet] - https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536) (owner: Scott French)
[09:32:58] (CR) Hnowlan: [C:+2] mw::maintenance: migrate mostlinked job to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1140214 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[09:45:51] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[09:46:03] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[09:48:43] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[09:53:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1000)
[10:02:24] RESOLVED: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[10:10:41] (CR) Hnowlan: [C:+2] mw::maintenance: migrate all remaining general updatequerypages jobs [puppet] - https://gerrit.wikimedia.org/r/1140216 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[10:14:35] (PS1) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:15:46] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:15:59] (PS2) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:17:11] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:21:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:26:17] (PS1) Hnowlan: mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542)
[10:26:41] (CR) CI reject: [V:-1] mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: Hnowlan)
[10:28:40] (PS2) Hnowlan: mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542)
[10:28:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:03] (CR) CI reject: [V:-1] mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: Hnowlan)
[10:29:59] (PS3) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:31:11] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:31:59] (PS4) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:35:32] (PS5) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:36:46] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:42:03] (PS3) Hnowlan: mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542)
[10:44:03] (PS6) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:45:18] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:48:12] (PS7) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:48:27] (PS4) Hnowlan: mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542)
[10:49:23] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:50:30] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T393091 (LSobanski) NEW
[10:53:27] sre-alert-triage, Data-Platform-SRE (2025-04-12 - 2025-05-02): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T393091#10783195 (BTullis) a:BTullis Sorry for having missed this. I reset the failed units. ` btullis@stat1008:~$ systemctl --failed...
[10:53:39] sre-alert-triage, Data-Platform-SRE (2025-04-12 - 2025-05-02): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T393091#10783199 (BTullis) Open→Resolved
[10:54:39] (PS8) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[10:55:54] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[10:59:00] (PS9) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[11:00:13] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[11:01:52] (PS1) Hnowlan: mw::maintenance: migrate continuousScan-commonswiki [puppet] - https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799)
[11:05:02] (PS10) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[11:06:50] ops-eqiad, SRE, Data-Persistence, DC-Ops, Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10783204 (MatthewVernon) Thanks, this all looks good to me (and I had a bit of a poke at ms-be1091 myself). To summarise: - old-st...
[11:12:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[11:12:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[11:15:02] (CR) Dreamy Jazz: [C:+1] "Looks good from a TSP point of view. I would agree with the commit message." [puppet] - https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan)
[11:21:44] (PS11) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[11:24:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[11:24:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[11:29:58] (CR) Marostegui: [C:+1] swift: remove ms-be1060 entirely [puppet] - https://gerrit.wikimedia.org/r/1140130 (https://phabricator.wikimedia.org/T392796) (owner: MVernon)
[11:32:17] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1200)
[12:31:46] (CR) MVernon: "This change did seemingly get rolled out to some hosts (e.g. db1155), which are now unable to update from operations/mediawiki-config beca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[12:33:41] (PS12) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[12:39:33] (PS13) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[12:40:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[12:40:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[12:46:00] (PS14) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[12:46:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[12:46:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[12:50:27] (PS15) Btullis: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738)
[12:52:27] FIRING: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:52:49] (CR) Btullis: [C:+2] mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[12:54:15] (Merged) jenkins-bot: mediawiki-dumps-legacy: Add a networkpolicy to allow publishing dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1140481 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[12:58:12] (CR) Thcipriani: "Anything that fetched this from operations/mediawiki-config will need to specify how to reconcile divergent branches in their git config b" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1300).
[13:00:05] _Gerges and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] o/
[13:00:28] (PS1) Slyngshede: VueJS Permissions App [software/bitu] - https://gerrit.wikimedia.org/r/1140498
[13:00:32] <_Gerges> Here
[13:06:24] * TheresNoTime can deploy in a couple of minutes
[13:08:03] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[13:08:49] (Merged) jenkins-bot: [arwiki] Change logo and tagline with sync wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[13:09:24] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1140193|[arwiki] Change logo and tagline with sync wordmark (T392858)]]
[13:09:27] T392858: Change Arabic Wikipedia logo and tagline - https://phabricator.wikimedia.org/T392858
[13:10:02] TheresNoTime: can you run fixStuckGlobalRename.php for T393093
[13:10:02] T393093: Unblock stuck global rename of A826 - https://phabricator.wikimedia.org/T393093
[13:10:13] anzx: ack
[13:17:43] !log samtar@deploy1003 gergesshamon, samtar: Backport for [[gerrit:1140193|[arwiki] Change logo and tagline with sync wordmark (T392858)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:17:46] T392858: Change Arabic Wikipedia logo and tagline - https://phabricator.wikimedia.org/T392858
[13:17:50] _Gerges: ready for testing on mwdebug
[13:18:20] (CR) Bking: [C:+1] Add role and partman config for new an-test hosts [puppet] - https://gerrit.wikimedia.org/r/1140478 (https://phabricator.wikimedia.org/T393030) (owner: Btullis)
[13:20:27] <_Gerges> If possible, delete the cache: https://ar.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-ar.svg
[13:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:22:48] done, but am now misremembering if this normally works when it's only on the testservers - perhaps we can just continue with the sync and then purge & retry _Gerges ?
[13:23:07] <_Gerges> When I go to any page in arwiki, the old tagline appears, but when I go to the tagline link, the new one appears.
[13:23:36] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:24:11] _Gerges: I am going to continue the sync and then we can recheck again at the end
[13:24:28] !log samtar@deploy1003 gergesshamon, samtar: Continuing with sync
[13:24:42] <_Gerges> Ok
[13:31:18] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140193|[arwiki] Change logo and tagline with sync wordmark (T392858)]] (duration: 21m 53s)
[13:31:21] T392858: Change Arabic Wikipedia logo and tagline - https://phabricator.wikimedia.org/T392858
[13:31:36] _Gerges: can you check now?
[13:32:03] FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[13:32:46] !incidents
[13:32:47] 6082 (UNACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad)
[13:32:47] 6074 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2006:9100 node /srv codfw)
[13:32:53] !ack 6082
[13:32:54] 6082 (ACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad)
[13:34:28] !log lowering sessionstore gc_grace_seconds to 172800 (two days) - T390514
[13:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:01] anzx: going to move to your patch
[13:35:31] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140486 (https://phabricator.wikimedia.org/T392984) (owner: Anzx)
[13:36:17] (Merged) jenkins-bot: mswikisource: add NamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/1140486 (https://phabricator.wikimedia.org/T392984) (owner: Anzx)
[13:36:18] TheresNoTime: ok
[13:36:36] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1140486|mswikisource: add NamespacesToBeSearchedDefault (T392984)]]
[13:36:40] T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984
[13:37:03] RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[13:38:39] (PS1) Cathal Mooney: Adjust OSPF metric on cr3-ulsfo -> cr4-ulsfo 100G link [homer/public] - https://gerrit.wikimedia.org/r/1140500 (https://phabricator.wikimedia.org/T390731)
[13:39:13] (CR) JHathaway: [C:+1] data.yaml: Offboarding marktraceur [puppet] - https://gerrit.wikimedia.org/r/1140474 (owner: Slyngshede)
[13:39:29] <_Gerges> @TheresNoTime: all on live
[13:39:42] _Gerges: ack :)
[13:39:56] !log invoking garbagecollect on sessionstore cluster - T390514
[13:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:17] <_Gerges> Thanks :)
[13:41:26] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10783323 (Jclark-ctr) @Stevemunene the drives have been swapped
[13:41:34] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10783324 (Jclark-ctr)
[13:41:35] !log samtar@deploy1003 anzx, samtar: Backport for [[gerrit:1140486|mswikisource: add NamespacesToBeSearchedDefault (T392984)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:41:54] anzx: ready on mwdebug
[13:42:37] TheresNoTime: looks good
[13:42:42] !log samtar@deploy1003 anzx, samtar: Continuing with sync
[13:45:36] (CR) Cathal Mooney: [C:+2] Adjust OSPF metric on cr3-ulsfo -> cr4-ulsfo 100G link [homer/public] - https://gerrit.wikimedia.org/r/1140500 (https://phabricator.wikimedia.org/T390731) (owner: Cathal Mooney)
[13:46:08] (Merged) jenkins-bot: Adjust OSPF metric on cr3-ulsfo -> cr4-ulsfo 100G link [homer/public] - https://gerrit.wikimedia.org/r/1140500 (https://phabricator.wikimedia.org/T390731) (owner: Cathal Mooney)
[13:49:21] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140486|mswikisource: add NamespacesToBeSearchedDefault (T392984)]] (duration: 12m 44s)
[13:49:23] T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984
[13:49:39] anzx: will run `namespaceDupes`
[13:49:55] TheresNoTime: already ran yesterday
[13:50:08] when adding namespace
[13:50:27] ah okay
[13:50:39] will run `fixStuckGlobalRename.php` now instead then :D
[13:51:40] !log ran `[samtar@deploy1003 ~]$ mwscript-k8s --comment="T393093" --follow -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=knwikiquote --logwiki=metawiki '~aanzx' 'A826'` for T393093
[13:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:42] T393093: Unblock stuck global rename of A826 - https://phabricator.wikimedia.org/T393093
[13:51:44] TheresNoTime: Thanks for deploying
[13:51:58] np!
[13:52:36] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10783350 (Jhancock.wm) a:Jhancock.wm @Marostegui new disk installed.
let me know if it all looks good on your end. [13:53:30] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10783352 (10cmooney) p:05High→03Medium [13:53:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:57:42] TheresNoTime: Thanks again, rename has been completed [13:58:07] :) [14:02:12] (03CR) 10Btullis: [V:03+1 C:03+2] Add role and partman config for new an-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140478 (https://phabricator.wikimedia.org/T393030) (owner: 10Btullis) [14:03:35] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10783380 (10BTullis) a:05BTullis→03None site.pp and preseed.yaml updated and merged. Should be good to go. [14:03:58] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10783383 (10BTullis) a:05BTullis→03None site.pp and preseed.yaml updated. [14:12:55] 06SRE, 06Data-Engineering, 10superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097 (10hnowlan) 03NEW [14:13:33] 06SRE, 07Data-Platform, 10superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10783408 (10hnowlan) [14:15:14] 06SRE, 07Data-Platform, 10superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10783414 (10ssingh) I have been facing this, intermittently, over the past week or so. Refreshing helps sometimes but not always. Some additional points that might help with d... 
[14:23:26] 06SRE, 07Data-Platform, 10superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10783430 (10hnowlan) p:05Triage→03Unbreak! [14:23:55] 06SRE, 07Data-Platform, 10superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10783431 (10hnowlan) Setting this to UBN as it's actively impairing incident response. Not 100% sure if the tags are right on this, please move as needed. [14:28:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10783438 (10Jhancock.wm) @Papaul so fun time. the external labels for the serial numbers on these servers got swapped. gonna update netbox to match internal. reimage... [14:31:41] (03PS1) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [14:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:52] (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [14:50:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:52:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:06] (03PS2) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [14:54:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:54:58] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:55:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:00:05] hashar and dduvall: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1500) [15:00:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:00:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:02:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:53] (03PS1) 10Dzahn: gerrit: enable bacula backups on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) [15:14:47] (03PS1) 10Dzahn: gerrit: enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) [15:17:58] RECOVERY - Dell PowerEdge RAID Controller on db2176 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:28:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [15:28:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10783544 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [15:29:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [15:29:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10783545 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [15:29:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm [15:29:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10783546 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm [15:31:45] 06SRE, 06serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (Radar), 07Test-Coverage: Add pcov PHP extension to wikimedia apt (and upgrade from 1.0.6-4+wmf1~buster1 to 1.0.11) so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847#10783562 (10Daimona) 05Op... 
[15:33:43] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:23] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:34:37] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:34:57] (03PS1) 10Btullis: Add an ssh known_hosts configmap to the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140508 (https://phabricator.wikimedia.org/T390738) [15:36:07] (03CR) 10CI reject: [V:04-1] Add an ssh known_hosts configmap to the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140508 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:14] (03PS2) 10Btullis: Add an ssh known_hosts configmap to the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140508 (https://phabricator.wikimedia.org/T390738) [15:40:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2045.codfw.wmnet with reason: host reimage [15:42:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2045.codfw.wmnet with reason: host reimage [15:58:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:00:04] jhathaway and rzl: Your horoscope predicts another Puppet request 
window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:34] jhancock@cumin2002 reimage (PID 1225700) is awaiting input [16:01:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:01:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2045.codfw.wmnet with OS bookworm [16:02:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10783696 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm completed: - gane... [16:02:13] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:04:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:28] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102 (10RobH) 03NEW [16:12:06] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10783752 (10RobH) [16:12:28] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10783754 (10RobH) a:03Andrew @andrew, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-... [16:19:04] (03CR) 10Hnowlan: "Thanks! Given the importance of this job I am going to hold until next week to merge this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [16:24:34] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104 (10RobH) 03NEW [16:24:50] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10783798 (10RobH) [16:24:56] (03CR) 10Dzahn: [C:03+1] wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [16:25:21] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10783800 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-t... [16:26:45] 10ops-codfw, 06DC-Ops: test servers for new cage - https://phabricator.wikimedia.org/T393105 (10Jhancock.wm) 03NEW [16:26:58] 10ops-codfw, 06DC-Ops: test servers for new cage - https://phabricator.wikimedia.org/T393105#10783837 (10Jhancock.wm) p:05Triage→03Medium [16:28:46] (03CR) 10Btullis: [C:03+2] Add an ssh known_hosts configmap to the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140508 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:30:37] (03Merged) 10jenkins-bot: Add an ssh known_hosts configmap to the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140508 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:33:02] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106 (10RobH) 03NEW [16:33:12] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10783869 
(10RobH) [16:33:57] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10783872 (10RobH) a:03Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers.... [16:35:12] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107 (10RobH) 03NEW [16:38:35] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10783898 (10RobH) a:03Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers.... [16:38:38] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [16:38:45] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10783903 (10RobH) [16:39:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:40:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [16:40:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:47:04] (03CR) 10Dzahn: [C:03+2] gerrit: have different motd banners on active/passive servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [16:50:33] jhancock@cumin2002 reimage (PID 1226373) is awaiting input [16:50:47] jhancock@cumin2002 reimage (PID 1226145) is awaiting input [16:52:27] FIRING: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:56:13] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110 (10RobH) 03NEW [16:56:59] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10783970 (10RobH) [16:57:20] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10783971 (10RobH) a:03Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. T... [17:00:05] bd808: Your horoscope predicts another Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1700) [17:06:34] (03PS2) 10Dzahn: vrts: add junk queue count and remove mobile queue [puppet] - 10https://gerrit.wikimedia.org/r/1140207 (https://phabricator.wikimedia.org/T389079) (owner: 10AOkoth) [17:09:43] (03CR) 10Dzahn: [C:03+1] "generally looks good to me. I don't know all details like how you got to "q ID 3" etc.. 
but seems easy enough to verify if it works after " [puppet] - 10https://gerrit.wikimedia.org/r/1140207 (https://phabricator.wikimedia.org/T389079) (owner: 10AOkoth) [17:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 14.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:10:20] hmm [17:10:48] (03CR) 10Dzahn: [C:03+2] "resolved. you can see the banners now when logging in on gerrit1003/gerrit2002." [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [17:12:58] (03Abandoned) 10Dzahn: Revert "gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1140251 (owner: 10Dzahn) [17:14:13] (03Abandoned) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:14:39] (03Restored) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:15:12] (03Abandoned) 10Ssingh: P:dns::auth: log commit message to SAL for authdns-update [puppet] - 10https://gerrit.wikimedia.org/r/1122192 (owner: 10Ssingh) [17:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[17:15:54] (03CR) 10Dzahn: [C:04-1] "since we want 2 replicas now.. we need to add to this config.. I guess using both [0] and [1] of the array of replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:24:13] (03PS1) 10Bking: cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) [17:24:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [17:24:53] (03CR) 10CI reject: [V:04-1] cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [17:25:07] (03PS3) 10Ssingh: trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330 [17:26:23] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5420/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [17:26:45] (03PS1) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 [17:27:58] (03PS2) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) [17:31:48] !log testing sasl email relaying on mx-in{1001,2001} [17:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:53] (03PS2) 10Bking: 
cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) [17:36:34] (03CR) 10CI reject: [V:04-1] cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [17:38:13] (03PS3) 10Bking: cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) [17:38:27] (03PS3) 10Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) [17:40:25] (03CR) 10Dzahn: "If we agree on this one and got it working.. I would then rebase https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129919 on top of it" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:41:02] (03CR) 10Dzahn: [C:04-1] "This is now supposed to come after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140520 and be rebased on top of that if we agree " [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:42:21] (03CR) 10Ebernhardson: cirrussearch: don't filter out self cluster settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [17:49:06] (03PS4) 10Bking: cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) [17:49:44] (03CR) 10Bking: cirrussearch: don't filter out self cluster settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [17:49:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140519 
(https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [17:53:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T1800) [18:04:43] (03CR) 10BCornwall: [C:03+1] wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:06:22] (03CR) 10BCornwall: [C:03+1] wdqs-internal: move back to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:12:01] (03PS1) 10Andrea Denisse: grafana: Add conditional data sync via enable_sync hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) [18:12:01] (03CR) 10Andrea Denisse: "Hi team, I think that thsi variable is going to be useful during testing or version upgrades where changes to one instance should not prop" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [18:12:03] (03PS5) 10Ryan Kemper: wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) [18:12:03] (03PS2) 10Ryan Kemper: wdqs-internal: remove lvs VIP [dns] - 10https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151) [18:12:24] (03CR) 10BCornwall: [C:03+1] wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:12:43] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.27 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140524 (https://phabricator.wikimedia.org/T386222) [18:12:45] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140524 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [18:12:59] (03CR) 10BCornwall: [C:03+1] wdqs-internal: remove service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:13:10] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:13:19] (03CR) 10BCornwall: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [dns] - 10https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:13:35] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140524 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [18:20:59] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [18:21:13] !log ryankemper@dns1004 START - running authdns-update [18:23:06] (03CR) 10Eevans: [C:03+2] adjust hosts lists to reflect changes in restbase cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138854 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [18:23:35] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:23:50] !log ryankemper@dns1004 END - running authdns-update [18:24:41] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: move back to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) 
[18:24:44] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.27 refs T386222 [18:24:46] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [18:24:48] (03Merged) 10jenkins-bot: adjust hosts lists to reflect changes in restbase cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138854 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [18:26:48] !log T376151 (wdqs-internal lvs teardown) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136744 to flip `wdqs-internal` service state to `lvs_setup` and running puppet across `A:dnsbox` [18:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:51] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [18:27:59] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply [18:28:25] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply [18:28:27] (03PS4) 10Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151) [18:28:27] (03PS2) 10Ryan Kemper: wdqs-internal: remove service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151) [18:28:27] (03PS2) 10Ryan Kemper: wdqs-internal: rip out remaining logic/config [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) [18:28:55] so far so good for group1. 
i'm going to wait about 20 mins before rolling all wikis [18:29:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:29:33] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [18:30:43] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [18:31:43] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/echostore: apply [18:32:29] (03CR) 10Bking: [C:03+1] wdqs-internal: rip out remaining logic/config [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:32:49] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [18:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:42:47] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [18:44:54] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] ran puppet on `O:Lvs::balancer` after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136747 [18:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:57] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [18:46:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:48:03] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal 
[18:48:16] ^ anticipated [18:48:23] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:48:24] alright. rolling all wikis [18:48:38] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-secondary-eqiad`, it only restarted on ` lvs1020` but for some reason ` lvs1013` doesn't have a pybal service running [18:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:55] yeah sigh, we need to fix that alias [18:49:09] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140527 (https://phabricator.wikimedia.org/T386222) [18:49:10] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140527 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [18:49:13] ryankemper: did you run on secondaries first? [18:49:20] that would be 1020 and 2014 [18:49:37] I will fix the alias today itself [18:49:41] I've only ran on `A:lvs-secondary-eqiad` so far [18:49:46] ok [18:49:49] so sounds like it just targeted wrong host [18:49:59] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140527 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [18:50:02] yep, that's the wrong alias it pulls [18:50:07] ._. 
[18:50:07] that's on me [18:50:19] 1020 is fine and so is 2014 [18:50:23] 1013 is not [18:50:25] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:50:29] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:51:17] (03PS1) 10Majavah: Revert "admin: Temporarily remove Taavi's access" [puppet] - 10https://gerrit.wikimedia.org/r/1140528 [18:52:21] waiting on sukhe's patch before proceeding ofc [18:53:21] ryankemper: you can just go ahead [18:53:41] sukhe: Are the other aliases correct? [18:53:41] and restart manually or pass the specific hosts [18:53:49] sukhe: okay what do i do about pybal not running on 1013 though? [18:53:52] brett: yeah all should be [18:54:08] that is a test host so skip that [18:54:12] as long as you target: [18:54:22] 1020 (secondary) and 1019 [18:54:34] got it, proceeding [18:54:39] and 2014 (secondary) and 2019 [18:54:43] er 2013 [18:54:47] it's all good [18:55:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:55:30] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-low-traffic-eqiad` (lvs1019), waiting few mins before proceeding [18:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:33] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [18:57:34] sukhe@cumin1002:~$ sudo cumin "A:lvs-secondary-eqiad" [18:57:35] 2 hosts will be targeted: [18:57:35] lvs[1013,1020].eqiad.wmnet [18:57:35] DRY-RUN mode enabled, aborting [18:57:35] next up will be 2014 (codfw secondary) followed by 2013 (codfw low traffic primary) [18:57:40] this is the only wrong one, so fixing [18:57:42] ryankemper: cool [18:58:18] !log T376151 
[wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-secondary-codfw` (lvs2014), waiting 2 mins before proceeding [18:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:44] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on `A:lvs-low-traffic-codfw` (lvs2013) [18:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:10] (03CR) 10Andrea Denisse: [C:03+1] Revert "admin: Temporarily remove Taavi's access" [puppet] - 10https://gerrit.wikimedia.org/r/1140528 (owner: 10Majavah) [19:01:13] (03CR) 10Andrea Denisse: [C:03+2] Revert "admin: Temporarily remove Taavi's access" [puppet] - 10https://gerrit.wikimedia.org/r/1140528 (owner: 10Majavah) [19:03:03] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:03:05] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] `ipvsadm --delete-service --tcp-service 10.2.1.41:80` on `A:lvs-secondary-codfw OR A:lvs-low-traffic-codfw`(lvs2013, lvs2014) [19:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:16] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [19:04:21] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] `ipvsadm --delete-service --tcp-service 10.2.2.41:80` on `lvs1019` and `lvs1020` [19:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:25] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:05:29] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:06:15] !log helm error during group2 deployment "Get 
"https://kubemaster.svc.codfw.wmnet:6443/api/v1/namespaces/mw-jobrunner/services/mediawiki-main-tls-service": dial tcp 10.2.1.8:6443: connect: no route to host - error from a previous attempt: read tcp 10.64.16.93:41894->10.2.1.8:6443: read: connection reset by peer" [19:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:23] PROBLEM - SSH on netbox1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:08:23] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:08:43] FIRING: [10x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:13] RECOVERY - SSH on netbox1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:09:41] !log T376151 [wdqs-internal lvs teardown -> pybal rolling restart] all IPVS diff check alerts have recovered, rolling restart complete [19:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:44] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [19:09:47] !log deployment of mw-jobrunner-main for codfw failed during scap train (group2) (T386222) [19:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:50] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [19:09:57] !log T376151 [wdqs-internal lvs teardown] running puppet across `A:wdqs-internal` now that pybal has been restarted [19:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:23] (03PS1) 10JHathaway: wikimedia.org: postmaster tools verification [dns] - 
10https://gerrit.wikimedia.org/r/1140529 [19:10:42] FIRING: JobUnavailable: Reduced availability for job netbox_global in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:11:45] (03PS1) 10Ssingh: wmflib: get_lvs_class_hosts() to exclude test LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140530 [19:12:09] (03CR) 10CI reject: [V:04-1] wmflib: get_lvs_class_hosts() to exclude test LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [19:12:32] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5424/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [19:12:42] FIRING: [10x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:12:58] (03CR) 10Ssingh: [C:03+1] wikimedia.org: postmaster tools verification [dns] - 10https://gerrit.wikimedia.org/r/1140529 (owner: 10JHathaway) [19:13:20] (03CR) 10JHathaway: [C:03+2] wikimedia.org: postmaster tools verification [dns] - 10https://gerrit.wikimedia.org/r/1140529 (owner: 10JHathaway) [19:13:25] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:39] (03PS2) 10Ssingh: wmflib: get_lvs_class_hosts() to exclude test LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140530 [19:14:04] !log jhathaway@dns1004 START - running authdns-update [19:14:23] (03CR) 10Ssingh: [V:03+1] "PCC 
SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5425/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [19:16:36] !log jhathaway@dns1004 END - running authdns-update [19:16:41] (03CR) 10Ssingh: [V:03+1] "-lvs-secondary: P{lvs1013* or lvs1020* or lvs2014* or lvs3010* or lvs4010* or lvs5006* or lvs6003* or lvs7003*}" [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [19:16:47] brett: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140530 [19:17:11] (03PS3) 10Ssingh: wmflib: get_lvs_class_hosts() to exclude test LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140530 [19:17:23] (03CR) 10Ssingh: "Commit message updated, no change since PCC run." [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [19:17:48] sukhe: Will look in a moment [19:18:13] no worries, not urgent, but will prevent the lvs1013 from happening again [19:18:25] FIRING: [11x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:58] !log dduvall@deploy1003 Started scap sync-world: retrying sync-world following spurious helmfile apply error (mw-jobrunner codfw) [19:20:09] (03PS3) 10Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) [19:20:10] !log sukhe@netbox1003:~$ sudo systemctl start uwsgi-netbox.service: service was OOM'ed, restarting [19:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:33] (03CR) 10CI reject: [V:04-1] SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 
(https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [19:20:42] RESOLVED: JobUnavailable: Reduced availability for job netbox_global in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:21:49] (03PS4) 10Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) [19:22:12] (03CR) 10CI reject: [V:04-1] SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [19:22:41] (03PS5) 10Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) [19:23:25] FIRING: [15x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:45] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [19:28:25] FIRING: [15x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:59] (03PS2) 10Dwisehaupt: monitoring: remove check_puppetrun.rb [puppet] - 10https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) [19:30:05] (03CR) 10Dwisehaupt: "Thanks. 
I've updated the changeset to remove the file. We are still working on shifting to prometheus based alerting for frack." [puppet] - 10https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) (owner: 10Dwisehaupt) [19:30:22] !log dduvall@deploy1003 Finished scap sync-world: retrying sync-world following spurious helmfile apply error (mw-jobrunner codfw) (duration: 11m 24s) [19:31:55] (03PS3) 10Ryan Kemper: wdqs-internal: remove service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151) [19:31:55] (03PS3) 10Ryan Kemper: wdqs-internal: rip out remaining logic/config [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) [19:31:55] (03PS1) 10Ryan Kemper: wdqs: remove realserver includes [puppet] - 10https://gerrit.wikimedia.org/r/1140531 (https://phabricator.wikimedia.org/T376151) [19:32:06] (03CR) 10Ahmon Dancy: [C:03+1] "This is ready for merging now. Works in beta. Prod should be unaffected. 
Latest PCC results are here: https://puppet-compiler.wmflabs.o" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [19:33:25] FIRING: [15x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:33:27] !log re-ran scap sync to fix mw-jobrunner codfw deployments following failed helmfile apply and verified correct image ref manually (T386222) [19:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:29] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [19:33:30] (03PS2) 10Ryan Kemper: wdqs: remove realserver includes [puppet] - 10https://gerrit.wikimedia.org/r/1140531 (https://phabricator.wikimedia.org/T376151) [19:33:30] (03PS4) 10Ryan Kemper: wdqs-internal: remove service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151) [19:33:30] (03PS4) 10Ryan Kemper: wdqs-internal: rip out remaining logic/config [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) [19:33:48] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140531 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [19:34:27] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [19:34:41] !log running sre.dns.netbox to ensure no pending changes [19:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:56] !log [correction] running sre.dns.netbox to ensure no pending changes (NOT in dry-run) [19:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:33] (03CR) 10Bking: [C:03+2] cirrussearch: don't filter out self cluster settings 
[puppet] - 10https://gerrit.wikimedia.org/r/1140519 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [19:36:37] (03CR) 10BCornwall: [C:03+1] wdqs: remove realserver includes [puppet] - 10https://gerrit.wikimedia.org/r/1140531 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [19:36:43] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove realserver includes [puppet] - 10https://gerrit.wikimedia.org/r/1140531 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [19:36:54] (03CR) 10Ssingh: "You most likely will need to merge this change to see a correct PCC run, since right now, Puppet is failing on the host." [puppet] - 10https://gerrit.wikimedia.org/r/1140531 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [19:37:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:37:17] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:30] !log no pending Netbox changes [19:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:03] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: remove service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [19:44:13] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: rip out remaining logic/config [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [19:46:38] (03PS3) 10Ryan Kemper: wdqs-internal: remove lvs VIP [dns] - 10https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151) [19:49:41] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:54:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:54:52] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install (2) rdb hosts - https://phabricator.wikimedia.org/T393121 (10RobH) 03NEW [19:56:49] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10784411 (10RobH) [19:57:10] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10784414 (10RobH) [19:59:41] RESOLVED: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T2000). [20:00:05] bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:23] o/ [20:02:27] RESOLVED: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:03:08] I can deploy [20:03:42] bvibber: is it okay to do both at the same time? [20:03:50] jeena: yeah [20:04:23] thanks :D [20:04:39] np [20:05:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140229 (https://phabricator.wikimedia.org/T389125) (owner: 10Bvibber) [20:05:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140228 (https://phabricator.wikimedia.org/T389126) (owner: 10Bvibber) [20:05:16] I'm using the spiderpig! You can see the progress on there I think https://spiderpig.wikimedia.org/ [20:05:48] aww i don't have permission to spiderpig :D [20:05:56] aww darn [20:06:07] (03Merged) 10jenkins-bot: Check for content validity before extracting license [extensions/JsonConfig] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140229 (https://phabricator.wikimedia.org/T389125) (owner: 10Bvibber) [20:11:31] bvibber: https://wikitech.wikimedia.org/wiki/Scap/SpiderPig#Access_to_SpiderPig [20:12:43] (I have a task open to at least let existing deployers skip that step: https://phabricator.wikimedia.org/T392958 ) [20:13:12] :) thx [20:13:21] requested access (i have deploy, in theory) [20:13:52] though i haven't deployed mediawiki in a long time :D [20:14:30] well then sounds like you're our target audience :D [20:14:33] (03Merged) 10jenkins-bot: Fix localization for validation errors checking tabular data [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140228 (https://phabricator.wikimedia.org/T389126) (owner: 10Bvibber) [20:14:48] !log 
jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1140229|Check for content validity before extracting license (T389125)]], [[gerrit:1140228|Fix localization for validation errors checking tabular data (T389126)]] [20:14:55] T389125: Possible to break a page by passing in a chart that is not a JSON - https://phabricator.wikimedia.org/T389125 [20:14:55] T389126: Possible to break a page with invalid data - https://phabricator.wikimedia.org/T389126 [20:16:08] goal is to make backport deploys as easy as showing up to backport windows: enter patch, check on mwdebug, done [20:16:20] nice! [20:19:26] (03PS1) 10Dzahn: gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) [20:19:26] (03PS1) 10Ryan Kemper: wdqs-internal: remove old alias [puppet] - 10https://gerrit.wikimedia.org/r/1140535 (https://phabricator.wikimedia.org/T376151) [20:19:53] (03CR) 10CI reject: [V:04-1] gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:22:07] (03CR) 10Bking: [C:03+2] wdqs-internal: remove old alias [puppet] - 10https://gerrit.wikimedia.org/r/1140535 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [20:22:48] (03PS2) 10Dzahn: gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) [20:23:09] (03CR) 10BCornwall: [C:03+1] wmflib: get_lvs_class_hosts() to exclude test LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [20:23:13] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Temporarily remove lucaswerkmeister-wmde SSH key" [puppet] - 10https://gerrit.wikimedia.org/r/1140536 [20:24:57] (03CR) 10CI reject: [V:04-1] gerrit: rename host and replica_hosts array to primary/replica_vhost 
[puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:25:58] (03PS1) 10Bking: cirrussearch: add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1140537 (https://phabricator.wikimedia.org/T388610) [20:26:47] (03PS2) 10Bking: cirrussearch: add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1140537 (https://phabricator.wikimedia.org/T388610) [20:27:13] (03PS3) 10Bking: cirrussearch: add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1140537 (https://phabricator.wikimedia.org/T388610) [20:27:21] (03CR) 10Majavah: [C:03+2] Revert "Temporarily remove lucaswerkmeister-wmde SSH key" [puppet] - 10https://gerrit.wikimedia.org/r/1140536 (owner: 10Lucas Werkmeister (WMDE)) [20:27:50] (03PS3) 10Dzahn: gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) [20:27:58] (03CR) 10Majavah: [C:03+2] "(verified in-person, Lucas is right next to me atm)" [puppet] - 10https://gerrit.wikimedia.org/r/1140536 (owner: 10Lucas Werkmeister (WMDE)) [20:29:53] (03CR) 10CI reject: [V:04-1] gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:32:30] !log sudo cumin 'O:config_master' 'run-puppet-agent' [20:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:48] !log jhuneidi@deploy1003 bvibber, jhuneidi: Backport for [[gerrit:1140229|Check for content validity before extracting license (T389125)]], [[gerrit:1140228|Fix localization for validation errors checking tabular data (T389126)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:33:52] T389125: Possible to break a page by passing in a chart that is not a JSON - 
https://phabricator.wikimedia.org/T389125 [20:33:52] T389126: Possible to break a page with invalid data - https://phabricator.wikimedia.org/T389126 [20:34:18] bvibber: ready for any checks [20:34:26] (03CR) 10Ssingh: [C:03+2] wmflib: get_lvs_class_hosts() to exclude test LVS hosts [puppet] - 10https://gerrit.wikimedia.org/r/1140530 (owner: 10Ssingh) [20:35:12] jeena: so far so good, let 'er rip [20:35:19] !log jhuneidi@deploy1003 bvibber, jhuneidi: Continuing with sync [20:35:51] (03PS4) 10Dzahn: gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) [20:38:01] (03CR) 10CI reject: [V:04-1] gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:38:32] (03PS1) 10JHathaway: jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1140538 [20:40:23] !log restart pybal on lvs1020 [20:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:23] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140229|Check for content validity before extracting license (T389125)]], [[gerrit:1140228|Fix localization for validation errors checking tabular data (T389126)]] (duration: 30m 35s) [20:45:27] T389125: Possible to break a page by passing in a chart that is not a JSON - https://phabricator.wikimedia.org/T389125 [20:45:27] T389126: Possible to break a page with invalid data - https://phabricator.wikimedia.org/T389126 [20:45:34] bvibber: finished! [20:45:44] thanks jeena ! 
[20:45:56] yw :) [20:45:58] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: remove lvs VIP [dns] - 10https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [20:46:17] !log ryankemper@dns1004 START - running authdns-update [20:47:14] (03PS5) 10Dzahn: gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) [20:47:50] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [20:48:40] (03CR) 10JHathaway: [C:03+2] jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1140538 (owner: 10JHathaway) [20:48:53] !log ryankemper@dns1004 END - running authdns-update [20:49:17] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate fixGlobalBlockWhitelist to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140482 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [20:49:24] (03CR) 10CI reject: [V:04-1] gerrit: rename host and replica_hosts array to primary/replica_vhost [puppet] - 10https://gerrit.wikimedia.org/r/1140534 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:50:23] !log T376151 [wdqs-internal lvs teardown] Surrendered `10.2.2.41/32` (eqiad wdqs-internal vip) and `10.2.1.41/32` (codfw wdqs-internal vip) from netbox interface [20:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:26] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [20:50:34] (03CR) 10Ryan Kemper: [C:03+2] "Surrendered `10.2.2.41/32` (eqiad wdqs-internal vip) and `10.2.1.41/32` (codfw wdqs-internal vip) from netbox interface" [dns] - 10https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [20:52:13] jouncebot: now [20:52:13] 
For the next 0 hour(s) and 7 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T2000) [20:53:57] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [20:54:53] !log T376151 [wdqs-internal lvs teardown] `sudo rm -fv /srv/config-master/pybal/eqiad/wdqs-internal && sudo rm -fv /srv/config-master/pybal/codfw/wdqs-internal` on `config-master[1,2]001` [20:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:32] ryankemper@cumin2002 netbox (PID 1554953) is awaiting input [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250501T2100) [21:00:44] is there going to be a MW deployment now? [21:00:58] I have a patch I would like to add to the window if that's the case. [21:01:00] !log T376151 [wdqs-internal lvs teardown] `sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/eqiad/wdqs-internal/wdqs` && `sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/eqiad/wdqs-internal/` [21:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:03] T376151: Cutover wdqs-internal to new split endpoints - https://phabricator.wikimedia.org/T376151 [21:01:28] !log T376151 [wdqs-internal lvs teardown] `sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/codfw/wdqs-internal/wdqs` && `sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/codfw/wdqs-internal/` [21:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:34] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove VIPs for wdqs-internal - ryankemper@cumin2002" [21:01:35] mutante: The train has reached group2 and seems to be good so I think you're clear. 
[21:01:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove VIPs for wdqs-internal - ryankemper@cumin2002" [21:01:40] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:43] (03PS1) 10Dzahn: Add throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140539 (https://phabricator.wikimedia.org/T382309) [21:01:59] dancy: well, this time I want something to be deployed:) [21:02:14] ooh, have you tried spiderpig? [21:02:28] no [21:02:40] * thcipriani excitement [21:02:56] See if you can log into https://spiderpig.wikimedia.org/ [21:03:01] !log T376151 [wdqs-internal lvs teardown] Declaring this officially done. No more irc log spam from me today :) [21:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:44] mutante: when you're done, could you ping me? I have a helmfile-only change I'd like to sneak through if possible [21:05:48] well, I wasn't planning to become a deployer on the spot right this moment [21:05:54] so go ahead with yours [21:07:37] mutante: happy to pair with you while you drive the spiderpig [21:08:31] I think first I would need a review of the patch regardless. 
[21:09:30] * thcipriani looks [21:10:13] mutante: sure, I can sneak the first of mine through pretty quickly [21:10:21] (03CR) 10Scott French: [C:03+2] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [21:10:45] (03CR) 10Thcipriani: [C:03+1] Add throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140539 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:13:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10784598 (10Jclark-ctr) Made some more adjustments will see how this effects cooling [21:13:35] ... waiting for puppet-agent on deploy1003 [21:14:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dzahn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140539 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:14:43] swfrench-wmf: mutante is actually deploying, FYI [21:14:44] (03PS1) 10Dwisehaupt: icinga: add new frack hosts for basic monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1140541 (https://phabricator.wikimedia.org/T386259) [21:15:16] thcipriani: ah, alright! I took "go ahead with yours" literally [21:15:25] I saw :) [21:15:27] (03Merged) 10jenkins-bot: Add throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140539 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:15:42] !log dzahn@deploy1003 Started scap sync-world: Backport for [[gerrit:1140539|Add throttle rule for Istanbul Hackathon 2025 (T382309)]] [21:15:49] :? [21:15:51] :/ [21:16:03] I think we're good [21:16:14] mutante: I think you're going to pick up my change, which is totally fine [21:17:49] swfrench-wmf: your change was for puppet? 
I only see one change going out via scap right now, FWIW. [21:18:00] mutante: actually, when you get to testservers, please do not proceed [21:18:45] I dont know what that means in practice. [21:18:57] it'll prompt you again to proceed [21:18:57] I am seeing a number that says 83% [21:19:53] swfrench-wmf: you can also follow along with the fun: https://spiderpig.wikimedia.org/jobs/32 [21:20:03] (03CR) 10Jgreen: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1140541 (https://phabricator.wikimedia.org/T386259) (owner: 10Dwisehaupt) [21:20:33] !log dzahn@deploy1003 dzahn: Backport for [[gerrit:1140539|Add throttle rule for Istanbul Hackathon 2025 (T382309)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:20:44] mutante: please hold here [21:21:16] (03PS1) 10Scott French: Revert "P:mediawiki::maintenance::purge_loginnotify: migrate to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1140542 (https://phabricator.wikimedia.org/T388536) [21:21:18] ok, not doing anything [21:22:10] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10784611 (10Jclark-ctr) Let me know what you would like to do i can remove drive you can reboot ` Server shows 8 drives NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 1.7T 0 disk ├─sda1... [21:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:22:57] mutante: cool, thanks! 
I'll give you a heads-up when it's safe to proceed [21:23:03] ack [21:23:22] (03CR) 10CI reject: [V:04-1] Revert "P:mediawiki::maintenance::purge_loginnotify: migrate to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1140542 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [21:23:50] do you guys know how to "manually clear a cache" when it's related to mediawiki-config? [21:24:12] (03PS2) 10Scott French: Revert "P:mediawiki::maintenance::purge_loginnotify: migrate to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1140542 (https://phabricator.wikimedia.org/T388536) [21:24:25] mutante: depends on what cache you're talking about [21:24:27] ah, it seems this means: [21:24:29] mwscript resetAuthenticationThrottle.php --wiki=metawiki --signup --ip 1.2.3.4 [21:24:32] something like this [21:24:46] it says I need this if the event is within 72 hours [21:24:56] https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [21:25:49] cool, we can do that post deploy [21:26:10] glad to hear that [21:26:43] (03CR) 10Scott French: [C:03+2] Revert "P:mediawiki::maintenance::purge_loginnotify: migrate to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1140542 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [21:30:27] (03PS2) 10Dzahn: firewall: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [21:31:42] (03PS3) 10Dzahn: firewall: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [21:32:38] (03CR) 10Dzahn: "adding ports like in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138995" [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:33:07] mutante: thcipriani: you're good to go. 
apologies for the delay - this was supposed to be fairly straightforward :) [21:34:00] swfrench-wmf: all good. apologies from me for confusing messaging. I didn't realize that I had already started it with that button. [21:34:14] clicks Yes [21:34:18] !log dzahn@deploy1003 dzahn: Continuing with sync [21:36:51] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10784640 (10Jclark-ctr) Replaced Failed Drive [21:37:23] hrm, realizing that resetAuthenticationThrottle does not support cidr ranges, so guess I'll run it 16 times after deploy [21:37:39] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10784641 (10Jclark-ctr) 05Open→03Resolved [21:38:22] thcipriani: hmmm.. they said "to be on the safe side" [21:38:40] meh, it's not hard to run [21:38:44] and there might be another IP .. arrr [21:40:59] !log dzahn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140539|Add throttle rule for Istanbul Hackathon 2025 (T382309)]] (duration: 25m 16s) [21:41:20] it says it's finished [21:43:18] (03PS4) 10Dzahn: firewall: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [21:49:24] (03PS1) 10Dzahn: Add another throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140543 (https://phabricator.wikimedia.org/T382309) [21:50:18] (03CR) 10Thcipriani: [C:03+1] Add another throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140543 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:50:53] (03PS2) 10Dzahn: Add another throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140543 (https://phabricator.wikimedia.org/T382309) [21:52:07] swfrench-wmf: doing another one [21:52:57] mutante: ack, 
thanks! I'm probably done for the day now that I have some sort of yaml -> json -> yaml round-tripping to debug :) [21:53:42] alright:) [21:53:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dzahn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140543 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:54:39] (03Merged) 10jenkins-bot: Add another throttle rule for Istanbul Hackathon 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140543 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [21:54:53] !log dzahn@deploy1003 Started scap sync-world: Backport for [[gerrit:1140543|Add another throttle rule for Istanbul Hackathon 2025 (T382309)]] [22:00:14] !log dzahn@deploy1003 dzahn: Backport for [[gerrit:1140543|Add another throttle rule for Istanbul Hackathon 2025 (T382309)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:02:41] !log dzahn@deploy1003 dzahn: Continuing with sync [22:07:16] (03PS1) 10Dzahn: gerrit: add another IP to throttling exempt for Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1140546 (https://phabricator.wikimedia.org/T382309) [22:07:35] PROBLEM - Host logging-hd2001 is DOWN: PING CRITICAL - Packet loss = 100% [22:09:26] !log dzahn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140543|Add another throttle rule for Istanbul Hackathon 2025 (T382309)]] (duration: 14m 32s) [22:09:35] RECOVERY - Host logging-hd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [22:13:24] (03PS1) 10Scott French: Remove references to wdqs-internal listenter in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140547 [22:14:07] (03PS2) 10Scott French: Remove references to wdqs-internal listenter in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140547 (https://phabricator.wikimedia.org/T376151) [22:19:59] (03CR) 10Scott French: [C:03+2] Remove references to wdqs-internal listenter in fixtures [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1140547 (https://phabricator.wikimedia.org/T376151) (owner: 10Scott French) [22:20:46] (03CR) 10Scott French: [C:03+2] "Going to go ahead and merge this test-only change, in order to unbreak CI." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140547 (https://phabricator.wikimedia.org/T376151) (owner: 10Scott French) [22:22:31] (03Merged) 10jenkins-bot: Remove references to wdqs-internal listenter in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140547 (https://phabricator.wikimedia.org/T376151) (owner: 10Scott French) [22:27:16] !log mwscript-k8s -- resetAuthenticationThrottle.php --wiki=aawiki --signup --ip= (x17) [22:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:30:08] (03CR) 10Dzahn: [C:03+2] gerrit: add another IP to throttling exempt for Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1140546 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [22:35:28] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:37:25] taavi: are you at the Hackathon right now?
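Editor's note on the 21:37:23 remark that resetAuthenticationThrottle does not accept CIDR ranges and would have to be run once per address: a minimal sketch of how that loop could be generated. It only builds command strings in the form quoted at 21:24:29 and does not execute anything. The function name, the default wiki, and the range 192.0.2.0/28 (a documentation-only network, chosen because a /28 has 16 addresses) are assumptions for illustration; the actual ranges are not given in the log.

```python
import ipaddress


def reset_throttle_commands(cidr, wiki="metawiki"):
    """Build one resetAuthenticationThrottle command per address in a CIDR range.

    Hypothetical helper: it returns command strings and never runs them.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    # Iterating a network yields every address in it, including the
    # network and broadcast addresses, so a /28 gives 16 commands.
    return [
        f"mwscript resetAuthenticationThrottle.php --wiki={wiki} --signup --ip {addr}"
        for addr in network
    ]


if __name__ == "__main__":
    # 192.0.2.0/28 is a placeholder range, not the hackathon venue's range.
    for command in reset_throttle_commands("192.0.2.0/28"):
        print(command)
```

The output can be reviewed before anything is pasted into a shell on the deployment host; swapping the command prefix for the mwscript-k8s form logged at 22:27:16 would be a one-line change to the format string.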
[22:42:14] (03PS5) 10Dzahn: firewall: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [22:46:21] (03PS6) 10Dzahn: firewall: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [22:46:58] (03CR) 10Aleksandar Mastilovic: "@brouberol@wikimedia.org I think it's time to merge this one, Gobblin has been successfully migrated." [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [22:49:43] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1138483/5427/" [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [22:52:00] (03PS7) 10Dzahn: firewall/nftables_throttling: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [22:54:07] (03CR) 10Dzahn: [C:03+2] firewall/nftables_throttling: temp add rule to allow Istanbul Hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [22:55:02] (03PS3) 10Aleksandar Mastilovic: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) [22:55:25] (03CR) 10Dzahn: [C:03+2] "This affects: gitlab, gerrit and durum. (as of today)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [23:13:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:47] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10784771 (10phaultfinder) [23:33:25] FIRING: SystemdUnitFailed: wmf_auto_restart_uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:43] FIRING: [149x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:40:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1140549 [23:40:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1140549 (owner: 10TrainBranchBot) [23:44:05] (03PS2) 10Scott French: mw:periodic_job:kubernetes: quote job description [puppet] - 10https://gerrit.wikimedia.org/r/1140548 [23:48:08] (03PS3) 10Scott French: P:mediawiki::maintenance::pageassessments: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536) [23:50:15] (03CR) 10Scott French: "FYI, I'm going to hold off on this until (1) Ic7032f3ff1c774990a9ef38241ab315eab0c573b or similar is merged and (2) the next-affected job 
" [puppet] - 10https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [23:53:28] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1140549 (owner: 10TrainBranchBot)