[00:01:08] (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [00:03:43] !log tstarling@deploy2002 Synchronized wmf-config: T344791 related cleanup (duration: 06m 22s) [00:03:47] T344791: Get rid of ss0- SameSite cookie prefix hack - https://phabricator.wikimedia.org/T344791 [00:05:13] (03PS20) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [00:08:57] (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [00:12:54] (03PS21) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [00:14:11] (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [00:25:23] (03PS23) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [00:26:53] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1141/co" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [00:32:33] (03PS1) 10Varnent: Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) [00:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991076 [00:38:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991076 (owner: 10TrainBranchBot) [00:43:19] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:10] (03CR) 10Varnent: [C: 03+1] Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790) (owner: 10Varnent) [01:01:10] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991076 (owner: 10TrainBranchBot) [01:10:21] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:17] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:52:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54761 and previous config saved to /var/cache/conftool/dbconfig/20240117-025232-ladsgroup.json [02:52:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 
[03:07:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P54762 and previous config saved to /var/cache/conftool/dbconfig/20240117-030738-ladsgroup.json [03:09:17] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:26] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) @RobH In the process of creating the RMA for the linecard in FPC0 on cr2-codfw the Juniper team did let me know that the linecard has only technical support and no hardware support for it so impossible to RMA it.... [03:21:27] (03PS1) 10Tim Starling: WMCS: add views for block and block_target tables [puppet] - 10https://gerrit.wikimedia.org/r/991105 (https://phabricator.wikimedia.org/T355034) [03:22:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P54763 and previous config saved to /var/cache/conftool/dbconfig/20240117-032245-ladsgroup.json [03:23:10] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:23:33] (03CR) 10Tim Starling: "I tested it by manually concatenating the bits to make an actual CREATE VIEW query, which I applied and verified on my new-schema test wik" [puppet] - 10https://gerrit.wikimedia.org/r/991105 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [03:37:52] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54764 and previous config saved to /var/cache/conftool/dbconfig/20240117-033751-ladsgroup.json [03:37:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [03:37:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:38:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:21:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [05:47:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [05:49:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance [05:49:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance [05:50:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance [05:50:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance [05:50:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2111 
(T354336)', diff saved to https://phabricator.wikimedia.org/P54765 and previous config saved to /var/cache/conftool/dbconfig/20240117-055056-marostegui.json [05:51:01] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [05:54:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T354336)', diff saved to https://phabricator.wikimedia.org/P54766 and previous config saved to /var/cache/conftool/dbconfig/20240117-055409-marostegui.json [06:09:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P54767 and previous config saved to /var/cache/conftool/dbconfig/20240117-060916-marostegui.json [06:24:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P54768 and previous config saved to /var/cache/conftool/dbconfig/20240117-062422-marostegui.json [06:39:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T354336)', diff saved to https://phabricator.wikimedia.org/P54769 and previous config saved to /var/cache/conftool/dbconfig/20240117-063929-marostegui.json [06:39:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [06:39:34] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [06:39:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [06:39:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T354336)', diff saved to https://phabricator.wikimedia.org/P54770 and previous config saved to /var/cache/conftool/dbconfig/20240117-063951-marostegui.json [06:43:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2123 (T354336)', diff saved to https://phabricator.wikimedia.org/P54771 and previous config saved to /var/cache/conftool/dbconfig/20240117-064304-marostegui.json [06:58:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P54772 and previous config saved to /var/cache/conftool/dbconfig/20240117-065811-marostegui.json [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T0700) [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:18] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:13:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P54773 and previous config saved to /var/cache/conftool/dbconfig/20240117-071317-marostegui.json [07:23:25] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:23:44] (03CR) 10Slyngshede: [C: 03+2] Ganeti memory pressure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:25:19] (03Merged) 10jenkins-bot: Ganeti memory pressure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:28:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T354336)', diff saved to https://phabricator.wikimedia.org/P54774 and previous config saved to /var/cache/conftool/dbconfig/20240117-072824-marostegui.json [07:28:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:28:29] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [07:28:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:28:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:28:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:29:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T354336)', diff saved to https://phabricator.wikimedia.org/P54775 and previous config saved to /var/cache/conftool/dbconfig/20240117-072902-marostegui.json [07:32:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T354336)', diff saved to https://phabricator.wikimedia.org/P54776 and previous 
config saved to /var/cache/conftool/dbconfig/20240117-073212-marostegui.json [07:47:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P54777 and previous config saved to /var/cache/conftool/dbconfig/20240117-074719-marostegui.json [07:50:10] (GanetiMemoryPressure) firing: Ganeti: High memory usage on ganeti5004:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:53:38] (PS2) Peter Fischer: enable page_rerender for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/990718 (https://phabricator.wikimedia.org/T351503) [08:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T0800). [08:00:05] pfischer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:02:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P54778 and previous config saved to /var/cache/conftool/dbconfig/20240117-080225-marostegui.json [08:04:43] (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825) (owner: Slyngshede) [08:04:51] o/ [08:09:11] (PS2) Muehlenhoff: admin: add wfan219 to deployment [puppet] - https://gerrit.wikimedia.org/r/985331 (https://phabricator.wikimedia.org/T353958) (owner: Herron) [08:09:21] (PS1) Marostegui: site.pp: Add new hosts [puppet] - https://gerrit.wikimedia.org/r/991269 (https://phabricator.wikimedia.org/T354210) [08:12:12] (CR) Muehlenhoff: [C: +2] admin: add wfan219 to deployment [puppet] - https://gerrit.wikimedia.org/r/985331 (https://phabricator.wikimedia.org/T353958) (owner: Herron) [08:14:55] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (MoritzMuehlenhoff) p:Triage→Medium [08:15:37] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (MoritzMuehlenhoff) In progress→Resolved @AnnWF I'v enabled your access and you should now be able to log into the deployment servers. If you run into any iss... 
[08:16:53] !log installing python-git security updates [08:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T354336)', diff saved to https://phabricator.wikimedia.org/P54779 and previous config saved to /var/cache/conftool/dbconfig/20240117-081731-marostegui.json [08:17:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:17:36] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:17:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54780 and previous config saved to /var/cache/conftool/dbconfig/20240117-081754-marostegui.json [08:19:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:19:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:20:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54781 and previous config saved to /var/cache/conftool/dbconfig/20240117-082001-ladsgroup.json [08:20:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [08:21:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54782 and previous config saved to /var/cache/conftool/dbconfig/20240117-082106-marostegui.json [08:21:16] (03CR) 10Arnaudb: [C: 03+1] site.pp: Add new hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/991269 (https://phabricator.wikimedia.org/T354210) (owner: 10Marostegui) [08:22:33] (03PS1) 10Slyngshede: Ganeti Memory Pressure: Add better dashboard. [alerts] - 10https://gerrit.wikimedia.org/r/991270 [08:22:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on db2194.codfw.wmnet with reason: debugging something before T343674 [08:22:49] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [08:23:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on db2194.codfw.wmnet with reason: debugging something before T343674 [08:24:19] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10MoritzMuehlenhoff) p:05Triage→03Medium @JWheeler-WMF Welcome to the Foundation. I have added you to the cn=wmf LDAP group which e.g. allows you to access https://turnilo.wikimedia.org... [08:27:37] (03CR) 10Marostegui: [C: 03+2] site.pp: Add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/991269 (https://phabricator.wikimedia.org/T354210) (owner: 10Marostegui) [08:30:27] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [08:30:48] (03CR) 10Filippo Giunchedi: "LGTM, the dashboard itself though should" [alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [08:31:49] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Docker [08:33:27] (03PS1) 10Muehlenhoff: Remove access for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/991271 [08:36:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P54783 and previous config saved to 
/var/cache/conftool/dbconfig/20240117-083613-marostegui.json [08:36:29] (03PS1) 10Muehlenhoff: Record LDAP access for jwheeler [puppet] - 10https://gerrit.wikimedia.org/r/991272 (https://phabricator.wikimedia.org/T355170) [08:36:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/991271 (owner: 10Muehlenhoff) [08:38:53] (03CR) 10DCausse: [C: 03+1] enable page_rerender for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990718 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [08:39:59] jouncebot: nowandnext [08:39:59] For the next 0 hour(s) and 20 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T0800) [08:39:59] In 0 hour(s) and 20 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T0900) [08:44:57] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10ayounsi) OK, that's what I initially thought. We didn't renew support on them because they're old linecards. I got mixed up with the [[ https://entitlementsearch.juniper.net/entitlementsearch/ | entitlementsearch ]] dash... 
[08:45:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990718 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [08:45:45] (03Merged) 10jenkins-bot: enable page_rerender for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990718 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [08:46:53] !log dcausse@deploy2002 Started scap: Backport for [[gerrit:990718|enable page_rerender for all wikis (T351503)]] [08:46:57] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [08:47:08] (03CR) 10Muehlenhoff: [C: 03+2] Record LDAP access for jwheeler [puppet] - 10https://gerrit.wikimedia.org/r/991272 (https://phabricator.wikimedia.org/T355170) (owner: 10Muehlenhoff) [08:48:21] !log dcausse@deploy2002 pfischer and dcausse: Backport for [[gerrit:990718|enable page_rerender for all wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:50:16] (03PS1) 10Slyngshede: Ganeti memory pressure: Alert is currently trigger on low usage. 
[alerts] - 10https://gerrit.wikimedia.org/r/991279 [08:50:19] !log dcausse@deploy2002 pfischer and dcausse: Continuing with sync [08:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P54784 and previous config saved to /var/cache/conftool/dbconfig/20240117-085119-marostegui.json [08:55:01] !log installing Python 2.7 security updates [08:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:17] (03PS1) 10Peter Fischer: Search update pipeline: enable page_rerender for all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/991282 (https://phabricator.wikimedia.org/T351503) [08:56:08] !log dcausse@deploy2002 Finished scap: Backport for [[gerrit:990718|enable page_rerender for all wikis (T351503)]] (duration: 09m 15s) [08:56:12] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [08:56:15] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: enable page_rerender for all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/991282 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [08:57:07] (03Merged) 10jenkins-bot: Search update pipeline: enable page_rerender for all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/991282 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [09:00:05] jnuche and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T0900). 
[09:00:27] hi, I'll deploy the train in 5mins [09:02:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2042.codfw.wmnet [09:04:05] (03PS1) 10Muehlenhoff: Switch mc2042 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991283 (https://phabricator.wikimedia.org/T349619) [09:05:42] (03CR) 10Muehlenhoff: [C: 03+2] Switch mc2042 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991283 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:06:01] (03PS1) 10Muehlenhoff: Switch mc2043 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991284 (https://phabricator.wikimedia.org/T349619) [09:06:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54785 and previous config saved to /var/cache/conftool/dbconfig/20240117-090626-marostegui.json [09:06:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:06:30] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:06:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:06:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T354336)', diff saved to https://phabricator.wikimedia.org/P54786 and previous config saved to /var/cache/conftool/dbconfig/20240117-090648-marostegui.json [09:07:14] (03PS2) 10Slyngshede: Ganeti Memory Pressure: Add better dashboard. 
[alerts] - 10https://gerrit.wikimedia.org/r/991270 [09:07:24] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991285 (https://phabricator.wikimedia.org/T354432) [09:07:26] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991285 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [09:08:09] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991285 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [09:08:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mc2042.codfw.wmnet [09:08:53] (03CR) 10Slyngshede: Ganeti Memory Pressure: Add better dashboard. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [09:10:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T354336)', diff saved to https://phabricator.wikimedia.org/P54787 and previous config saved to /var/cache/conftool/dbconfig/20240117-091000-marostegui.json [09:10:42] (03PS1) 10Muehlenhoff: Fix Hiera entry [puppet] - 10https://gerrit.wikimedia.org/r/991287 [09:12:10] (03CR) 10Muehlenhoff: [C: 03+2] Fix Hiera entry [puppet] - 10https://gerrit.wikimedia.org/r/991287 (owner: 10Muehlenhoff) [09:15:18] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.14 refs T354432 [09:15:22] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [09:21:33] !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.14 refs T354432 (duration: 06m 15s) [09:21:40] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [09:22:44] (03CR) 10Slyngshede: [C: 03+2] Modify password reset to take CN as username. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825) (owner: 10Slyngshede) [09:22:48] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Modify password reset to take CN as username. [software/bitu] - 10https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825) (owner: 10Slyngshede) [09:25:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P54788 and previous config saved to /var/cache/conftool/dbconfig/20240117-092507-marostegui.json [09:26:44] (03CR) 10Btullis: [V: 03+2 C: 03+2] Switch all spark images to use Java 8 as their base JDK/JRE [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989787 (https://phabricator.wikimedia.org/T354777) (owner: 10Btullis) [09:27:40] (03PS1) 10Majavah: P:toolforge::checker: remove webservice checks [puppet] - 10https://gerrit.wikimedia.org/r/991289 (https://phabricator.wikimedia.org/T313030) [09:29:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2042.codfw.wmnet [09:30:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mc2042.codfw.wmnet [09:31:52] 10SRE, 10Machine-Learning-Team: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10isarantopoulos) From my understanding so far and according to the [[ https://kubernetes.io/docs/reference/access-authn-authz/rbac/ | k8s docs ]] we need to create a `Role` sinc... [09:32:59] (03CR) 10Filippo Giunchedi: [C: 03+1] Ganeti memory pressure: Alert is currently trigger on low usage. [alerts] - 10https://gerrit.wikimedia.org/r/991279 (owner: 10Slyngshede) [09:34:17] (03CR) 10Filippo Giunchedi: Ganeti Memory Pressure: Add better dashboard. 
(031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [09:34:59] (03CR) 10Brouberol: [C: 03+2] Update statsd-exporter mappings for Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/990688 (https://phabricator.wikimedia.org/T343232) (owner: 10Aqu) [09:35:03] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:35:42] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:36:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991077 [09:36:21] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1042.eqiad.wmnet [09:36:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991077 (owner: 10TrainBranchBot) [09:36:42] (03CR) 10Slyngshede: [C: 03+2] Ganeti memory pressure: Alert is currently trigger on low usage. [alerts] - 10https://gerrit.wikimedia.org/r/991279 (owner: 10Slyngshede) [09:36:52] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1042 [puppet] - 10https://gerrit.wikimedia.org/r/990986 (owner: 10Effie Mouzeli) [09:37:55] (03Merged) 10jenkins-bot: Ganeti memory pressure: Alert is currently trigger on low usage. [alerts] - 10https://gerrit.wikimedia.org/r/991279 (owner: 10Slyngshede) [09:39:02] (03PS3) 10Slyngshede: Ganeti Memory Pressure: Add better dashboard. [alerts] - 10https://gerrit.wikimedia.org/r/991270 [09:39:09] (03CR) 10CI reject: [V: 04-1] Ganeti Memory Pressure: Add better dashboard. 
[alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [09:40:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P54789 and previous config saved to /var/cache/conftool/dbconfig/20240117-094015-marostegui.json [09:40:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1042.eqiad.wmnet [09:42:32] (03Abandoned) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2042 [puppet] - 10https://gerrit.wikimedia.org/r/990987 (owner: 10Effie Mouzeli) [09:44:54] (03PS4) 10Slyngshede: Ganeti Memory Pressure: Add better dashboard. [alerts] - 10https://gerrit.wikimedia.org/r/991270 [09:45:23] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [09:45:29] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet [09:46:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1043.eqiad.wmnet [09:47:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1043 [puppet] - 10https://gerrit.wikimedia.org/r/990988 (owner: 10Effie Mouzeli) [09:47:26] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1043 [puppet] - 10https://gerrit.wikimedia.org/r/990988 (owner: 10Effie Mouzeli) [09:49:03] (03CR) 10Slyngshede: Ganeti Memory Pressure: Add better dashboard. 
(031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [09:50:10] (GanetiMemoryPressure) firing: (2) Ganeti: High memory usage on ganeti5004:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [09:50:42] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2043 [puppet] - 10https://gerrit.wikimedia.org/r/990989 (owner: 10Effie Mouzeli) [09:51:17] (03PS1) 10Ilias Sarantopoulos: admin_ng: allow write access to pods in experimental ns in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/991292 (https://phabricator.wikimedia.org/T354516) [09:51:23] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet [09:51:38] (03PS2) 10Ilias Sarantopoulos: WIP - admin_ng: allow write access to pods in experimental ns in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/991292 (https://phabricator.wikimedia.org/T354516) [09:51:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [09:52:10] (03PS1) 10Effie Mouzeli: mc1049: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991293 [09:52:12] (03PS1) 10Effie Mouzeli: mc2049: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991294 [09:52:14] (03PS1) 10Effie Mouzeli: mc1050: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991295 [09:52:16] (03PS1) 10Effie Mouzeli: mc2050: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991296 [09:52:18] (03PS1) 10Effie Mouzeli: mc1051: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991297 [09:52:20] (03PS1) 10Effie Mouzeli: mc2051: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991298 [09:52:22] 
(03PS1) 10Effie Mouzeli: mc1052: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991299 [09:52:24] (03PS1) 10Effie Mouzeli: mc2052: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991300 [09:52:26] (03PS1) 10Effie Mouzeli: mc1053: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991301 [09:52:28] (03PS1) 10Effie Mouzeli: mc2053: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991302 [09:52:30] (03PS1) 10Effie Mouzeli: mc1054: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991303 [09:52:32] (03PS1) 10Effie Mouzeli: mc2054: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991304 [09:52:34] (03PS1) 10Effie Mouzeli: mc2055: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991305 [09:53:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1043.eqiad.wmnet [09:54:43] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10isarantopoulos) I started a patch for the above. I haven't found a way to do this only for ml-staging-codfw. From our side there is no issue (it may be pr... 
[09:54:47] (03PS2) 10Ilias Sarantopoulos: WIP - admin_ng: allow write access to pods in experimental ns in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/991292 (https://phabricator.wikimedia.org/T354516) [09:55:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T354336)', diff saved to https://phabricator.wikimedia.org/P54790 and previous config saved to /var/cache/conftool/dbconfig/20240117-095521-marostegui.json [09:55:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:55:26] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:55:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:55:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54791 and previous config saved to /var/cache/conftool/dbconfig/20240117-095544-marostegui.json [09:58:17] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:58:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:58:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991077 (owner: 10TrainBranchBot) [09:58:44] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54792 and previous config saved to /var/cache/conftool/dbconfig/20240117-095856-marostegui.json [10:00:10] (GanetiMemoryPressure) firing: (4) Ganeti: High memory usage (96.64%) on 
ganeti3008:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [10:01:08] (03CR) 10Filippo Giunchedi: [C: 03+1] Ganeti Memory Pressure: Add better dashboard. [alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [10:09:53] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release (1.9.6). See below for a list of changes: * Notable enhancements and fixes ** Prevent etherp... [10:10:07] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10LSobanski) [10:12:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2043.codfw.wmnet [10:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P54793 and previous config saved to /var/cache/conftool/dbconfig/20240117-101403-marostegui.json [10:14:09] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2043 [puppet] - 10https://gerrit.wikimedia.org/r/990989 (owner: 10Effie Mouzeli) [10:15:16] (03CR) 10Slyngshede: [C: 03+2] Ganeti Memory Pressure: Add better dashboard. [alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [10:16:29] (03Merged) 10jenkins-bot: Ganeti Memory Pressure: Add better dashboard. 
[alerts] - 10https://gerrit.wikimedia.org/r/991270 (owner: 10Slyngshede) [10:18:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2043.codfw.wmnet [10:24:37] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1044 [puppet] - 10https://gerrit.wikimedia.org/r/990990 (owner: 10Effie Mouzeli) [10:26:07] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [10:26:17] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:26:21] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [10:29:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P54795 and previous config saved to /var/cache/conftool/dbconfig/20240117-102909-marostegui.json [10:29:30] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10LSobanski) >>! In T354479#9462886, @Jelto wrote: >Currently the `prometheus::blackbox::check::http` does not support delaying probe down alerts, it's set to a fixed `2m`.... [10:29:42] (03PS1) 10Peter Fischer: Search update pipeline: enable page_rerender for all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/991307 (https://phabricator.wikimedia.org/T351503) [10:30:00] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: enable page_rerender for all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/991307 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [10:30:42] (03PS1) 10Slyngshede: Ganeti Memory Pressure: Use MemAvailable for usage calculation. [alerts] - 10https://gerrit.wikimedia.org/r/991308 [10:30:55] (03CR) 10FNegri: [C: 03+1] "I checked when these were disabled and found T221301 where it was decided it was not the best monitoring approach." 
[puppet] - 10https://gerrit.wikimedia.org/r/991289 (https://phabricator.wikimedia.org/T313030) (owner: 10Majavah) [10:30:59] (03Merged) 10jenkins-bot: Search update pipeline: enable page_rerender for all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/991307 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [10:33:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [10:43:45] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet [10:44:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T354336)', diff saved to https://phabricator.wikimedia.org/P54796 and previous config saved to /var/cache/conftool/dbconfig/20240117-104416-marostegui.json [10:44:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [10:44:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:44:24] (03PS5) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [10:44:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [10:44:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T354336)', diff saved to https://phabricator.wikimedia.org/P54797 and previous config saved to /var/cache/conftool/dbconfig/20240117-104438-marostegui.json [10:45:17] (03CR) 10Majavah: [C: 03+2] P:toolforge::checker: remove webservice checks [puppet] - 10https://gerrit.wikimedia.org/r/991289 (https://phabricator.wikimedia.org/T313030) (owner: 10Majavah) [10:48:26] (03PS6) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 
(https://phabricator.wikimedia.org/T341407) [10:48:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T354336)', diff saved to https://phabricator.wikimedia.org/P54798 and previous config saved to /var/cache/conftool/dbconfig/20240117-104851-marostegui.json [10:49:40] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet [10:51:20] (03PS1) 10Lucas Werkmeister (WMDE): Only build result entries for used wbsearchentities results [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991059 (https://phabricator.wikimedia.org/T355053) [10:51:38] (03PS7) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [10:51:41] (03PS1) 10Lucas Werkmeister (WMDE): Only build result entries for used wbsearchentities results [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991060 (https://phabricator.wikimedia.org/T355053) [10:53:39] (03PS1) 10Lucas Werkmeister (WMDE): Exclude qqq from monolingual text languages [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991061 (https://phabricator.wikimedia.org/T341409) [10:54:54] (03PS8) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [10:55:58] (03PS9) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [10:56:15] (03PS7) 10Lucas Werkmeister (WMDE): Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) [10:56:17] (03PS6) 10Lucas Werkmeister (WMDE): DNM: Stop using $wmgExtraLanguageNames in CommonSettings.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/628774 (https://phabricator.wikimedia.org/T263441) [10:56:19] (03PS6) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628775 (https://phabricator.wikimedia.org/T263441) [10:56:21] (03PS5) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628776 (https://phabricator.wikimedia.org/T263441) [10:59:58] (03CR) 10Filippo Giunchedi: [C: 03+1] Ganeti Memory Pressure: Use MemAvailable for usage calculation. [alerts] - 10https://gerrit.wikimedia.org/r/991308 (owner: 10Slyngshede) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1100) [11:00:14] (03CR) 10Slyngshede: [C: 03+2] Ganeti Memory Pressure: Use MemAvailable for usage calculation. [alerts] - 10https://gerrit.wikimedia.org/r/991308 (owner: 10Slyngshede) [11:01:25] (03Merged) 10jenkins-bot: Ganeti Memory Pressure: Use MemAvailable for usage calculation. [alerts] - 10https://gerrit.wikimedia.org/r/991308 (owner: 10Slyngshede) [11:03:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P54799 and previous config saved to /var/cache/conftool/dbconfig/20240117-110357-marostegui.json [11:04:36] (03CR) 10Phuedx: [C: 03+1] "+1 because the EventLoggingLegacyConverter class LGTM and is well-tested. 
I defer to others on whether the script in docroot/ is sufficien" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [11:07:52] (03PS3) 10Hnowlan: kubernetes: make 4 codfw jobrunner hosts k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791) [11:08:44] (03CR) 10Muehlenhoff: [C: 03+2] Configure ACLs for reprepro upload queue [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [11:09:18] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:50] !log stopped scanning script [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:53] !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30.txt` [11:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:56] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [11:11:53] (03CR) 10Hnowlan: [C: 03+1] mobileapps: switch service discovery to k8s only [deployment-charts] - 10https://gerrit.wikimedia.org/r/991043 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:13:06] (03CR) 10CI reject: [V: 04-1] Only build result entries for used wbsearchentities results [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991060 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas Werkmeister (WMDE)) [11:15:06] (03CR) 10Slyngshede: [C: 03+2] Netfilter: 
Remove exclude filter. [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:16:06] (03CR) 10Hnowlan: mobileapps: add Cassandra config support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [11:16:15] (03Merged) 10jenkins-bot: Netfilter: Remove exclude filter. [alerts] - 10https://gerrit.wikimedia.org/r/991003 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:19:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P54800 and previous config saved to /var/cache/conftool/dbconfig/20240117-111904-marostegui.json [11:20:10] (GanetiMemoryPressure) firing: (3) Ganeti: High memory usage (96.7%) on ganeti3008:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:22:20] (03CR) 10Marostegui: phabricator: use same db server regardless of DC of phab server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn) [11:23:26] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:24:08] (03PS1) 10Effie Mouzeli: memcached: switch memcached role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991312 [11:25:10] (GanetiMemoryPressure) resolved: Ganeti: High memory usage (99.46%) on ganeti6002:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=drmrs - 
https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:28:43] (03PS1) 10Muehlenhoff: aptrepo: Don't apply deb822 validation on uploaders file [puppet] - 10https://gerrit.wikimedia.org/r/991313 (https://phabricator.wikimedia.org/T115349) [11:29:50] (03CR) 10CI reject: [V: 04-1] aptrepo: Don't apply deb822 validation on uploaders file [puppet] - 10https://gerrit.wikimedia.org/r/991313 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [11:30:42] (03PS2) 10Muehlenhoff: aptrepo: Don't apply deb822 validation on uploaders file [puppet] - 10https://gerrit.wikimedia.org/r/991313 (https://phabricator.wikimedia.org/T115349) [11:32:29] 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10hnowlan) >>! In T355117#9462317, @jnuche wrote: > Every Tuesday morning, an automated process runs to presync new MW versions to hosts. It also takes care of deleting the older vers... [11:33:24] (03CR) 10Effie Mouzeli: [C: 03+1] mobileapps: switch service discovery to k8s only [deployment-charts] - 10https://gerrit.wikimedia.org/r/991043 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [11:33:42] (03PS5) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) [11:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T354336)', diff saved to https://phabricator.wikimedia.org/P54801 and previous config saved to /var/cache/conftool/dbconfig/20240117-113410-marostegui.json [11:34:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance [11:34:15] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:34:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance [11:34:30] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [11:34:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T354336)', diff saved to https://phabricator.wikimedia.org/P54802 and previous config saved to /var/cache/conftool/dbconfig/20240117-113432-marostegui.json [11:34:39] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [11:34:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: memcached [11:35:23] (03CR) 10Muehlenhoff: [C: 03+2] memcached: switch memcached role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991312 (owner: 10Effie Mouzeli) [11:37:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T354336)', diff saved to https://phabricator.wikimedia.org/P54803 and previous config saved to /var/cache/conftool/dbconfig/20240117-113745-marostegui.json [11:38:16] (03Abandoned) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980852 (owner: 10Effie Mouzeli) [11:38:20] 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10jnuche) >>! In T355117#9465167, @hnowlan wrote: >>>! In T355117#9462317, @jnuche wrote: >> Every Tuesday morning, an automated process runs to presync new MW versions to hosts. It a... 
[11:38:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1142/co" [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:38:30] (03Abandoned) 10Effie Mouzeli: (WIP2) mcrouter vanilla chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981461 (owner: 10Effie Mouzeli) [11:38:34] (03Abandoned) 10Effie Mouzeli: (WIP2) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/982785 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:38:38] (03PS12) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [11:38:47] (03Abandoned) 10Effie Mouzeli: (WIP) modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841 (owner: 10Effie Mouzeli) [11:38:51] (03Abandoned) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 (owner: 10Effie Mouzeli) [11:39:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:39:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: memcached [11:40:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet [11:40:36] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [11:44:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:45:50] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2044 [puppet] 
- 10https://gerrit.wikimedia.org/r/990991 (owner: 10Effie Mouzeli) [11:46:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1044.eqiad.wmnet [11:46:39] PROBLEM - Host mw2394 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:26] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1044 [puppet] - 10https://gerrit.wikimedia.org/r/990990 (owner: 10Effie Mouzeli) [11:49:54] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Don't apply deb822 validation on uploaders file [puppet] - 10https://gerrit.wikimedia.org/r/991313 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [11:52:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1044.eqiad.wmnet [11:52:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P54804 and previous config saved to /var/cache/conftool/dbconfig/20240117-115252-marostegui.json [11:54:12] (03PS1) 10Lucas Werkmeister (WMDE): Skip tainted references test:distnodiff script to fix Wikibase CI [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991062 (https://phabricator.wikimedia.org/T354881) [11:54:26] (03PS2) 10Lucas Werkmeister (WMDE): Only build result entries for used wbsearchentities results [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991060 (https://phabricator.wikimedia.org/T355053) [11:54:29] mw2394 apparently has a bad DIMM [11:54:56] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [11:55:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2044.codfw.wmnet [11:55:38] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1045 [puppet] - 10https://gerrit.wikimedia.org/r/990992 (owner: 10Effie Mouzeli) [11:55:55] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2044 
[puppet] - 10https://gerrit.wikimedia.org/r/990991 (owner: 10Effie Mouzeli) [11:59:06] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=mw2394.codfw.wmnet [12:00:01] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2394.codfw.wmnet with reason: Bad DIMM [12:00:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2044.codfw.wmnet [12:00:18] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2394.codfw.wmnet with reason: Bad DIMM [12:00:23] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=87150827-7740-4075-ada6-a08469c8b7f6) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bad DIMM ` mw... [12:00:34] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [12:00:37] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [12:01:16] 10SRE, 10serviceops, 10Patch-For-Review: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10CodeReviewBot) jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/204 prune old inactive branches as first step of staging a train [12:01:18] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Clement_Goubert) 05Resolved→03Open Reopening this task since hardware failures for this server happened very close to each other. `mw2394` crashed this morning due to a DIMM error `---------------------... 
[12:03:08] (03CR) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:06:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [12:07:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P54805 and previous config saved to /var/cache/conftool/dbconfig/20240117-120758-marostegui.json [12:12:25] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:12:32] (03CR) 10Jgiannelos: mobileapps: add Cassandra config support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [12:12:33] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:12:45] (03CR) 10Jgiannelos: mobileapps: add Cassandra config support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [12:12:59] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:14:35] (03PS1) 10Muehlenhoff: aptrepo: Drop one more deb822 check [puppet] - 10https://gerrit.wikimedia.org/r/991316 [12:17:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] mcrouter: add helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:20:30] (03PS4) 10WMDE-Fisch: [beta] Allow Cite events for reference previews baseline stats 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) [12:22:50] !log setting mw[2267,2282,2357,2395] inactive in advance of reimaging [12:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T354336)', diff saved to https://phabricator.wikimedia.org/P54806 and previous config saved to /var/cache/conftool/dbconfig/20240117-122305-marostegui.json [12:23:16] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:33:06] (03CR) 10Hnowlan: [C: 03+2] kubernetes: make 4 codfw jobrunner hosts k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:38:14] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2267.codfw.wmnet with OS bullseye [12:38:27] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2267.codfw.wmnet with OS bullseye [12:46:29] (03CR) 10Majavah: [C: 03+2] hieradata: drop cloud-support1-c-eqiad from LVS [puppet] - 10https://gerrit.wikimedia.org/r/990962 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [12:46:47] PROBLEM - Swift https backend on moss-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:47:40] !log removing vlan1119 interface on lvs1020 T355115 [12:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:44] T355115: Remove cloud-support1-c-eqiad VLAN - https://phabricator.wikimedia.org/T355115 [12:48:07] RECOVERY - Swift https backend on moss-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.057 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:52:10] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/989808 (owner: 10Ladsgroup) [12:53:16] (03PS2) 10Ladsgroup: mariadb: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/989808 [12:53:21] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/989808 (owner: 10Ladsgroup) [12:53:25] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/989808 (owner: 10Ladsgroup) [12:56:15] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2267.codfw.wmnet with reason: host reimage [12:58:18] !log removing vlan1119 interface on lvs1018 T355115 [12:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:12] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2267.codfw.wmnet with reason: host reimage [13:01:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:01:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:03:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:04:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:04:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [13:04:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [13:04:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1146:3312 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P54809 and previous config saved to /var/cache/conftool/dbconfig/20240117-130422-marostegui.json [13:04:27] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:05:40] (03PS1) 10Ladsgroup: Drop unused nagios sql pass [labs/private] - 10https://gerrit.wikimedia.org/r/991323 [13:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54810 and previous config saved to /var/cache/conftool/dbconfig/20240117-130639-marostegui.json [13:08:51] (03CR) 10Marostegui: [C: 03+1] Drop unused nagios sql pass [labs/private] - 10https://gerrit.wikimedia.org/r/991323 (owner: 10Ladsgroup) [13:09:44] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Drop unused nagios sql pass [labs/private] - 10https://gerrit.wikimedia.org/r/991323 (owner: 10Ladsgroup) [13:11:06] (03CR) 10Filippo Giunchedi: [C: 03+1] aptrepo: Drop one more deb822 check [puppet] - 10https://gerrit.wikimedia.org/r/991316 (owner: 10Muehlenhoff) [13:12:01] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Drop one more deb822 check [puppet] - 10https://gerrit.wikimedia.org/r/991316 (owner: 10Muehlenhoff) [13:12:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10MoritzMuehlenhoff) >>! In T354049#9459780, @ArthurTaylor wrote: > @JMeybohm I am able to login to https://wikitech.wikimedia.org/ with "Arthur taylor" Ack, ok!... 
[13:13:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2048:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2048 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:14:23] (03PS3) 10Muehlenhoff: dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/951079 [13:15:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10ArthurTaylor) Hi @MoritzMuehlenhoff , @Lucas_Werkmeister_WMDE is in a similar role, except that he has deployment access, which I don't ye... [13:18:41] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, also please let me know what you think of the idea of not alerting at all, which would be even easier" [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:19:13] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2267.codfw.wmnet with OS bullseye [13:19:25] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2267.codfw.wmnet with OS bullseye completed: - mw2267 (**PASS**) - Downt... [13:21:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P54811 and previous config saved to /var/cache/conftool/dbconfig/20240117-132145-marostegui.json [13:22:17] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10fgiunchedi) >>! 
In T354479#9464988, @LSobanski wrote: > @fgiunchedi what do you think, is that something that could be introduced? Definitely yes, I'm +1 on the general... [13:23:12] (03CR) 10Ayounsi: [C: 03+1] network: remove cloud-support1-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/991006 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [13:23:22] (03CR) 10Lucas Werkmeister (WMDE): Remove unused $wgExtraLanguageNames['qqq'] assignment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [13:23:27] (03CR) 10Slyngshede: [V: 03+1] P:puppet::client_bucket Start moving monitoring to Prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:23:35] if anyone’s in a reviewing mood, I’d love to have a +1 from someone on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/628773 [13:23:35] (03CR) 10Majavah: [C: 03+2] network: remove cloud-support1-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/991006 (https://phabricator.wikimedia.org/T355115) (owner: 10Majavah) [13:23:48] it’s an older config cleanup (should be a no-op) that I never carried over the finish line [13:24:00] but it just came up again in another context, so I added it to the backport window today :) [13:26:43] (03CR) 10Dzahn: phabricator: use same db server regardless of DC of phab server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn) [13:28:16] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Jhancock.wm) I will check on this this morning. 
thank you for depooling [13:30:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host snapshot1014.eqiad.wmnet [13:30:10] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:30:37] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:31:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:32:14] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:32:37] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:33:47] (03CR) 10DCausse: [C: 03+1] Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [13:34:46] RECOVERY - Check systemd state on snapshot1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54812 and previous config saved to /var/cache/conftool/dbconfig/20240117-133456-ladsgroup.json [13:34:58] thanks dcausse :) [13:35:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:35:08] yw! 
:) [13:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1014.eqiad.wmnet [13:36:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:36:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P54813 and previous config saved to /var/cache/conftool/dbconfig/20240117-133652-marostegui.json [13:40:26] (03PS1) 10Ayounsi: Spicerack: Add support for routed Ganeti [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) [13:45:10] (03PS1) 10Majavah: P:openstack: nova::compute: increase max conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/991346 (https://phabricator.wikimedia.org/T355222) [13:46:30] (03PS1) 10Muehlenhoff: mediawiki::cgroup: Enanble v1 cgroups on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) [13:47:38] (03CR) 10CI reject: [V: 04-1] mediawiki::cgroup: Enanble v1 cgroups on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) (owner: 10Muehlenhoff) [13:48:36] (03CR) 10Volans: "Nice! Almost ready" [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:49:22] (03PS1) 10Ayounsi: sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) [13:50:00] WMDE-Fisch: is it okay if I deploy the beta config change already? 
[13:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P54814 and previous config saved to /var/cache/conftool/dbconfig/20240117-135002-ladsgroup.json [13:50:04] since there are a lot of changes in the window ^^ [13:50:06] (03PS1) 10Arnaudb: orchestrator: skip validation to help puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991082 (https://phabricator.wikimedia.org/T355157) [13:50:23] (“and whose fault is that” me. it’s my fault) [13:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54815 and previous config saved to /var/cache/conftool/dbconfig/20240117-135158-marostegui.json [13:52:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:52:03] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:52:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:52:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:52:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:52:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T354336)', diff saved to https://phabricator.wikimedia.org/P54816 and previous config saved to /var/cache/conftool/dbconfig/20240117-135242-marostegui.json [13:53:58] (03CR) 10CI reject: [V: 04-1] sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) 
[13:54:03] (03PS2) 10Muehlenhoff: mediawiki::cgroup: Enanble v1 cgroups on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) [13:54:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T354336)', diff saved to https://phabricator.wikimedia.org/P54817 and previous config saved to /var/cache/conftool/dbconfig/20240117-135459-marostegui.json [13:55:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Starting gate-and-submit ahead of backport window." [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991059 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas Werkmeister (WMDE)) [13:55:47] (03PS2) 10Ayounsi: sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1400). Please do the needful. [14:00:05] Lucas_WMDE and WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] o/ [14:00:41] WMDE-Fisch: around? 
:) [14:01:21] I’ll start with the should-be-noop config change then [14:01:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [14:01:55] \o [14:02:22] (03Merged) 10jenkins-bot: Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [14:03:00] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:628773|Remove unused $wgExtraLanguageNames['qqq'] assignment (T263441)]] [14:03:05] T263441: Clean up $wgExtraLanguageNames production config - https://phabricator.wikimedia.org/T263441 [14:03:25] WMDE-Fisch: hi! I’ll do your config change next then [14:03:34] Lucas_WMDE: thx! [14:04:05] (03PS1) 10Daimona Eaytoy: beta: Stop setting $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991351 (https://phabricator.wikimedia.org/T347608) [14:05:05] (03PS1) 10Daimona Eaytoy: prod: Stop setting $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991352 (https://phabricator.wikimedia.org/T347608) [14:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P54818 and previous config saved to /var/cache/conftool/dbconfig/20240117-140509-ladsgroup.json [14:05:19] (03PS1) 10Slyngshede: D:service::docker Run Docker prune on pull. 
[puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) [14:05:37] (03CR) 10Marostegui: [C: 03+1] orchestrator: skip validation to help puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991082 (https://phabricator.wikimedia.org/T355157) (owner: 10Arnaudb) [14:06:26] (03CR) 10CI reject: [V: 04-1] D:service::docker Run Docker prune on pull. [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) (owner: 10Slyngshede) [14:07:02] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:628773|Remove unused $wgExtraLanguageNames['qqq'] assignment (T263441)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:39] * Lucas_WMDE tests [14:07:49] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) (owner: 10Slyngshede) [14:07:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:09:57] (03PS2) 10Slyngshede: D:service::docker Run Docker prune on pull. 
[puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) [14:10:00] (03CR) 10Lucas Werkmeister (WMDE): "FTR, when I tested this during the deployment, I didn’t notice any change in https://www.wikidata.org/wiki/Special:Translate with regard t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [14:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P54819 and previous config saved to /var/cache/conftool/dbconfig/20240117-141005-marostegui.json [14:10:47] (03PS1) 10Brouberol: spark-history: use an image using JDK8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991354 (https://phabricator.wikimedia.org/T354777) [14:14:04] (03PS5) 10Lucas Werkmeister (WMDE): [beta] Allow Cite events for reference previews baseline stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:14:07] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:628773|Remove unused $wgExtraLanguageNames['qqq'] assignment (T263441)]] (duration: 11m 07s) [14:14:11] T263441: Clean up $wgExtraLanguageNames production config - https://phabricator.wikimedia.org/T263441 [14:14:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] [beta] Allow Cite events for reference previews baseline stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:14:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:14:36] (03PS2) 10Ayounsi: Spicerack: Add support for routed Ganeti [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 
(https://phabricator.wikimedia.org/T300152) [14:15:05] (03Merged) 10jenkins-bot: [beta] Allow Cite events for reference previews baseline stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [14:15:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991059 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas Werkmeister (WMDE)) [14:15:50] WMDE-Fisch: should be deployed with the next beta config sync [14:15:56] (03Merged) 10jenkins-bot: Only build result entries for used wbsearchentities results [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991059 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas Werkmeister (WMDE)) [14:16:01] (03PS1) 10Cathal Mooney: Use vlan name to determine if server BGP peering should be added [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) [14:16:04] Lucas_WMDE: Perfect. Thanks! [14:16:07] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet [14:16:08] (03CR) 10Ayounsi: "Thx !" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [14:16:16] so probably in 10 minutes or so, judging by https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ [14:16:20] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:991059|Only build result entries for used wbsearchentities results (T355053)]] [14:16:22] (03CR) 10Arnaudb: [C: 03+2] orchestrator: skip validation to help puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991082 (https://phabricator.wikimedia.org/T355157) (owner: 10Arnaudb) [14:16:23] (there’s a job running that may or may not have picked up the update already) [14:16:24] T355053: Only create needed search result entries in wbsearchentities - https://phabricator.wikimedia.org/T355053 [14:16:30] (03PS2) 10Brouberol: spark-history: use an image using JDK8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991354 (https://phabricator.wikimedia.org/T354777) [14:17:06] okay, according to https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/479345/console it doesn’t have the update [14:17:44] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:991059|Only build result entries for used wbsearchentities results (T355053)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:48] testing [14:18:58] (03PS2) 10Ladsgroup: WMCS: add views for block and block_target tables [puppet] - 10https://gerrit.wikimedia.org/r/991105 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [14:19:01] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] WMCS: add views for block and block_target tables [puppet] - 10https://gerrit.wikimedia.org/r/991105 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [14:19:48] Lucas_WMDE: No stress, there are other things that I need to really test this. I just wanted to move forward here. 
:-) [14:19:58] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:20:00] ok :) [14:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P54820 and previous config saved to /var/cache/conftool/dbconfig/20240117-142015-ladsgroup.json [14:20:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:20:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:20:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:21:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Starting gate-and-submit already to save some time." [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991062 (https://phabricator.wikimedia.org/T354881) (owner: 10Lucas Werkmeister (WMDE)) [14:21:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Starting gate-and-submit already to save some time." 
[extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991060 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas Werkmeister (WMDE)) [14:22:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet [14:22:06] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet [14:22:50] (03PS1) 10Effie Mouzeli: cache/mcrouter: upgrade to 1.3.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991357 [14:23:41] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [14:23:49] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [14:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P54821 and previous config saved to /var/cache/conftool/dbconfig/20240117-142511-marostegui.json [14:25:44] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:991059|Only build result entries for used wbsearchentities results (T355053)]] (duration: 09m 23s) [14:25:48] T355053: Only create needed search result entries in wbsearchentities - https://phabricator.wikimedia.org/T355053 [14:26:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991062 (https://phabricator.wikimedia.org/T354881) (owner: 10Lucas Werkmeister (WMDE)) [14:26:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991060 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas 
Werkmeister (WMDE)) [14:26:25] (03PS1) 10Hubaishan: Set ShowRollbackConfirmation in arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991358 (https://phabricator.wikimedia.org/T355213) [14:26:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet [14:26:40] (03CR) 10Marostegui: "Yeah, that should work. However, you should test with the mysql client that the user/pass and the proxy works, just in case you find somet" [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn) [14:28:49] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10ABran-WMF) @BTullis I've merged this patch and stumbled upon: ` Jan 17 14:24:55 dborch1001 orchestrator[981119]: ReadTopologyInstance(dbstore1008.eqiad.wmnet:3311) show... [14:29:43] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10ABran-WMF) p:05High→03Medium [14:39:18] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:49] (03PS2) 10Lucas Werkmeister (WMDE): Exclude qqq from monolingual text languages [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991061 (https://phabricator.wikimedia.org/T341409) [14:40:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "+2ing ahead of backport to save some time" [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991061 (https://phabricator.wikimedia.org/T341409) (owner: 10Lucas Werkmeister (WMDE)) [14:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T354336)', diff saved to https://phabricator.wikimedia.org/P54822 and 
previous config saved to /var/cache/conftool/dbconfig/20240117-144018-marostegui.json [14:40:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:40:23] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:40:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:40:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54823 and previous config saved to /var/cache/conftool/dbconfig/20240117-144039-marostegui.json [14:41:26] (03PS2) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991357 [14:41:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mw2357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:41:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54824 and previous config saved to /var/cache/conftool/dbconfig/20240117-144156-marostegui.json [14:42:42] (03Merged) 10jenkins-bot: Skip tainted references test:distnodiff script to fix Wikibase CI [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991062 (https://phabricator.wikimedia.org/T354881) (owner: 10Lucas Werkmeister (WMDE)) [14:42:46] (03Merged) 10jenkins-bot: Only build result entries for used wbsearchentities results [extensions/Wikibase] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991060 (https://phabricator.wikimedia.org/T355053) (owner: 10Lucas Werkmeister (WMDE)) [14:43:11] !log lucaswerkmeister-wmde@deploy2002 Started scap: 
Backport for [[gerrit:991062|Skip tainted references test:distnodiff script to fix Wikibase CI (T354881)]], [[gerrit:991060|Only build result entries for used wbsearchentities results (T355053)]] [14:43:19] T354881: Wikibase CI broken due to Tainted Reference Node 18 indirect crypto dependency incompatibility - https://phabricator.wikimedia.org/T354881 [14:43:20] T355053: Only create needed search result entries in wbsearchentities - https://phabricator.wikimedia.org/T355053 [14:44:39] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:991062|Skip tainted references test:distnodiff script to fix Wikibase CI (T354881)]], [[gerrit:991060|Only build result entries for used wbsearchentities results (T355053)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:48:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2048:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2048 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:49:05] !log restarted rsyslog on kubernetes2048 [14:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:02] (03PS1) 10Filippo Giunchedi: monitoring: adjust default for cluster and group [puppet] - 10https://gerrit.wikimedia.org/r/991360 (https://phabricator.wikimedia.org/T333615) [14:51:04] (03PS1) 10Filippo Giunchedi: puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) [14:51:40] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:991062|Skip tainted references test:distnodiff script to fix Wikibase CI (T354881)]], [[gerrit:991060|Only build result entries for used wbsearchentities results (T355053)]] 
(duration: 08m 28s) [14:51:45] T354881: Wikibase CI broken due to Tainted Reference Node 18 indirect crypto dependency incompatibility - https://phabricator.wikimedia.org/T354881 [14:51:45] T355053: Only create needed search result entries in wbsearchentities - https://phabricator.wikimedia.org/T355053 [14:51:47] alright, one more backport [14:51:54] might slightly overrun the window, hopefully that’s okay [14:52:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991061 (https://phabricator.wikimedia.org/T341409) (owner: 10Lucas Werkmeister (WMDE)) [14:52:15] (zuul says ETA 8 min)) [14:52:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: cache::text [14:54:18] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:04] (03CR) 10CI reject: [V: 04-1] monitoring: adjust default for cluster and group [puppet] - 10https://gerrit.wikimedia.org/r/991360 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [14:55:07] (03PS1) 10Muehlenhoff: Switch cache/text to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991362 (https://phabricator.wikimedia.org/T349619) [14:55:23] (03CR) 10CI reject: [V: 04-1] puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [14:56:32] (03PS1) 10Filippo Giunchedi: pontoon: include profile::monitoring in base [puppet] - 10https://gerrit.wikimedia.org/r/991363 (https://phabricator.wikimedia.org/T333615) [14:56:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw2357:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:56:57] (03CR) 10Muehlenhoff: [C: 03+2] Switch cache/text to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991362 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:57:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P54826 and previous config saved to /var/cache/conftool/dbconfig/20240117-145702-marostegui.json [14:59:24] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [14:59:46] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1500) [15:00:13] (03PS6) 10Filippo Giunchedi: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:00:15] (03PS2) 10Filippo Giunchedi: monitoring: adjust default for cluster and group [puppet] - 10https://gerrit.wikimedia.org/r/991360 (https://phabricator.wikimedia.org/T333615) [15:00:17] (03PS2) 10Filippo Giunchedi: puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) [15:00:17] re jouncebot: I’m still deploying, sorry [15:00:19] (03PS2) 10Filippo Giunchedi: pontoon: include profile::monitoring in base [puppet] - 10https://gerrit.wikimedia.org/r/991363 (https://phabricator.wikimedia.org/T333615) [15:00:21] (03PS1) 10Filippo Giunchedi: icinga: remove ldap-icinga renmants [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) [15:00:23] (03PS1) 10Filippo Giunchedi: klaxon: 
bookworm/gunicorn compat [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) [15:00:56] (03Merged) 10jenkins-bot: Exclude qqq from monolingual text languages [extensions/Wikibase] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991061 (https://phabricator.wikimedia.org/T341409) (owner: 10Lucas Werkmeister (WMDE)) [15:01:23] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:991061|Exclude qqq from monolingual text languages (T341409)]] [15:01:25] (03CR) 10Muehlenhoff: klaxon: bookworm/gunicorn compat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:01:27] T341409: [TECH] Use LanguageNameUtils::ALL for monolingual text and lexemes - https://phabricator.wikimedia.org/T341409 [15:02:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:991061|Exclude qqq from monolingual text languages (T341409)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:03:13] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:03:16] (03CR) 10Brouberol: [C: 03+2] spark-history: use an image using JDK8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991354 (https://phabricator.wikimedia.org/T354777) (owner: 10Brouberol) [15:03:40] (03CR) 10Andrew Bogott: [C: 03+1] P:openstack: nova::compute: increase max conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/991346 (https://phabricator.wikimedia.org/T355222) (owner: 10Majavah) [15:03:43] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10BTullis) >>! In T355157#9465806, @ABran-WMF wrote: > @BTullis I've merged this patch and stumbled upon: > > ` > Jan 17 14:24:55 dborch1001 orchestrator[981119]: ReadTop... 
[15:04:10] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10Marostegui) All sections should have orchestrator grants [15:04:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [15:05:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [15:05:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [15:05:17] (03CR) 10CI reject: [V: 04-1] monitoring: adjust default for cluster and group [puppet] - 10https://gerrit.wikimedia.org/r/991360 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:05:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [15:05:24] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet [15:05:32] (03CR) 10CI reject: [V: 04-1] puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:05:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [15:07:04] (now in php-fpm-restart) [15:08:06] (03CR) 10Majavah: [C: 03+2] P:openstack: nova::compute: increase max conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/991346 (https://phabricator.wikimedia.org/T355222) (owner: 10Majavah) [15:09:23] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:991061|Exclude qqq from monolingual text languages (T341409)]] (duration: 07m 59s) [15:09:27] T341409: [TECH] Use LanguageNameUtils::ALL for monolingual text and lexemes - https://phabricator.wikimedia.org/T341409 [15:10:16] * Lucas_WMDE done [15:10:20] !log UTC afternoon backport+config window done [15:10:22] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] sorry for the delay [15:12:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P54827 and previous config saved to /var/cache/conftool/dbconfig/20240117-151208-marostegui.json [15:13:22] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [15:13:30] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Downtimed on Ici... [15:13:51] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10BTullis) >>! In T355157#9465960, @Marostegui wrote: > All sections should have orchestrator grants Apologies for being vague. Yes, all sections on both of these hosts a... [15:15:31] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [15:15:38] (03CR) 10FNegri: [C: 03+1] "Sounds reasonable, I left a comment in the task." 
[puppet] - 10https://gerrit.wikimedia.org/r/991346 (https://phabricator.wikimedia.org/T355222) (owner: 10Majavah) [15:16:36] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 200242096 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:16:57] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [15:20:28] (03PS1) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991368 [15:20:30] (03PS1) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 [15:20:33] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (10Marostegui) That should be fine, I can see both hosts in orchestrator fine. Not sure if something was done. 
Also please add root grants from cumin1002 cc @ABran-WMF [15:20:52] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s4.service,wmf-pt-kill@s6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:07] (03PS2) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module ("copy" change) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991368 [15:21:08] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:21:09] (03PS2) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 [15:21:56] (03CR) 10CI reject: [V: 04-1] base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [15:22:07] (03CR) 10Alexandros Kosiaris: "Modules using this one are:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [15:22:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cache::text [15:23:23] (03PS3) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 [15:23:26] (HelmReleaseBadStatus) firing: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:23:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1045.eqiad.wmnet [15:23:52] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:58] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1045 [puppet] - 10https://gerrit.wikimedia.org/r/990992 (owner: 10Effie Mouzeli) [15:24:15] (03CR) 10CI reject: [V: 04-1] base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [15:24:21] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2045 [puppet] - 10https://gerrit.wikimedia.org/r/990993 (owner: 10Effie Mouzeli) [15:25:40] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:07] etherpad is indeed down, looking [15:26:13] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:27:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54830 and previous config saved to /var/cache/conftool/dbconfig/20240117-152715-marostegui.json [15:27:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:27:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:27:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:27:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T354336)', diff saved to https://phabricator.wikimedia.org/P54831 and previous config saved to 
/var/cache/conftool/dbconfig/20240117-152737-marostegui.json [15:27:53] !log restart etherpad-lite.service on etherpad1003 [15:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1045.eqiad.wmnet [15:29:18] (JobUnavailable) firing: (2) Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T354336)', diff saved to https://phabricator.wikimedia.org/P54832 and previous config saved to /var/cache/conftool/dbconfig/20240117-152953-marostegui.json [15:30:00] that seems to have fixed it [15:30:40] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:49] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2045.codfw.wmnet [15:31:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2045 [puppet] - 10https://gerrit.wikimedia.org/r/990993 (owner: 10Effie Mouzeli) [15:32:16] thanks taavi! [15:35:13] (03CR) 10Jgiannelos: mobileapps: add Cassandra config support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [15:35:42] (03PS1) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [15:36:02] (03CR) 10Volans: [C: 03+1] "LGTM! 
Thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [15:36:39] (03CR) 10CI reject: [V: 04-1] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [15:38:31] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:38:37] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:39:24] (03Abandoned) 10Filippo Giunchedi: monitoring: adjust default for cluster and group [puppet] - 10https://gerrit.wikimedia.org/r/991360 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2045.codfw.wmnet [15:39:50] (03PS3) 10Filippo Giunchedi: puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) [15:39:54] (03PS3) 10Filippo Giunchedi: pontoon: include profile::monitoring in base [puppet] - 10https://gerrit.wikimedia.org/r/991363 (https://phabricator.wikimedia.org/T333615) [15:39:56] (03PS2) 10Filippo Giunchedi: icinga: remove ldap-icinga renmants [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) [15:39:58] (03PS2) 10Filippo Giunchedi: klaxon: bookworm/gunicorn compat [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) [15:41:00] (03CR) 10Filippo Giunchedi: klaxon: bookworm/gunicorn compat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:42:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:43:13] (03CR) 10Filippo Giunchedi: [C: 03+2] 
klaxon: bookworm/gunicorn compat [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P54833 and previous config saved to /var/cache/conftool/dbconfig/20240117-154459-marostegui.json [15:45:01] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:45:24] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:45:26] (03CR) 10Btullis: [C: 03+1] "This looks good to me. I'd be happy for the change to be deployed any time convenient to the traffic team." [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [15:45:48] (03PS3) 10Filippo Giunchedi: klaxon: bookworm/gunicorn compat [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) [15:45:51] (03CR) 10Filippo Giunchedi: [V: 03+2] klaxon: bookworm/gunicorn compat [puppet] - 10https://gerrit.wikimedia.org/r/991365 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:46:42] (03PS4) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 [15:48:11] (HelmReleaseBadStatus) resolved: Helm release miscweb/wikiworkshop on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:49:02] (03PS4) 10Filippo Giunchedi: puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) [15:49:04] (03PS4) 10Filippo Giunchedi: pontoon: 
include profile::monitoring in base [puppet] - 10https://gerrit.wikimedia.org/r/991363 (https://phabricator.wikimedia.org/T333615) [15:49:06] (03PS3) 10Filippo Giunchedi: icinga: remove ldap-icinga remnants [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) [15:49:27] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [15:49:34] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... [15:49:52] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [15:50:01] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [15:50:09] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1143/co" [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:51:11] (03CR) 10Volans: [C: 03+1] "To be tested but looks good. It depends on a spicerack release with the routed support." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [15:53:49] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1144/co" [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:54:05] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Testing new version of Superset [15:54:19] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 7 days, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Testing new version of Superset [15:54:45] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Testing new version of Superset [15:54:48] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Testing new version of Superset [15:56:29] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, other than that LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [15:58:14] (03PS3) 10Ayounsi: sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) [15:58:26] (03PS10) 10Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) [15:59:24] (03CR) 10CI reject: [V: 04-1] mysqld-exporter-config: simplify manual runs [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [15:59:54] (03PS2) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 
(https://phabricator.wikimedia.org/T355237) [16:00:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P54834 and previous config saved to /var/cache/conftool/dbconfig/20240117-160005-marostegui.json [16:00:48] (03CR) 10CI reject: [V: 04-1] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [16:02:25] (03CR) 10Jgiannelos: "With `cassandra_client.enabled=True` helm lint locally displays the expected configmap (also no failures). For the rest I will defer to Jo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [16:04:59] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Jhancock.wm) I moved the DIMM to a different slot and the error moved with it. I've put in a dispatch with Dell. SR183504113 [16:05:47] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10Jelto) p:05Medium→03High I'll raise the priority to high as we see more frequent alerts with the old version of etherpad. Especially with version `1.9.0` some rac... 
[16:06:54] (03PS1) 10Hnowlan: changeprop-jobqueue: disable ThumbnailRender on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/991377 (https://phabricator.wikimedia.org/T349796) [16:08:09] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Clement_Goubert) [16:08:43] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10Clement_Goubert) [16:08:59] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: disable ThumbnailRender on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/991377 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:09:11] 10SRE, 10MW-on-K8s, 10serviceops: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [16:09:16] (03PS2) 10Hnowlan: changeprop-jobqueue: disable ThumbnailRender on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/991377 (https://phabricator.wikimedia.org/T349796) [16:09:20] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Clement_Goubert) Reverting to bare metal in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991377 [16:10:15] (03PS1) 10Hubaishan: Restrict pagequality-validate right to patroller in arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) [16:11:10] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: disable ThumbnailRender on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/991377 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:11:16] (03PS3) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [16:12:08] (03Merged) 10jenkins-bot: changeprop-jobqueue: disable ThumbnailRender on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/991377 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:12:10] (03CR) 10CI reject: [V: 04-1] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [16:13:19] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:13:50] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:13:58] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:14:24] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:15:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T354336)', diff saved to https://phabricator.wikimedia.org/P54835 and previous config saved to /var/cache/conftool/dbconfig/20240117-161512-marostegui.json [16:15:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:15:19] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:15:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:15:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T354336)', diff saved to https://phabricator.wikimedia.org/P54836 and previous config saved to /var/cache/conftool/dbconfig/20240117-161534-marostegui.json [16:17:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T354336)', diff saved to https://phabricator.wikimedia.org/P54837 and previous 
config saved to /var/cache/conftool/dbconfig/20240117-161746-marostegui.json [16:17:49] (03PS4) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [16:19:30] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: switch service discovery to k8s only [deployment-charts] - 10https://gerrit.wikimedia.org/r/991043 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [16:20:26] (03Merged) 10jenkins-bot: mobileapps: switch service discovery to k8s only [deployment-charts] - 10https://gerrit.wikimedia.org/r/991043 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [16:22:31] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:23:25] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:23:32] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... 
[16:23:48] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [16:23:56] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [16:25:21] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:32:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P54838 and previous config saved to /var/cache/conftool/dbconfig/20240117-163252-marostegui.json [16:37:39] (03PS1) 10Andrea Denisse: grafana: Ensure the grafana2001 hosts uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991386 (https://phabricator.wikimedia.org/T352665) [16:38:53] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: Ensure the grafana2001 hosts uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991386 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [16:39:04] (03CR) 10Andrea Denisse: [C: 03+2] grafana: Ensure the grafana2001 hosts uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991386 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [16:39:12] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:39:25] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... 
[16:39:29] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [16:39:39] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [16:40:57] !log jforrester@deploy2002 Started deploy [integration/docroot@f08a107]: I74613426e76b9d1a92482d024fcd326463496d88 for T354310 [16:41:01] T354310: Sunset WikimediaUI Base - https://phabricator.wikimedia.org/T354310 [16:41:04] !log jforrester@deploy2002 Finished deploy [integration/docroot@f08a107]: I74613426e76b9d1a92482d024fcd326463496d88 for T354310 (duration: 00m 07s) [16:42:23] !log denisse@cumin2002 START - Cookbook sre.hosts.reimage for host grafana2001.codfw.wmnet with OS bookworm [16:47:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P54839 and previous config saved to /var/cache/conftool/dbconfig/20240117-164759-marostegui.json [16:48:00] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:48:08] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... 
[16:48:46] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2357.codfw.wmnet with OS bullseye [16:48:55] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2357.codfw.wmnet with OS bullseye [16:52:43] (03PS1) 10Jdlrobson: Update checkboxHack target node [skins/MinervaNeue] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991339 (https://phabricator.wikimedia.org/T354315) [16:57:40] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on grafana2001.codfw.wmnet with reason: host reimage [17:00:07] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2395.codfw.wmnet with OS bullseye [17:00:16] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2395.codfw.wmnet with OS bullseye [17:00:31] (03PS4) 10Ayounsi: sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) [17:01:03] (03CR) 10Ayounsi: sre.ganeti: add support for routed Ganeti (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:02:12] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on grafana2001.codfw.wmnet with reason: host reimage [17:03:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T354336)', diff saved to https://phabricator.wikimedia.org/P54840 and previous config saved to /var/cache/conftool/dbconfig/20240117-170305-marostegui.json [17:03:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:03:09] 
T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:03:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:03:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T354336)', diff saved to https://phabricator.wikimedia.org/P54841 and previous config saved to /var/cache/conftool/dbconfig/20240117-170327-marostegui.json [17:05:02] (03CR) 10CI reject: [V: 04-1] sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:05:33] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2357.codfw.wmnet with reason: host reimage [17:05:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T354336)', diff saved to https://phabricator.wikimedia.org/P54842 and previous config saved to /var/cache/conftool/dbconfig/20240117-170539-marostegui.json [17:08:15] (03PS25) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [17:08:46] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2357.codfw.wmnet with reason: host reimage [17:11:00] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:13:07] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:16:03] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2395.codfw.wmnet with reason: host reimage [17:16:23] (03PS1) 10Andrea Denisse: grafana: Ensure the grafana1002 host uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) [17:18:30] !log kamila@deploy2002 helmfile [codfw] START 
helmfile.d/services/mobileapps: apply [17:19:07] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Jdforrester-WMF) [17:19:13] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Jdforrester-WMF) p:05High→03Unbreak! [17:19:19] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host grafana2001.codfw.wmnet with OS bookworm [17:19:20] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2395.codfw.wmnet with reason: host reimage [17:19:46] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:20:42] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Clement_Goubert) As we can see on a [[ https://logstash.wikimedia.org/goto/aa282fd4b9efeb635c3767593fb2f58c | wider log view ]] errors coincide with tr... 
[17:20:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P54843 and previous config saved to /var/cache/conftool/dbconfig/20240117-172045-marostegui.json [17:21:24] (03PS5) 10Ayounsi: sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) [17:22:34] (03PS24) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [17:23:11] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:10] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1147/co" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [17:25:45] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Clement_Goubert) Thank you @Jhancock.wm [17:25:53] (03PS1) 10Kamila Součková: service catalog: remove mw-api-async-transition [puppet] - 10https://gerrit.wikimedia.org/r/991394 (https://phabricator.wikimedia.org/T350846) [17:29:03] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2357.codfw.wmnet with OS bullseye [17:29:13] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2357.codfw.wmnet with OS bullseye completed: - mw2357 (**PASS**) - Downtimed on Icinga/Alertma... 
[17:29:57] (03CR) 10Volans: [C: 03+1] "LGTM, depends on a new spicerack release" [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:30:03] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10kamila) [17:30:13] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10kamila) 05Open→03Resolved All traffic is now going to k8s \o/ I will keep an eye on php workers saturation, but it should be fine, so I'm calling it resolved. [17:30:21] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10KFrancis) Hi all, the NDA has been signed. Thank you! [17:31:27] (03CR) 10Kamila Součková: "Should I clean this up now, or do we want to reuse it for something else?" [puppet] - 10https://gerrit.wikimedia.org/r/991394 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [17:32:00] (03CR) 10Herron: [V: 03+1] grafana: add dashboard datasource usage (graphite) exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [17:35:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P54844 and previous config saved to /var/cache/conftool/dbconfig/20240117-173552-marostegui.json [17:39:06] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2395.codfw.wmnet with OS bullseye [17:39:14] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2395.codfw.wmnet with OS bullseye completed: - mw2395 (**PASS**) - 
Downtimed on Icinga/Alertma... [17:42:49] (03PS6) 10Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 [17:43:32] (03CR) 10Htriedman: "added one extra page for redaction testing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [17:50:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T354336)', diff saved to https://phabricator.wikimedia.org/P54845 and previous config saved to /var/cache/conftool/dbconfig/20240117-175059-marostegui.json [17:51:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [17:51:03] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:51:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [17:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T354336)', diff saved to https://phabricator.wikimedia.org/P54846 and previous config saved to /var/cache/conftool/dbconfig/20240117-175120-marostegui.json [17:51:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:53:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T354336)', diff saved to https://phabricator.wikimedia.org/P54847 and previous config saved to /var/cache/conftool/dbconfig/20240117-175338-marostegui.json [17:54:21] (03PS4) 10Urbanecm: beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) [17:54:25] jouncebot: nowandnext [17:54:25] No deployments scheduled for the next 0 hour(s) and 5 
minute(s) [17:54:25] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1800) [17:54:32] (03CR) 10Urbanecm: [C: 03+2] beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [17:54:36] 5 minutes should be enough [17:55:21] (03Merged) 10jenkins-bot: beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [17:55:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [17:57:39] * urbanecm done [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1800) [18:08:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P54848 and previous config saved to /var/cache/conftool/dbconfig/20240117-180844-marostegui.json [18:16:59] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10jnuche) I see the errors still showing up in prod. By looking at Scap's code, I think redeploying the train to group1 should make this change go to pro... 
[18:23:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P54849 and previous config saved to /var/cache/conftool/dbconfig/20240117-182351-marostegui.json [18:32:36] (03PS1) 10Jeena Huneidi: Disable anything that uses 'extension1' Train-dev doesn't know about this cluster [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991406 [18:34:26] (03CR) 10Jeena Huneidi: [C: 03+2] Disable anything that uses 'extension1' Train-dev doesn't know about this cluster [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991406 (owner: 10Jeena Huneidi) [18:35:26] (03Merged) 10jenkins-bot: Disable anything that uses 'extension1' Train-dev doesn't know about this cluster [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/991406 (owner: 10Jeena Huneidi) [18:37:20] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10jnuche) After digging a bit more, I think a simpler `sync-world` with a few flags will be enough to deploy this Helm config change. I'll try it in a bi... 
[18:38:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T354336)', diff saved to https://phabricator.wikimedia.org/P54850 and previous config saved to /var/cache/conftool/dbconfig/20240117-183857-marostegui.json [18:39:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:39:13] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:39:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:39:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1229.eqiad.wmnet with reason: Maintenance [18:39:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1229.eqiad.wmnet with reason: Maintenance [18:39:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T354336)', diff saved to https://phabricator.wikimedia.org/P54851 and previous config saved to /var/cache/conftool/dbconfig/20240117-183944-marostegui.json [18:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T354336)', diff saved to https://phabricator.wikimedia.org/P54852 and previous config saved to /var/cache/conftool/dbconfig/20240117-184156-marostegui.json [18:48:02] going to deploy a K8s config change to prod in the next few mins for https://phabricator.wikimedia.org/T355243 [18:52:54] !log jnuche@deploy2002 Started scap: deploying K8s config changes from T355243 [18:52:58] T355243: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 [18:54:37] !log jnuche@deploy2002 Finished scap: deploying K8s config changes from T355243 (duration: 01m 42s) [18:57:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to 
https://phabricator.wikimedia.org/P54853 and previous config saved to /var/cache/conftool/dbconfig/20240117-185703-marostegui.json [19:00:04] jnuche and jeena: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1900) [19:00:05] jnuche and jeena: #bothumor I ❤ Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T1900). [19:00:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:01:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:04:43] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10jnuche) Error rate seems unaffected after deploying the configuration change: https://logstash.wikimedia.org/goto/5f6d40ce5bbf6313e2fcec6ccc28ea51 :( P... [19:10:14] (03PS1) 10WMDE-Fisch: Fix state bleeding from one into the next [extensions/Kartographer] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991343 (https://phabricator.wikimedia.org/T355044) [19:12:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P54854 and previous config saved to /var/cache/conftool/dbconfig/20240117-191209-marostegui.json [19:13:53] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [19:14:19] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!"
[puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [19:14:35] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10jnuche) I have a commitment soon and need to stop for the day. I've asked the backup conductor @jeena to follow up on this. [19:14:44] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [19:15:42] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/991363 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [19:16:10] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [19:16:18] (03CR) 10Herron: [V: 03+1 C: 03+2] "thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [19:18:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10Dzahn) The `restricted` group gives access to the following hosts: - mwmaint* - maintenance - this is where jobs/timers run actions on wik... 
[19:18:32] (03PS12) 10Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [19:20:33] (03CR) 10Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [19:22:16] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10Dzahn) 05Open→03In progress [19:22:58] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10jeena) isn't a deployment of changeprop in kubernetes needed here? I don't think scap does this. [19:23:02] (03CR) 10Dzahn: [C: 03+1] "ready to merge, per https://phabricator.wikimedia.org/T354276#9466549" [puppet] - 10https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276) (owner: 10Andrea Denisse) [19:25:25] (03PS2) 10Dzahn: admin: Add dimakoushha to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276) (owner: 10Andrea Denisse) [19:25:53] (03CR) 10Andrea Denisse: [C: 03+2] admin: Add dimakoushha to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276) (owner: 10Andrea Denisse) [19:26:00] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] admin: Add dimakoushha to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276) (owner: 10Andrea Denisse) [19:26:02] (03CR) 10Dzahn: [C: 03+1] "PS2: fixed rebase conflict with I3e148bda50a941" [puppet] - 10https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276) (owner: 10Andrea Denisse) [19:26:55] (03PS1) 10Fabfur: Add missing netmapper for 
abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/991409 (https://phabricator.wikimedia.org/T355158) [19:27:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T354336)', diff saved to https://phabricator.wikimedia.org/P54855 and previous config saved to /var/cache/conftool/dbconfig/20240117-192715-marostegui.json [19:27:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1233.eqiad.wmnet with reason: Maintenance [19:27:31] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:27:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1233.eqiad.wmnet with reason: Maintenance [19:27:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T354336)', diff saved to https://phabricator.wikimedia.org/P54856 and previous config saved to /var/cache/conftool/dbconfig/20240117-192737-marostegui.json [19:29:10] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1148/co" [puppet] - 10https://gerrit.wikimedia.org/r/991409 (https://phabricator.wikimedia.org/T355158) (owner: 10Fabfur) [19:29:19] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:29:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T354336)', diff saved to https://phabricator.wikimedia.org/P54857 and previous config saved to /var/cache/conftool/dbconfig/20240117-192953-marostegui.json [19:37:43] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix state bleeding from one into the next [extensions/Kartographer] (wmf/1.42.0-wmf.14) - 
10https://gerrit.wikimedia.org/r/991343 (https://phabricator.wikimedia.org/T355044) (owner: 10WMDE-Fisch) [19:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P54858 and previous config saved to /var/cache/conftool/dbconfig/20240117-194500-marostegui.json [20:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P54859 and previous config saved to /var/cache/conftool/dbconfig/20240117-200006-marostegui.json [20:04:23] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [20:04:30] (03PS1) 10Subramanya Sastry: Re-enable: "Temporarily disable isPreview in Parsoid's rendering"" [extensions/DiscussionTools] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991344 [20:05:21] !log LDAP - added uid=dimakoushha to groups wmde and nda (T354276) [20:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:43] T354276: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 [20:05:43] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Docker [20:06:24] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) @Andrew before i change from PerformancePerWatt to PerformanceOptimized do you have any hesitations with that change? Thank you for logs p... [20:07:56] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10Dzahn) 05In progress→03Resolved Hi @Dima you have been added to the requested groups wmde and nda. 
Upon your next Gerrit login you should see some new permissions for WMDE repos and t... [20:08:53] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10Dzahn) a:05Arrbee→03JWheeler-WMF [20:09:20] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10Dzahn) 05Open→03In progress [20:15:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T354336)', diff saved to https://phabricator.wikimedia.org/P54860 and previous config saved to /var/cache/conftool/dbconfig/20240117-201513-marostegui.json [20:15:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:15:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:15:35] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:21:18] (03PS1) 10Gehel: microsites: simplify query service UI configuration [puppet] - 10https://gerrit.wikimedia.org/r/991411 (https://phabricator.wikimedia.org/T354658) [20:21:20] (03PS1) 10Gehel: microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) [20:25:26] (03CR) 10CI reject: [V: 04-1] microsites: simplify query service UI configuration [puppet] - 10https://gerrit.wikimedia.org/r/991411 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [20:25:34] (03CR) 10CI reject: [V: 04-1] microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [20:26:03] urbanecm: once you dropped the rows from beta cluster, please ping me to optimize tables there and measure the difference [20:26:42] Amir1: will do. 
So far I only dropped one row to verify it works :D [20:27:06] it'd be hard to measure impact of the one row drop :P [20:27:20] No doubts :) [20:27:40] (03PS2) 10Gehel: microsites: simplify query service UI configuration [puppet] - 10https://gerrit.wikimedia.org/r/991411 (https://phabricator.wikimedia.org/T354658) [20:27:42] (03PS2) 10Gehel: microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) [20:32:01] (03CR) 10CI reject: [V: 04-1] microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [20:36:52] (03PS3) 10Gehel: microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) [20:44:34] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Andrew) >>! In T353408#9467046, @Jclark-ctr wrote: > @Andrew before i change from PerformancePerWatt to PerformanceOptimized do you have any hesitations... [20:56:00] (03PS1) 10Kosta Harlan: ipoid: Bump version and schedule multiple runs [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) [20:56:52] (03PS2) 10Kosta Harlan: ipoid: Bump version and schedule multiple runs [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I ❤ Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T2100). [21:00:05] Jdlrobson and subbu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] (03CR) 10Kosta Harlan: ipoid: Bump version and schedule multiple runs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:00:57] o/ [21:01:33] (03CR) 10Tchanders: [C: 03+1] ipoid: Bump version and schedule multiple runs [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:02:33] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version and schedule multiple runs [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:02:36] (03CR) 10Tchanders: [C: 03+1] ipoid: Bump version and schedule multiple runs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:02:38] (03CR) 10Dzahn: "as mentioned on I52acf31ab2c239 I think there is going to be a problem here with certificates if you introduce another domain level. 
The w" [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [21:03:20] (03CR) 10Jdlrobson: [C: 03+1] thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) (owner: 10Anzx) [21:03:30] (03Merged) 10jenkins-bot: ipoid: Bump version and schedule multiple runs [deployment-charts] - 10https://gerrit.wikimedia.org/r/991416 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:04:50] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [21:05:33] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [21:05:48] (03CR) 10Herron: [C: 04-1] "Logstash and Benthos sharing a consumer group would allow Benthos to increment logstash's topic offsets, which could lead to dropped logs " [puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: 10Cwhite) [21:06:12] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [21:06:50] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [21:07:26] (03CR) 10Herron: [V: 03+1 C: 03+2] thanos::rule: set reload service to stopped [puppet] - 10https://gerrit.wikimedia.org/r/990126 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [21:07:28] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [21:07:49] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [21:08:10] (03PS1) 10Kosta Harlan: ipoid: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/991417 [21:08:36] TheresNoTime: urbanecm RoanKattouw are either of you available for a deploy? 
[21:08:47] (03PS2) 10Kosta Harlan: ipoid: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/991417 (https://phabricator.wikimedia.org/T344941) [21:10:41] Jdlrobson: I will be in about 5 minutes [21:10:51] thanks RoanKattouw i can wait 5 :) [21:11:03] o/ [21:11:14] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/991417 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:11:19] technically I could also deploy but I am semi distracted. [21:11:28] so probably should let Roan do it. [21:12:05] (03Merged) 10jenkins-bot: ipoid: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/991417 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [21:13:41] !log bking@kafka-main1001 `kafka topics --alter --topic codfw.cirrussearch.update_pipeline.update.rc0 --partitions 5` [21:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:45] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [21:13:49] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [21:15:19] !log bking@kafka-main1001 `kafka topics --alter --topic eqiad.cirrussearch.update_pipeline.update.rc0 --partitions 5` T354595 [21:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:32] T354595: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 [21:16:04] (03CR) 10Herron: [C: 03+1] sre: add mw edit failures alert [alerts] - 10https://gerrit.wikimedia.org/r/991007 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [21:16:08] !log bking@kafka-main1001 `kafka topics --alter --topic codfw.cirrussearch.update_pipeline.fetch_error.rc0 --partitions 5 [21:16:08] ` T354595 [21:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:29] (03CR) 10Herron: [C: 03+1] graphite: remove mw edit failures 
graphite alerts [puppet] - 10https://gerrit.wikimedia.org/r/991008 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [21:18:29] (03CR) 10Herron: [C: 03+1] "thanks for cleaning this up!" [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [21:20:05] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:51] OK I'm back, sorry that took a bit longer than expected [21:21:37] (03CR) 10Catrope: [C: 03+2] Fix text overflow in history page [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991049 (https://phabricator.wikimedia.org/T354218) (owner: 10Jdlrobson) [21:21:50] (03CR) 10Catrope: [C: 03+2] Update checkboxHack target node [skins/MinervaNeue] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991339 (https://phabricator.wikimedia.org/T354315) (owner: 10Jdlrobson) [21:22:02] (03CR) 10Catrope: [C: 03+2] Re-enable: "Temporarily disable isPreview in Parsoid's rendering"" [extensions/DiscussionTools] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991344 (owner: 10Subramanya Sastry) [21:22:29] Jdlrobson: Should your config change wait for the wmf branch cherry-picks to go first, or can it go now? 
[21:24:04] they can go alongside each other [21:24:28] OK, I'll do the config one now then, while we wait for the cherry-picks to make their way through Jenkins [21:24:33] sounds good to me [21:24:35] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990152 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:27:01] (03PS2) 10Catrope: Enable desktop history page for all mobile logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990152 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:27:05] I am here now. [21:27:07] (03CR) 10Catrope: [C: 03+2] Enable desktop history page for all mobile logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990152 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:27:53] (03Merged) 10jenkins-bot: Enable desktop history page for all mobile logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990152 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:28:00] (03CR) 10Herron: [C: 03+1] "LGTM!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/989187 (https://phabricator.wikimedia.org/T347262) (owner: 10Klausman) [21:28:07] RoanKattouw, as a new deployer (who hasn't yet done any deployment outside the training), I notice that you manually +2ed all patches at once vs. using the scap backport command on each patch. Are you optimizing this in some fashion by doing manual scap pulls, etc? 
[21:28:25] !log catrope@deploy2002 Started scap: Backport for [[gerrit:990152|Enable desktop history page for all mobile logged in users (T353388)]] [21:28:29] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [21:28:43] So for config patches, I use scap backport to +2 (in this case I forgot to rebase first so I had to rebase and re-+2), because CI is very fast [21:28:46] (03CR) 10Herron: [C: 03+1] Add Lift Wing recommendation-api-ng SLO (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/989187 (https://phabricator.wikimedia.org/T347262) (owner: 10Klausman) [21:28:59] ack. [21:29:00] For wmf branch backports, the CI process is very slow. So I manually +2 the patches, and then run scap backport later [21:29:09] This way the CI is parallelized rather than in series [21:29:43] okay .. so, scap backport will skip that step if it sees that a patch has already merged .. makes sense reg. parallelized CI. [21:30:04] !log catrope@deploy2002 jdlrobson and catrope: Backport for [[gerrit:990152|Enable desktop history page for all mobile logged in users (T353388)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:23] Yes exactly [21:30:33] noted. thanks. [21:30:48] Jdlrobson: Your config patch is ready for testing on the debug servers [21:31:14] my DT patch can go through ... i cannot do any test for it on mwdebug. [21:31:24] (whenever you are ready for that one). [21:31:46] RoanKattouw: thanks! looking! [21:33:28] RoanKattouw: LGTM! please sync! 
[21:34:12] RoanKattouw: +1 on your workflow :) [21:35:43] * hashar vanishes [21:37:49] !log catrope@deploy2002 jdlrobson and catrope: Continuing with sync [21:42:04] (03Abandoned) 10Andrea Denisse: admin: Add arthurtaylor to restricted [puppet] - 10https://gerrit.wikimedia.org/r/988133 (https://phabricator.wikimedia.org/T354049) (owner: 10Andrea Denisse) [21:42:14] (03Merged) 10jenkins-bot: Fix text overflow in history page [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - 10https://gerrit.wikimedia.org/r/991049 (https://phabricator.wikimedia.org/T354218) (owner: 10Jdlrobson) [21:42:17] (03Merged) 10jenkins-bot: Update checkboxHack target node [skins/MinervaNeue] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991339 (https://phabricator.wikimedia.org/T354315) (owner: 10Jdlrobson) [21:42:20] (03Merged) 10jenkins-bot: Re-enable: "Temporarily disable isPreview in Parsoid's rendering"" [extensions/DiscussionTools] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991344 (owner: 10Subramanya Sastry) [21:43:41] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:990152|Enable desktop history page for all mobile logged in users (T353388)]] (duration: 15m 15s) [21:43:46] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [21:45:33] OK I'm now going to deploy Jdlrobson's Minerva text overflow patch, his Minerva checkbox hack patch, and subbu's DiscussionTools patch, all at once (unfortunately scap backport won't let me do them one by one because they're all merged already) [21:45:54] !log catrope@deploy2002 Started scap: Backport for [[gerrit:991049|Fix text overflow in history page (T354218)]] [21:45:58] T354218: Long edit summaries spill out of container on history page - https://phabricator.wikimedia.org/T354218 [21:46:03] that was going to my next question about your optimized workflow :) [21:46:07] Not the best hygiene, but I didn't want this to take an hour [21:46:25] I can understand. 
[21:46:45] I think I might have been able to do them one by one with manual commands, but I don't know if those manual commands are still supported in the modern k8s world [21:47:14] !log bking@kafka-main2001 `kafka topics --alter --topic eqiad.cirrussearch.update_pipeline.update.rc0 --partitions 5` T354595 [21:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:18] T354595: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 [21:47:21] !log catrope@deploy2002 jdlrobson and catrope: Backport for [[gerrit:991049|Fix text overflow in history page (T354218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:47:35] Jdlrobson: Both of your Minerva patches are now ready for testing [21:47:44] subbu: And yours too, but I think you said you couldn't test it? [21:47:45] so, you are going to run scap-file on each patch's file? (assuming every patch only touches one file)? [21:47:52] yes, i cannot test. [21:48:22] RoanKattouw: on it! [21:48:36] Yes that's what I would have done, scap sync-file on each of the extension dirs (some of the patches touch multiple files, but they're all contained in one directory) [21:48:56] reg my patch, just for the record, i've already deployed that patch once before (for running visual diff tests) and reverted it (after) .. so, I am just repeating that process today .. it is safe to go. [21:49:24] RoanKattouw: both LGTM! [21:49:26] please sync! [21:49:30] !log catrope@deploy2002 jdlrobson and catrope: Continuing with sync [21:49:52] RoanKattouw, got it. so, in the off chance that of the 3 patches that merged now, one of them cannot proceed because mwdebug testing revealed that it is not safe ... you would then manually sync just the good ones and then revert the bad one? Or revert the bad one first and then do a sync of the rest once the revert completes? 
[21:50:22] !log bking@kafka-main2001 `kafka topics --alter --topic codfw.cirrussearch.update_pipeline.fetch_error.rc0 --partitions 5` T354595 [21:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:48] I think I would have reverted the bad one and then deployed that revert, letting the good ones ride along with that [21:50:57] ok. [21:51:09] Anyway this was less of a good idea than I thought it would be, so maybe not recommended [21:51:25] noted. :) but, it is good to understand the edge cases. [21:52:25] (03PS1) 10Jdlrobson: Use desktop history page HTML everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) [21:55:34] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:991049|Fix text overflow in history page (T354218)]] (duration: 09m 39s) [21:55:38] T354218: Long edit summaries spill out of container on history page - https://phabricator.wikimedia.org/T354218 [21:55:47] OK all done [21:57:45] thanks! [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240117T2200) [22:00:33] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10RobH) When I login to cr2-codfw, I cannot see the serial of the line card in quetion: ` robh@re0.cr1-codfw> show chassis hardware Hardware inventory: Item Version Part number Serial number Description... [22:01:33] !log bking@kafka-main2001 `kafka topics --alter --topic eqiad.cirrussearch.update_pipeline.fetch_error.rc0 --partitions 5` T354595 [22:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:40] T354595: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 [22:05:01] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10cmooney) >>! 
In T354732#9464722, @ayounsi wrote: > @papaul let's double check with @cmooney just in case, but we should recycle that linecard (and remove it from Netbox). Yeah let's recycle it if we've not paid support.... [22:05:54] Thanks for your help today RoanKattouw ! [22:16:40] (03Abandoned) 10Cwhite: udp2log: add simple benthos pipeline [puppet] - 10https://gerrit.wikimedia.org/r/984238 (https://phabricator.wikimedia.org/T353220) (owner: 10Cwhite) [22:21:34] (03CR) 10Cwhite: [C: 03+2] udp2log: amend demux.py to support the python3 runtime [puppet] - 10https://gerrit.wikimedia.org/r/984237 (https://phabricator.wikimedia.org/T353220) (owner: 10Cwhite) [22:22:05] (03PS1) 10BCornwall: Add markmonitor API username/password [labs/private] - 10https://gerrit.wikimedia.org/r/991426 (https://phabricator.wikimedia.org/T355190) [22:23:05] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Add markmonitor API username/password [labs/private] - 10https://gerrit.wikimedia.org/r/991426 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [22:30:13] (03CR) 10Cwhite: [C: 03+1] icinga: remove ldap-icinga remnants [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [22:32:37] (03CR) 10Dzahn: "Interesting to see this, I literally have an ancient ToDo to "check out MarkMonitor API, ask Traffic team if they are still interested in " [labs/private] - 10https://gerrit.wikimedia.org/r/991426 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [22:33:29] (03CR) 10Cwhite: [C: 03+1] "PCC OK https://puppet-compiler.wmflabs.org/output/991391/1151/" [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [22:35:28] (03CR) 10Dzahn: "I think you also have to set "acmechief_host" key. At least that was always combined with the puppet7 key so far or until recently." 
[puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [22:37:17] (03CR) 10Dzahn: "puppet/hieradata/hosts$ grep -r -A1 force_puppet7 *" [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [22:38:40] (03PS1) 10Ryan Kemper: wdqs graph-split: don't use subdomain [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) [22:39:18] (03CR) 10Ryan Kemper: microsites: create experimental microsite for WDQS graph split (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [22:41:32] (03PS4) 10Ryan Kemper: microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [22:42:04] (03CR) 10Bking: [C: 03+1] wdqs graph-split: don't use subdomain [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [22:42:44] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [22:42:56] (03Abandoned) 10Cwhite: logstash: move labels.trace to error.stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/939285 (https://phabricator.wikimedia.org/T339137) (owner: 10Cwhite) [22:44:36] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10Dzahn) a:05JWheeler-WMF→03Arrbee [22:47:01] (03PS1) 10Ryan Kemper: wdqs graph-split: add experimental svcs [dns] - 10https://gerrit.wikimedia.org/r/991429 (https://phabricator.wikimedia.org/T354662) [22:48:50] (03CR) 10Ryan Kemper: [C: 03+2] wdqs graph-split: don't use subdomain [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) 
[22:48:59] (03CR) 10Bking: [C: 03+1] wdqs graph-split: add experimental svcs [dns] - 10https://gerrit.wikimedia.org/r/991429 (https://phabricator.wikimedia.org/T354662) (owner: 10Ryan Kemper) [22:50:52] brett: I went ahead and puppet-merged https://gerrit.wikimedia.org/r/c/labs/private/+/991426/ [22:50:54] 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install es10[35-41] - https://phabricator.wikimedia.org/T355269 (10Jclark-ctr) [22:51:15] 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install es10[35-41] - https://phabricator.wikimedia.org/T355269 (10Jclark-ctr) [22:51:18] (03CR) 10Dzahn: wdqs graph-split: don't use subdomain (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [22:51:30] ryankemper: D'oh, thanks! [22:51:38] 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install es10[35-41] - https://phabricator.wikimedia.org/T355269 (10Jclark-ctr) [22:53:58] (03CR) 10Dzahn: wdqs graph-split: don't use subdomain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [22:54:08] (03CR) 10Ryan Kemper: [C: 03+2] microsites: create experimental microsite for WDQS graph split [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [22:54:27] 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install es10[35-41] - https://phabricator.wikimedia.org/T355269 (10Jclark-ctr) a:05Jclark-ctr→03Marostegui If you can update installation instructions and update preseed.yaml, and site.pp if needed Thanks [22:56:06] (03CR) 10Dzahn: wdqs graph-split: don't use subdomain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [23:03:35] (03CR) 10Ryan Kemper: [C: 03+2] "Nice, very elegant simplification" [puppet] - 10https://gerrit.wikimedia.org/r/991411 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [23:07:53] (03CR) 10Andrea 
Denisse: grafana: Ensure the grafana1002 host uses Puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [23:10:05] (03CR) 10Dzahn: grafana: Ensure the grafana1002 host uses Puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [23:12:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:12:40] (03PS2) 10Andrea Denisse: grafana: Ensure the grafana1002 host uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) [23:13:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:18:17] (03PS3) 10Andrea Denisse: grafana: Ensure the grafana1002 hosts uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) [23:18:43] (ProbeDown) firing: (2) Service miscweb2003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:19:08] (03PS4) 10Andrea Denisse: grafana: Ensure the grafana1002 hosts uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) [23:19:40] (03CR) 10Andrea Denisse: grafana: Ensure the grafana1002 hosts uses Puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [23:23:43] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes 
(http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:30:21] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:32:23] (03CR) 10Dzahn: [C: 03+1] grafana: Ensure the grafana1002 hosts uses Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [23:34:23] (03CR) 10Dzahn: "This is creating monitoring alerts and tickets:" [puppet] - 10https://gerrit.wikimedia.org/r/991412 (https://phabricator.wikimedia.org/T354658) (owner: 10Gehel) [23:36:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [23:36:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [23:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P54861 and previous config saved to /var/cache/conftool/dbconfig/20240117-233655-ladsgroup.json [23:36:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:42:32] hnowlan: Hey, could you take care of deploying https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/972055 ?