[00:00:48] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.161 second response time https://wikitech.wikimedia.org/wiki/Swift [00:01:45] (03PS1) 10Dzahn: scap: move firewall rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) [00:02:20] (03CR) 10CI reject: [V: 04-1] scap: move firewall rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [00:04:36] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P41969 and previous config saved to /var/cache/conftool/dbconfig/20221201-000458-ladsgroup.json [00:07:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage [00:07:53] 10SRE: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (10Dzahn) >>! In T119274#1861105, @Reedy wrote: > This was filed for T93531 Hey Reedy, let me respond 7 years later. You can still search for it https://www.google.co.uk/search?q=site:secure.wikimedia.org but T93531 has b... 
[00:08:24] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [00:10:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage [00:14:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T322618)', diff saved to https://phabricator.wikimedia.org/P41970 and previous config saved to /var/cache/conftool/dbconfig/20221201-001427-ladsgroup.json [00:14:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [00:14:35] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:14:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [00:14:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T322618)', diff saved to https://phabricator.wikimedia.org/P41971 and previous config saved to /var/cache/conftool/dbconfig/20221201-001449-ladsgroup.json [00:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T322618)', diff saved to https://phabricator.wikimedia.org/P41972 and previous config saved to /var/cache/conftool/dbconfig/20221201-001659-ladsgroup.json [00:19:09] (03PS1) 10Andrew Bogott: oslo_messaging_rabbit: increase retry and backoff by a lot [puppet] - 10https://gerrit.wikimedia.org/r/862389 (https://phabricator.wikimedia.org/T318816) [00:19:55] (03PS2) 10Ssingh: cp5026: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861914 (https://phabricator.wikimedia.org/T322048) [00:20:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P41973 and previous 
config saved to /var/cache/conftool/dbconfig/20221201-002005-ladsgroup.json [00:21:50] (03CR) 10Andrew Bogott: [C: 03+2] oslo_messaging_rabbit: increase retry and backoff by a lot [puppet] - 10https://gerrit.wikimedia.org/r/862389 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [00:22:29] 10SRE, 10serviceops-collab, 10serviceops-radar: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (10Dzahn) [00:23:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1206.eqiad.wmnet with OS bullseye [00:23:39] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1206.eqiad.wmnet with OS bullseye completed: - db1206 (**PASS**) - Removed from Puppet and Puppe... [00:23:53] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) [00:24:09] 10SRE, 10serviceops-collab, 10serviceops-radar: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (10Dzahn) I went to https://superset.wikimedia.org and then tried the dashboard "Webrequest Sampled 128 | SRE" that @volans just showed us in an SRE presentation. I filtered by... 
[00:24:15] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) 05Open→03Resolved @Marostegui this is complete [00:24:28] (03CR) 10Ssingh: [C: 03+2] cp5026: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861914 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [00:25:15] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:25:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS buster [00:25:46] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS buster [00:26:27] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Swift [00:26:30] 10SRE, 10serviceops-collab, 10serviceops-radar: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (10Dzahn) 05Open→03Resolved a:03Dzahn hits on `secure.wikimedia.org` from "1 month ago" until "now". {F35826054} Biggest referer is MIT by the way. That being said, I th... 
[00:26:33] 10SRE, 10SEO: secure.wikimedia.org entries still showing up in Google search results - https://phabricator.wikimedia.org/T93531 (10Dzahn) [00:26:35] 10SRE, 10Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790 (10Dzahn) [00:27:38] 10SRE, 10SEO: secure.wikimedia.org entries still showing up in Google search results - https://phabricator.wikimedia.org/T93531 (10Dzahn) T119274#8434032 [00:29:06] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846 (10Dzahn) [00:29:35] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846 (10Dzahn) I think we should either decline this OR redirect to gitlab OR to gerrit, just definitely not to Phabricator anymore. [00:30:41] 10SRE, 10Deployments, 10Infrastructure-Foundations, 10serviceops-radar: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (10Dzahn) [00:31:03] 10SRE, 10Deployments, 10Infrastructure-Foundations, 10serviceops-radar: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (10Dzahn) Probably this means it should be created with `systemd::sysuser` in puppet nowadays. 
[00:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P41974 and previous config saved to /var/cache/conftool/dbconfig/20221201-003205-ladsgroup.json [00:32:14] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610 (10Dzahn) [00:34:22] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610 (10Dzahn) This is old but I think it can be translated to "create all system users with systemd::sysuser in puppet" nowadays. [00:35:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T322618)', diff saved to https://phabricator.wikimedia.org/P41975 and previous config saved to /var/cache/conftool/dbconfig/20221201-003511-ladsgroup.json [00:35:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [00:35:19] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:35:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [00:35:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T322618)', diff saved to https://phabricator.wikimedia.org/P41976 and previous config saved to /var/cache/conftool/dbconfig/20221201-003533-ladsgroup.json [00:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T322618)', diff saved to https://phabricator.wikimedia.org/P41977 and previous config saved to /var/cache/conftool/dbconfig/20221201-003941-ladsgroup.json [00:40:24] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, 
status=None)) after connection broken by ProtocolError(Connection aborted., ConnectionResetError(104, Connection reset by peer)): /en.wikipedia.org/v1/page/featured/2016/04/29 https://wikitech.wikimedia.org/wiki/Wikifeeds [00:40:50] (03PS2) 10Ssingh: cp5027: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861915 (https://phabricator.wikimedia.org/T322048) [00:42:00] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:42:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10Dzahn) It's probably unrealistic to see this ticket closed as resolved ever. We could close it and I would be fine with that or we can... [00:43:44] 10SRE, 10Diffusion, 10Release-Engineering-Team, 10serviceops-collab: svn.wikimedia.org redirects to Diffusion main page, hence hard to find e.g. 
"flexbisonparse" - https://phabricator.wikimedia.org/T140594 (10Dzahn) [00:44:12] 10SRE, 10Traffic-Icebox, 10Wikimedia-Planet, 10serviceops-collab, 10Patch-For-Review: mixed-content issues on planet.wikimedia.org - https://phabricator.wikimedia.org/T141480 (10Dzahn) [00:45:31] 10SRE, 10Infrastructure-Foundations, 10Release-Engineering-Team: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827 (10Dzahn) [00:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P41978 and previous config saved to /var/cache/conftool/dbconfig/20221201-004712-ladsgroup.json [00:48:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:48:50] 10SRE, 10Cloud-Services, 10Domains, 10Education-Program-Dashboard, 10Traffic-Icebox: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (10Dzahn) [00:50:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [00:50:31] (03PS4) 10Eevans: Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis) [00:50:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:51:12] (03CR) 10CI reject: [V: 04-1] 
Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis) [00:51:15] 10SRE, 10Cloud-Services, 10Domains, 10Education-Program-Dashboard, 10Traffic-Icebox: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (10Dzahn) status here as of today is: https://dash.wmflabs.org/ exists but shows an error because no proxy is configured h... [00:51:50] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1396 days) https://wikitech.wikimedia.org/wiki/Logs [00:52:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:52:46] 10SRE, 10Cloud-Services, 10Domains, 10Education-Program-Dashboard, 10Traffic-Icebox: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (10Dzahn) I added Cloud-Service because it seems to me this needs an admin from https://openstack-browser.toolforge.org/proj... 
[00:53:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [00:53:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:54:36] PROBLEM - Check systemd state on kubernetes2014 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P41979 and previous config saved to /var/cache/conftool/dbconfig/20221201-005447-ladsgroup.json [00:55:20] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.108 second response time https://wikitech.wikimedia.org/wiki/Swift [00:55:25] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846 (10bd808) >>! In T119846#8434042, @Dzahn wrote: > I think we should either decline this OR redirect to gitlab OR to gerrit, j... [00:55:42] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [00:55:55] 10Puppet, 10SRE, 10Infrastructure-Foundations: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10Dzahn) This is yet another one where I would call it resolved once the user is created by systemd::sysuser. 
[00:56:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [00:56:53] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10Dzahn) [00:57:38] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:59:42] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [01:00:44] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:01:24] PROBLEM - Check systemd state on ml-serve2008 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T322618)', diff saved to https://phabricator.wikimedia.org/P41980 and previous config saved to /var/cache/conftool/dbconfig/20221201-010219-ladsgroup.json [01:02:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance [01:02:27] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [01:02:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance [01:02:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P41981 and previous config saved to /var/cache/conftool/dbconfig/20221201-010240-ladsgroup.json [01:04:16] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.186 second response time https://wikitech.wikimedia.org/wiki/Swift [01:04:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41982 and previous config saved to /var/cache/conftool/dbconfig/20221201-010450-ladsgroup.json [01:08:48] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.136 second response time https://wikitech.wikimedia.org/wiki/Swift [01:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P41983 and previous config saved to /var/cache/conftool/dbconfig/20221201-010954-ladsgroup.json [01:11:28] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [01:11:34] RECOVERY - Check systemd state on ml-serve2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:22] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:52] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.150 second response time https://wikitech.wikimedia.org/wiki/Swift [01:18:02] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:18:22] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - free space: / 3538 MB (3% inode=80%): /tmp 3538 MB (3% inode=80%): /var/tmp 3538 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [01:19:26] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [01:19:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P41984 and previous config saved to /var/cache/conftool/dbconfig/20221201-011957-ladsgroup.json [01:20:06] RECOVERY - Check systemd state on kubernetes2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:54] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Swift [01:24:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS buster [01:25:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T322618)', diff saved to https://phabricator.wikimedia.org/P41985 and previous config saved to /var/cache/conftool/dbconfig/20221201-012500-ladsgroup.json [01:25:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [01:25:05] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS buster completed: - cp5026 
(**PASS**) -... [01:25:09] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [01:25:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [01:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T322618)', diff saved to https://phabricator.wikimedia.org/P41986 and previous config saved to /var/cache/conftool/dbconfig/20221201-012522-ladsgroup.json [01:26:10] (03CR) 10Wugapodes: Add ContactPage and ArbCom form to EnWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860946 (https://phabricator.wikimedia.org/T321447) (owner: 10Wugapodes) [01:26:18] (03CR) 10Ssingh: [C: 03+2] cp5027: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861915 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [01:26:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T322618)', diff saved to https://phabricator.wikimedia.org/P41987 and previous config saved to /var/cache/conftool/dbconfig/20221201-012630-ladsgroup.json [01:27:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster [01:27:11] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster [01:27:50] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.107 second response time https://wikitech.wikimedia.org/wiki/Swift [01:28:38] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm 
[01:30:30] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.294 second response time https://wikitech.wikimedia.org/wiki/Swift [01:30:44] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:35:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P41988 and previous config saved to /var/cache/conftool/dbconfig/20221201-013503-ladsgroup.json [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P41989 and previous config saved to /var/cache/conftool/dbconfig/20221201-014136-ladsgroup.json [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:04] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift [01:47:36] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:49:00] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift [01:50:10] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41990 and previous config saved to /var/cache/conftool/dbconfig/20221201-015010-ladsgroup.json [01:50:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [01:50:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [01:50:18] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [01:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41991 and previous config saved to /var/cache/conftool/dbconfig/20221201-015020-ladsgroup.json [01:50:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [01:51:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [01:51:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P41992 and previous config saved to /var/cache/conftool/dbconfig/20221201-015115-ladsgroup.json [01:51:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:52:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41993 and previous config saved to /var/cache/conftool/dbconfig/20221201-015230-ladsgroup.json [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [01:53:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [01:53:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [01:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P41994 and previous config saved to /var/cache/conftool/dbconfig/20221201-015332-ladsgroup.json [01:53:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [01:53:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P41995 and previous config saved to /var/cache/conftool/dbconfig/20221201-015340-ladsgroup.json [01:54:16] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:55:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [01:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P41996 and previous config saved to /var/cache/conftool/dbconfig/20221201-015550-ladsgroup.json [01:56:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P41997 and previous config saved to /var/cache/conftool/dbconfig/20221201-015643-ladsgroup.json [01:57:52] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time 
https://wikitech.wikimedia.org/wiki/Swift [01:58:01] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cephosd - cmjohnson@cumin1001" [01:59:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cephosd - cmjohnson@cumin1001" [01:59:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:00:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10Cmjohnson) [02:03:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [02:03:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [02:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P41998 and previous config saved to /var/cache/conftool/dbconfig/20221201-020308-ladsgroup.json [02:03:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [02:03:18] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:03:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [02:04:00] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:04:14] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift [02:05:56] RECOVERY - Swift https 
frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [02:07:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P41999 and previous config saved to /var/cache/conftool/dbconfig/20221201-020737-ladsgroup.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [02:09:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [02:09:44] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [02:10:24] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.118 second response time https://wikitech.wikimedia.org/wiki/Swift [02:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42000 and previous config saved to /var/cache/conftool/dbconfig/20221201-021057-ladsgroup.json [02:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T322618)', diff saved to https://phabricator.wikimedia.org/P42001 and previous config saved to /var/cache/conftool/dbconfig/20221201-021149-ladsgroup.json [02:11:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [02:11:54] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord - cmjohnson@cumin1001" [02:11:57] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - 
https://phabricator.wikimedia.org/T322618 [02:12:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [02:12:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T322618)', diff saved to https://phabricator.wikimedia.org/P42002 and previous config saved to /var/cache/conftool/dbconfig/20221201-021211-ladsgroup.json [02:12:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord - cmjohnson@cumin1001" [02:12:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:13:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T322618)', diff saved to https://phabricator.wikimedia.org/P42003 and previous config saved to /var/cache/conftool/dbconfig/20221201-021318-ladsgroup.json [02:14:14] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:16:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Ladsgroup) Thanks Papaul! 
@Marostegui: When provisioning this for production, I'd really appreciate if I can shadow you to learn how we add a db to rotation. Please 🥺 [02:20:30] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:20:59] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5027.eqsin.wmnet with OS buster [02:21:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:21:09] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster executed with errors: - cp5027 (**... 
[02:21:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster [02:21:37] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster [02:21:52] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5027.eqsin.wmnet with OS buster [02:22:01] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster executed with errors: - cp5027 (**... [02:22:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster [02:22:25] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster [02:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P42004 and previous config saved to /var/cache/conftool/dbconfig/20221201-022244-ladsgroup.json [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:26:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42005 and previous config saved to /var/cache/conftool/dbconfig/20221201-022603-ladsgroup.json [02:26:36] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.248 second response time https://wikitech.wikimedia.org/wiki/Swift [02:27:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Cmjohnson) [02:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P42006 and previous config saved to /var/cache/conftool/dbconfig/20221201-022825-ladsgroup.json [02:30:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42007 and previous config saved to /var/cache/conftool/dbconfig/20221201-023027-ladsgroup.json [02:30:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:32:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [02:33:21] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5027.eqsin.wmnet with OS buster [02:33:30] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster executed with errors: - cp5027 (**... 
[02:33:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster [02:35:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:37:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P42008 and previous config saved to /var/cache/conftool/dbconfig/20221201-023750-ladsgroup.json [02:37:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [02:37:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [02:37:58] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [02:38:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42009 and previous config saved to /var/cache/conftool/dbconfig/20221201-023801-ladsgroup.json [02:40:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42010 and previous config saved to /var/cache/conftool/dbconfig/20221201-024011-ladsgroup.json [02:40:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', 
diff saved to https://phabricator.wikimedia.org/P42011 and previous config saved to /var/cache/conftool/dbconfig/20221201-024110-ladsgroup.json [02:41:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [02:41:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:41:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [02:41:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P42012 and previous config saved to /var/cache/conftool/dbconfig/20221201-024131-ladsgroup.json [02:43:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P42013 and previous config saved to /var/cache/conftool/dbconfig/20221201-024331-ladsgroup.json [02:43:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P42014 and previous config saved to /var/cache/conftool/dbconfig/20221201-024341-ladsgroup.json [02:45:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42015 and previous config saved to /var/cache/conftool/dbconfig/20221201-024533-ladsgroup.json [02:48:57] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [02:53:55] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.300 second response time https://wikitech.wikimedia.org/wiki/Swift [02:55:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to 
https://phabricator.wikimedia.org/P42016 and previous config saved to /var/cache/conftool/dbconfig/20221201-025517-ladsgroup.json [02:55:33] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [02:57:43] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Swift [02:58:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T322618)', diff saved to https://phabricator.wikimedia.org/P42017 and previous config saved to /var/cache/conftool/dbconfig/20221201-025838-ladsgroup.json [02:58:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [02:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42018 and previous config saved to /var/cache/conftool/dbconfig/20221201-025848-ladsgroup.json [02:58:51] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [02:58:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [02:59:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T322618)', diff saved to https://phabricator.wikimedia.org/P42019 and previous config saved to /var/cache/conftool/dbconfig/20221201-025900-ladsgroup.json [03:00:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T322618)', diff saved to https://phabricator.wikimedia.org/P42020 and previous config saved to /var/cache/conftool/dbconfig/20221201-030007-ladsgroup.json [03:00:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to 
https://phabricator.wikimedia.org/P42021 and previous config saved to /var/cache/conftool/dbconfig/20221201-030040-ladsgroup.json [03:01:23] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.163 second response time https://wikitech.wikimedia.org/wiki/Swift [03:03:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage [03:03:45] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.231 second response time https://wikitech.wikimedia.org/wiki/Swift [03:05:39] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift [03:06:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage [03:09:22] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.276 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P42022 and previous config saved to /var/cache/conftool/dbconfig/20221201-031024-ladsgroup.json [03:12:22] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [03:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42023 and previous config saved to /var/cache/conftool/dbconfig/20221201-031354-ladsgroup.json [03:15:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P42024 and previous config saved to 
/var/cache/conftool/dbconfig/20221201-031514-ladsgroup.json [03:15:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42025 and previous config saved to /var/cache/conftool/dbconfig/20221201-031546-ladsgroup.json [03:15:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [03:15:52] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.138 second response time https://wikitech.wikimedia.org/wiki/Swift [03:15:54] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:16:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [03:16:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42026 and previous config saved to /var/cache/conftool/dbconfig/20221201-031608-ladsgroup.json [03:18:14] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42027 and previous config saved to /var/cache/conftool/dbconfig/20221201-032531-ladsgroup.json [03:25:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [03:25:40] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [03:25:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet 
with reason: Maintenance [03:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T322618)', diff saved to https://phabricator.wikimedia.org/P42028 and previous config saved to /var/cache/conftool/dbconfig/20221201-032553-ladsgroup.json [03:28:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T322618)', diff saved to https://phabricator.wikimedia.org/P42029 and previous config saved to /var/cache/conftool/dbconfig/20221201-032803-ladsgroup.json [03:29:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P42030 and previous config saved to /var/cache/conftool/dbconfig/20221201-032901-ladsgroup.json [03:29:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [03:29:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:29:10] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [03:29:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [03:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P42031 and previous config saved to /var/cache/conftool/dbconfig/20221201-032922-ladsgroup.json [03:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P42032 and previous config saved to /var/cache/conftool/dbconfig/20221201-033020-ladsgroup.json [03:31:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P42033 and previous config saved 
to /var/cache/conftool/dbconfig/20221201-033132-ladsgroup.json [03:34:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [03:34:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [03:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T323907)', diff saved to https://phabricator.wikimedia.org/P42034 and previous config saved to /var/cache/conftool/dbconfig/20221201-033449-ladsgroup.json [03:34:56] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [03:35:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5027.eqsin.wmnet with OS buster [03:36:44] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Swift [03:37:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42035 and previous config saved to /var/cache/conftool/dbconfig/20221201-033710-ladsgroup.json [03:40:44] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [03:43:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P42036 and previous config saved to /var/cache/conftool/dbconfig/20221201-034309-ladsgroup.json [03:44:38] (03PS1) 10Andrew Bogott: Add partman rules for cloudvirt10[54-61] [puppet] - 10https://gerrit.wikimedia.org/r/862406 (https://phabricator.wikimedia.org/T313983) [03:45:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T322618)', diff saved to https://phabricator.wikimedia.org/P42037 and 
previous config saved to /var/cache/conftool/dbconfig/20221201-034527-ladsgroup.json [03:45:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:45:35] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [03:45:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) [03:45:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:45:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [03:46:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [03:46:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T322618)', diff saved to https://phabricator.wikimedia.org/P42038 and previous config saved to /var/cache/conftool/dbconfig/20221201-034627-ladsgroup.json [03:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42039 and previous config saved to /var/cache/conftool/dbconfig/20221201-034639-ladsgroup.json [03:47:00] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.118 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T322618)', diff saved to https://phabricator.wikimedia.org/P42040 and previous config saved to /var/cache/conftool/dbconfig/20221201-034734-ladsgroup.json [03:48:54] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 
0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [03:52:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42041 and previous config saved to /var/cache/conftool/dbconfig/20221201-035216-ladsgroup.json [03:55:06] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42042 and previous config saved to /var/cache/conftool/dbconfig/20221201-035512-ladsgroup.json [03:55:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:56:46] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1652 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:57:56] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1652 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:58:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P42043 and previous config saved to /var/cache/conftool/dbconfig/20221201-035816-ladsgroup.json [04:01:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42044 and previous config saved to /var/cache/conftool/dbconfig/20221201-040145-ladsgroup.json 
[04:02:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P42045 and previous config saved to /var/cache/conftool/dbconfig/20221201-040240-ladsgroup.json [04:06:37] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 26020 bytes in 1.628 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:07:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42046 and previous config saved to /var/cache/conftool/dbconfig/20221201-040723-ladsgroup.json [04:10:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42047 and previous config saved to /var/cache/conftool/dbconfig/20221201-041018-ladsgroup.json [04:13:17] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26018 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:13:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T322618)', diff saved to https://phabricator.wikimedia.org/P42048 and previous config saved to /var/cache/conftool/dbconfig/20221201-041322-ladsgroup.json [04:13:30] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [04:13:45] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [04:16:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P42049 and previous config saved to /var/cache/conftool/dbconfig/20221201-041652-ladsgroup.json [04:16:54] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [04:16:59] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:17:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [04:17:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [04:17:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P42050 and previous config saved to /var/cache/conftool/dbconfig/20221201-041747-ladsgroup.json [04:17:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [04:17:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [04:17:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [04:17:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P42051 and previous config saved to /var/cache/conftool/dbconfig/20221201-041758-ladsgroup.json [04:18:23] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P42052 and previous config saved to /var/cache/conftool/dbconfig/20221201-042008-ladsgroup.json [04:22:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42053 and 
previous config saved to /var/cache/conftool/dbconfig/20221201-042229-ladsgroup.json [04:22:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [04:22:37] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:22:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [04:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T323907)', diff saved to https://phabricator.wikimedia.org/P42054 and previous config saved to /var/cache/conftool/dbconfig/20221201-042251-ladsgroup.json [04:23:15] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift [04:25:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42055 and previous config saved to /var/cache/conftool/dbconfig/20221201-042525-ladsgroup.json [04:32:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T322618)', diff saved to https://phabricator.wikimedia.org/P42056 and previous config saved to /var/cache/conftool/dbconfig/20221201-043253-ladsgroup.json [04:32:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [04:33:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [04:33:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [04:33:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T322618)', diff saved to https://phabricator.wikimedia.org/P42057 and previous config saved to /var/cache/conftool/dbconfig/20221201-043315-ladsgroup.json 
[04:33:35] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [04:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T322618)', diff saved to https://phabricator.wikimedia.org/P42058 and previous config saved to /var/cache/conftool/dbconfig/20221201-043422-ladsgroup.json [04:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42059 and previous config saved to /var/cache/conftool/dbconfig/20221201-043514-ladsgroup.json [04:37:11] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:39:45] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [04:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42060 and previous config saved to /var/cache/conftool/dbconfig/20221201-044031-ladsgroup.json [04:40:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:40:37] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [04:40:39] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:40:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:40:53] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1113:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42061 and previous config saved to /var/cache/conftool/dbconfig/20221201-044053-ladsgroup.json [04:49:01] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.139 second response time https://wikitech.wikimedia.org/wiki/Swift [04:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P42062 and previous config saved to /var/cache/conftool/dbconfig/20221201-044929-ladsgroup.json [04:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42063 and previous config saved to /var/cache/conftool/dbconfig/20221201-045020-ladsgroup.json [04:52:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:53:35] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift [05:03:53] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.061 second response time https://wikitech.wikimedia.org/wiki/Swift [05:03:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861874 (https://phabricator.wikimedia.org/T324179) [05:04:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P42064 and previous config saved to /var/cache/conftool/dbconfig/20221201-050435-ladsgroup.json [05:05:17] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 
200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [05:05:27] (03Abandoned) 10Ladsgroup: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861874 (https://phabricator.wikimedia.org/T324179) (owner: 10Gerrit maintenance bot) [05:05:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P42065 and previous config saved to /var/cache/conftool/dbconfig/20221201-050527-ladsgroup.json [05:05:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [05:05:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:05:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [05:05:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42066 and previous config saved to /var/cache/conftool/dbconfig/20221201-050548-ladsgroup.json [05:06:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T323907)', diff saved to https://phabricator.wikimedia.org/P42067 and previous config saved to /var/cache/conftool/dbconfig/20221201-050600-ladsgroup.json [05:06:08] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:06:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42068 and previous config saved to /var/cache/conftool/dbconfig/20221201-050658-ladsgroup.json [05:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T323907)', diff saved to https://phabricator.wikimedia.org/P42069 and previous config saved to 
/var/cache/conftool/dbconfig/20221201-050818-ladsgroup.json [05:10:50] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861875 (https://phabricator.wikimedia.org/T324180) [05:15:35] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.256 second response time https://wikitech.wikimedia.org/wiki/Swift [05:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42070 and previous config saved to /var/cache/conftool/dbconfig/20221201-051640-ladsgroup.json [05:16:49] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:18:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [05:19:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T322618)', diff saved to https://phabricator.wikimedia.org/P42071 and previous config saved to /var/cache/conftool/dbconfig/20221201-051942-ladsgroup.json [05:19:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [05:19:50] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [05:20:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [05:20:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T322618)', diff saved to https://phabricator.wikimedia.org/P42072 and previous config saved to /var/cache/conftool/dbconfig/20221201-052014-ladsgroup.json [05:20:41] RECOVERY - wikifeeds codfw on 
wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [05:21:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42073 and previous config saved to /var/cache/conftool/dbconfig/20221201-052107-ladsgroup.json [05:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42074 and previous config saved to /var/cache/conftool/dbconfig/20221201-052205-ladsgroup.json [05:22:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T322618)', diff saved to https://phabricator.wikimedia.org/P42075 and previous config saved to /var/cache/conftool/dbconfig/20221201-052223-ladsgroup.json [05:23:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42076 and previous config saved to /var/cache/conftool/dbconfig/20221201-052325-ladsgroup.json [05:24:05] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [05:25:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T322618)', diff saved to https://phabricator.wikimedia.org/P42077 and previous config saved to /var/cache/conftool/dbconfig/20221201-052524-ladsgroup.json [05:25:32] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [05:31:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42078 and previous config saved to /var/cache/conftool/dbconfig/20221201-053147-ladsgroup.json [05:34:27] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes 
in 7.229 second response time https://wikitech.wikimedia.org/wiki/Swift [05:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42079 and previous config saved to /var/cache/conftool/dbconfig/20221201-053613-ladsgroup.json [05:37:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42080 and previous config saved to /var/cache/conftool/dbconfig/20221201-053711-ladsgroup.json [05:38:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42081 and previous config saved to /var/cache/conftool/dbconfig/20221201-053831-ladsgroup.json [05:46:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42082 and previous config saved to /var/cache/conftool/dbconfig/20221201-054653-ladsgroup.json [05:49:14] (03PS1) 10KartikMistry: testwiki: Enable Section Translation for 15 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825) [05:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T323907)', diff saved to https://phabricator.wikimedia.org/P42083 and previous config saved to /var/cache/conftool/dbconfig/20221201-055120-ladsgroup.json [05:51:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [05:51:31] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:51:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [05:51:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P42084 and previous config saved to /var/cache/conftool/dbconfig/20221201-055142-ladsgroup.json [05:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42085 and previous config saved to /var/cache/conftool/dbconfig/20221201-055218-ladsgroup.json [05:52:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [05:52:25] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:52:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [05:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42086 and previous config saved to /var/cache/conftool/dbconfig/20221201-055239-ladsgroup.json [05:53:23] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret [05:53:23] e unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read article [05:53:23] nuary 1, 2016 (with aggregated=true) returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [05:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T323907)', diff saved to https://phabricator.wikimedia.org/P42087 and previous config saved to /var/cache/conftool/dbconfig/20221201-055337-ladsgroup.json [05:53:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [05:53:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42088 and previous config saved to /var/cache/conftool/dbconfig/20221201-055349-ladsgroup.json [05:53:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [05:54:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T323907)', diff saved to https://phabricator.wikimedia.org/P42089 and previous config saved to /var/cache/conftool/dbconfig/20221201-055359-ladsgroup.json [05:56:55] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [05:57:33] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [06:01:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 37 hosts with reason: Primary switchover s1 T323547 [06:01:22] T323547: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T323547 [06:01:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 37 hosts with reason: Primary switchover s1 T323547 [06:01:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1118 with weight 0 T323547', diff saved to https://phabricator.wikimedia.org/P42090 and previous config 
saved to /var/cache/conftool/dbconfig/20221201-060157-ladsgroup.json [06:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42091 and previous config saved to /var/cache/conftool/dbconfig/20221201-060206-ladsgroup.json [06:02:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [06:02:15] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:02:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [06:02:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T318605)', diff saved to https://phabricator.wikimedia.org/P42092 and previous config saved to /var/cache/conftool/dbconfig/20221201-060230-ladsgroup.json [06:03:11] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.080 second response time https://wikitech.wikimedia.org/wiki/Swift [06:06:23] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42093 and previous config saved to /var/cache/conftool/dbconfig/20221201-060855-ladsgroup.json [06:09:03] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:09] 10SRE, 10ops-codfw, 10DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) Thank you Papaul! 
[06:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318605)', diff saved to https://phabricator.wikimedia.org/P42094 and previous config saved to /var/cache/conftool/dbconfig/20221201-061218-ladsgroup.json [06:12:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:16:37] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.123 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:45] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Marostegui) [06:22:25] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Marostegui) [06:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42095 and previous config saved to /var/cache/conftool/dbconfig/20221201-062402-ladsgroup.json [06:27:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42096 and previous config saved to /var/cache/conftool/dbconfig/20221201-062724-ladsgroup.json [06:27:31] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.267 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:37] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Marostegui) Looks like @Damilare is already part of the `analytics-privatedata-users`: T319057 [06:29:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10Marostegui) 05Open→03Resolved I am going to close this,
please reopen if adding you to the WMF group wasn't enough. [06:30:37] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [06:30:55] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T323907)', diff saved to https://phabricator.wikimedia.org/P42097 and previous config saved to /var/cache/conftool/dbconfig/20221201-063055-ladsgroup.json [06:30:57] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [06:31:00] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:35:51] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [06:36:49] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [06:37:17] (03PS2) 10Ladsgroup: mariadb: Promote db1118 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/858382 (https://phabricator.wikimedia.org/T323547) (owner: 10Gerrit maintenance bot) [06:37:22] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1118 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/858382 (https://phabricator.wikimedia.org/T323547) (owner: 10Gerrit maintenance bot) [06:39:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42098 and previous config saved to /var/cache/conftool/dbconfig/20221201-063908-ladsgroup.json [06:39:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [06:39:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with 
reason: Maintenance [06:39:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42099 and previous config saved to /var/cache/conftool/dbconfig/20221201-063930-ladsgroup.json [06:40:43] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [06:41:13] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:41:17] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.253 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:21] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [06:41:37] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42100 and previous config saved to /var/cache/conftool/dbconfig/20221201-064140-ladsgroup.json [06:42:02] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [06:42:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Marostegui) @andrea.denisse - can you merge and submit your patch so we can create the kerberos principal and close this task? Thanks! 
[06:42:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Marostegui) a:05Jcross→03andrea.denisse [06:42:19] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [06:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42101 and previous config saved to /var/cache/conftool/dbconfig/20221201-064230-ladsgroup.json [06:43:41] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:52] (03Abandoned) 10Giuseppe Lavagetto: WIP: Add cassandra-table-properties tool to Cassandra deployments [puppet] - 10https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553) (owner: 10Holger Knust) [06:45:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] function-evaluator: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860829 (owner: 10Giuseppe Lavagetto) [06:45:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] function-orchestrator: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860830 (owner: 10Giuseppe Lavagetto) [06:46:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42102 and previous config saved to /var/cache/conftool/dbconfig/20221201-064602-ladsgroup.json [06:50:13] (03Merged) 10jenkins-bot: function-evaluator: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860829 (owner: 10Giuseppe Lavagetto) [06:50:25] (03Merged) 10jenkins-bot: function-orchestrator: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860830 (owner: 10Giuseppe Lavagetto) [06:51:27] (03PS1) 10Marostegui: install_server: Do not reimage db1206 [puppet] -
10https://gerrit.wikimedia.org/r/862643 [06:56:24] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1206 [puppet] - 10https://gerrit.wikimedia.org/r/862643 (owner: 10Marostegui) [06:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42103 and previous config saved to /var/cache/conftool/dbconfig/20221201-065646-ladsgroup.json [06:57:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318605)', diff saved to https://phabricator.wikimedia.org/P42104 and previous config saved to /var/cache/conftool/dbconfig/20221201-065737-ladsgroup.json [06:57:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:57:40] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:57:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [07:00:04] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T0700). 
[07:00:37] let's get the party started [07:01:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42105 and previous config saved to /var/cache/conftool/dbconfig/20221201-070108-ladsgroup.json [07:01:21] !log Starting s1 eqiad failover from db1163 to db1118 - T323547 [07:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:24] T323547: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T323547 [07:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T323547', diff saved to https://phabricator.wikimedia.org/P42106 and previous config saved to /var/cache/conftool/dbconfig/20221201-070131-ladsgroup.json [07:01:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki-dev: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860713 (owner: 10Giuseppe Lavagetto) [07:02:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1118 to s1 primary and set section read-write T323547', diff saved to https://phabricator.wikimedia.org/P42107 and previous config saved to /var/cache/conftool/dbconfig/20221201-070203-ladsgroup.json [07:03:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tegola-vector-tiles: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860707 (owner: 10Giuseppe Lavagetto) [07:05:23] (03PS2) 10Ladsgroup: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/858383 (https://phabricator.wikimedia.org/T323547) (owner: 10Gerrit maintenance bot) [07:06:31] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:09] (03CR) 10Ladsgroup: [C: 03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/858383 (https://phabricator.wikimedia.org/T323547) (owner: 10Gerrit maintenance bot) [07:07:30] 
(03Merged) 10jenkins-bot: mediawiki-dev: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860713 (owner: 10Giuseppe Lavagetto) [07:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1163 T323547', diff saved to https://phabricator.wikimedia.org/P42108 and previous config saved to /var/cache/conftool/dbconfig/20221201-070758-ladsgroup.json [07:08:02] T323547: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T323547 [07:08:07] (03Merged) 10jenkins-bot: tegola-vector-tiles: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860707 (owner: 10Giuseppe Lavagetto) [07:09:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:09:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:11:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42109 and previous config saved to /var/cache/conftool/dbconfig/20221201-071153-ladsgroup.json [07:12:26] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [07:12:49] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.113 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:54] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [07:13:18] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [07:13:31] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [07:13:44] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [07:14:07] !log oblivian@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/tegola-vector-tiles: apply [07:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T323907)', diff saved to https://phabricator.wikimedia.org/P42110 and previous config saved to /var/cache/conftool/dbconfig/20221201-071615-ladsgroup.json [07:16:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:16:19] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:16:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:16:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [07:16:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [07:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T323907)', diff saved to https://phabricator.wikimedia.org/P42111 and previous config saved to /var/cache/conftool/dbconfig/20221201-071641-ladsgroup.json [07:16:51] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:18:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:19:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:19:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:19:35] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is 
CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [07:20:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:20:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:21:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] flink-session-cluster: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860828 (owner: 10Giuseppe Lavagetto) [07:21:39] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [07:22:27] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:46] (03CR) 10Giuseppe Lavagetto: "Please when you merge this change, remember to go to the puppet private repository and change the "tls:" stanzas for eventgate to "mesh:"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 (owner: 10Giuseppe Lavagetto) [07:23:13] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:26:05] (03Merged) 10jenkins-bot: flink-session-cluster: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860828 (owner: 10Giuseppe Lavagetto) [07:27:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42113 and previous config saved to /var/cache/conftool/dbconfig/20221201-072659-ladsgroup.json [07:27:04] T318605: Deploy new externallinks fields to production - 
https://phabricator.wikimedia.org/T318605 [07:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T323907)', diff saved to https://phabricator.wikimedia.org/P42114 and previous config saved to /var/cache/conftool/dbconfig/20221201-072914-ladsgroup.json [07:29:18] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:29:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:29:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:29:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:30:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:30:13] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.302 second response time https://wikitech.wikimedia.org/wiki/Swift [07:30:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T318605)', diff saved to https://phabricator.wikimedia.org/P42115 and previous config saved to /var/cache/conftool/dbconfig/20221201-073015-ladsgroup.json [07:35:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318605)', diff saved to https://phabricator.wikimedia.org/P42116 and previous config saved to /var/cache/conftool/dbconfig/20221201-073634-ladsgroup.json [07:36:38] T318605: Deploy new externallinks fields to 
production - https://phabricator.wikimedia.org/T318605 [07:37:27] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [07:41:13] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.095 second response time https://wikitech.wikimedia.org/wiki/Swift [07:43:01] (03CR) 10Muehlenhoff: scap: move firewall rules out of the module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [07:44:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42117 and previous config saved to /var/cache/conftool/dbconfig/20221201-074420-ladsgroup.json [07:49:15] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [07:49:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10MoritzMuehlenhoff) Whether the user is created via adduser or systemd::sysuser doesn't matter, the fix is to have a reserved UID defined via data.yaml in th... 
[07:51:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 400474 [07:51:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42118 and previous config saved to /var/cache/conftool/dbconfig/20221201-075140-ladsgroup.json [07:52:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 400474 [07:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P42119 and previous config saved to /var/cache/conftool/dbconfig/20221201-075506-ladsgroup.json [07:55:10] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T323907)', diff saved to https://phabricator.wikimedia.org/P42120 and previous config saved to /var/cache/conftool/dbconfig/20221201-075606-ladsgroup.json [07:56:09] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:59:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42122 and previous config saved to /var/cache/conftool/dbconfig/20221201-075927-ladsgroup.json [08:00:05] Amir1, apergos, and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T0800). [08:00:14] morning! once again there are no trainees signed up today for training and no patches scheduled for deployment during the window. so we'll see you next time! 
[08:01:07] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [08:05:18] (03CR) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [08:05:23] (03PS2) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) [08:06:08] (03CR) 10CI reject: [V: 04-1] Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [08:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42123 and previous config saved to /var/cache/conftool/dbconfig/20221201-080647-ladsgroup.json [08:07:27] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:10:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P42124 and previous config saved to /var/cache/conftool/dbconfig/20221201-081013-ladsgroup.json [08:11:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42125 and previous config saved to /var/cache/conftool/dbconfig/20221201-081112-ladsgroup.json [08:11:57] (03PS3) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) [08:12:39] (03CR) 10CI reject: [V: 04-1] Add Property (120) to Wikidata content 
Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [08:14:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T323907)', diff saved to https://phabricator.wikimedia.org/P42126 and previous config saved to /var/cache/conftool/dbconfig/20221201-081433-ladsgroup.json [08:14:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:14:38] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [08:14:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42127 and previous config saved to /var/cache/conftool/dbconfig/20221201-081444-ladsgroup.json [08:15:31] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [08:15:59] (03PS4) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) [08:20:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:21:51] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.254 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318605)', diff saved to https://phabricator.wikimedia.org/P42128 and previous config saved to 
/var/cache/conftool/dbconfig/20221201-082154-ladsgroup.json [08:21:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:21:58] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:22:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:22:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T318605)', diff saved to https://phabricator.wikimedia.org/P42129 and previous config saved to /var/cache/conftool/dbconfig/20221201-082215-ladsgroup.json [08:23:23] (03PS2) 10Giuseppe Lavagetto: eventgate: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 [08:23:25] (03PS1) 10Giuseppe Lavagetto: calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829 [08:24:47] (03CR) 10CI reject: [V: 04-1] calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829 (owner: 10Giuseppe Lavagetto) [08:25:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P42130 and previous config saved to /var/cache/conftool/dbconfig/20221201-082519-ladsgroup.json [08:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42131 and previous config saved to /var/cache/conftool/dbconfig/20221201-082619-ladsgroup.json [08:27:09] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [08:30:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:21] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.095 second response time https://wikitech.wikimedia.org/wiki/Swift [08:36:43] <_joe_> not sure how we have a backend on ms-fe [08:39:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318605)', diff saved to https://phabricator.wikimedia.org/P42134 and previous config saved to /var/cache/conftool/dbconfig/20221201-083914-ladsgroup.json [08:39:18] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:40:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P42135 and previous config saved to /var/cache/conftool/dbconfig/20221201-084026-ladsgroup.json [08:40:45] _joe_: check_https_url!ms-fe.svc.eqiad.wmnet!/monitoring/backend [08:41:07] <_joe_> volans: yeah saw on icinga's interface [08:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T323907)', diff saved to https://phabricator.wikimedia.org/P42136 and previous config saved to /var/cache/conftool/dbconfig/20221201-084125-ladsgroup.json [08:41:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:41:29] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [08:41:34] now what that does check... 
[08:41:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:41:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42137 and previous config saved to /var/cache/conftool/dbconfig/20221201-084147-ladsgroup.json [08:42:15] PROBLEM - Host mw1334 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:45] uh [08:43:03] apergos: I might backport something to wmf.12 if that's still possible [08:43:17] RECOVERY - Host mw1334 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [08:43:17] I'm checking the mgmt [08:44:13] _joe_: mw1334 just got rebooted [08:44:43] nothing in syslog, I'll check hw logs [08:44:55] <_joe_> uh [08:46:03] and ofc... ssh to mgmt doesn't work [08:46:16] troibleshooting [08:47:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:48:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:48:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:49:00] !log restart idrac on mw1334, ipmi and remote ipmi works fine, ssh not responding [08:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:49:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [08:50:27] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:51] (03PS1) 10Kosta Harlan: User impact: Fix per-page pageview numbers [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862354 (https://phabricator.wikimedia.org/T323253) [08:51:23] _joe_ / volans is it OK if I backport something 
to wmf.12 now, or would that interfere with what you are troublehsooting? [08:51:28] *troubleshooting, even [08:51:44] <_joe_> no go on please [08:51:47] kostajh: go on [08:52:01] thx [08:52:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:53:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862354 (https://phabricator.wikimedia.org/T323253) (owner: 10Kosta Harlan) [08:54:21] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [08:54:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42138 and previous config saved to /var/cache/conftool/dbconfig/20221201-085421-ladsgroup.json [08:55:59] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Swift [08:56:36] ACKNOWLEDGEMENT - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T324185 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:56:41] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T324185 (10ops-monitoring-bot) [09:00:43] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.140 second response time https://wikitech.wikimedia.org/wiki/Swift [09:01:08] !log jmm@cumin2002 END (FAIL) - 
Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2013.codfw.wmnet [09:02:21] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.098 second response time https://wikitech.wikimedia.org/wiki/Swift [09:02:39] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [09:04:45] Emperor: is this "normal" ^^^ [09:04:50] has been flapping for a bit [09:07:41] !log rebuilding raid on ganeti2013 T323222 [09:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:45] T323222: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 [09:08:21] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [09:08:27] volans: no :-/ [09:08:58] there is no mention of the backend monitoring on wikitech [09:09:08] so not sure how to further debug without starting reading code [09:09:18] (03Merged) 10jenkins-bot: User impact: Fix per-page pageview numbers [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862354 (https://phabricator.wikimedia.org/T323253) (owner: 10Kosta Harlan) [09:09:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42139 and previous config saved to /var/cache/conftool/dbconfig/20221201-090927-ladsgroup.json [09:09:52] ms-fe1011 looks only lightly loaded to me currently [09:09:52] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862354|User impact: Fix per-page pageview numbers (T323253)]] [09:09:55] T323253: NewImpact module: Page view data should be limited to when user made their first edit - https://phabricator.wikimedia.org/T323253 [09:10:21] (03PS1) 10Volans: setup.py: add temporary upper limit for pylint 
[cookbooks] - 10https://gerrit.wikimedia.org/r/862830 [09:10:27] but there are a bunch of errors in server.log [09:10:43] (03CR) 10David Caro: [C: 03+1] setup.py: add temporary upper limit for pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/862830 (owner: 10Volans) [09:11:01] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:11:08] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862354|User impact: Fix per-page pageview numbers (T323253)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:11:36] Emperor: let us (oncallers) know if we can be of any assistance [09:11:54] it's not a behaviour I've seen before [09:12:58] (03PS18) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [09:13:18] (03CR) 10Volans: [C: 03+2] setup.py: add temporary upper limit for pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/862830 (owner: 10Volans) [09:13:32] and the backtraces on ms-fe1011 are largely from inside python libraries rather than swift (IYSWIM) [09:14:30] seem to have started around 15:55:11 yesterday [09:14:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:14:41] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:15:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38543/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [09:15:09] !log depool, restart, repool swift-proxy on ms-fe1011 [09:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:19] (03Merged) 10jenkins-bot: setup.py: add temporary upper limit for pylint [cookbooks] - 
10https://gerrit.wikimedia.org/r/862830 (owner: 10Volans) [09:15:51] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:19] Emperor: as unrelated it might seem, the only thing that happened at that time was to shutdown thumbor2004 for idrac maintenance [09:16:35] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Swift [09:16:56] volans: that seems entirely plausibly related, is it still off? swift talks to thumbor so I could well believe thumbor being unhappy would be enough to make swift unhappy [09:17:03] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift [09:17:15] Emperor: but cross-dc? [09:17:36] thumbor2004 is up since 16h17m [09:18:05] (03PS19) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [09:18:24] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:862354|User impact: Fix per-page pageview numbers (T323253)]] (duration: 08m 31s) [09:18:27] T323253: NewImpact module: Page view data should be limited to when user made their first edit - https://phabricator.wikimedia.org/T323253 [09:19:11] !log UTC morning deploys done [09:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:33] volans: probably coincidence, then. 
ms-fe1011 has swift-proxy and nginx restarted and repooled, let's see how it goes [09:19:43] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38544/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [09:19:56] ack [09:20:40] (03CR) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [09:21:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:21:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:22:05] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:31] volans: seems to be behaving itself now (no more backtraces in server.log) [09:24:13] good! thanks [09:24:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [09:24:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318605)', diff saved to https://phabricator.wikimedia.org/P42140 and previous config saved to /var/cache/conftool/dbconfig/20221201-092434-ladsgroup.json [09:24:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [09:24:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:24:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [09:24:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42141 and previous config saved to /var/cache/conftool/dbconfig/20221201-092455-ladsgroup.json [09:27:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:29:29] 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10MoritzMuehlenhoff) [09:29:46] 10SRE, 10Infrastructure-Foundations, 10Release-Engineering-Team: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827 (10MoritzMuehlenhoff) 05Open→03Declined Mid-term data.yaml will be generated via the IDM which will include pr... [09:30:25] (03PS1) 10Slyngshede: ldap:client:utils remove outdated ldaplist util. 
[puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063) [09:32:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42142 and previous config saved to /var/cache/conftool/dbconfig/20221201-093214-ladsgroup.json [09:32:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:34:44] (03PS2) 10Slyngshede: ldap:client:utils remove outdated ldaplist util. [puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063) [09:40:25] (03PS7) 10David Caro: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 [09:41:00] (03PS7) 10David Caro: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [09:41:40] (03CR) 10David Caro: "It was due to https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/862830" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [09:43:55] (03PS1) 10Kosta Harlan: DatabaseUserImpactStore: Fix parameter style for upsert keys [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862355 (https://phabricator.wikimedia.org/T324188) [09:47:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42143 and previous config saved to /var/cache/conftool/dbconfig/20221201-094720-ladsgroup.json [09:49:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42144 and previous config saved to 
/var/cache/conftool/dbconfig/20221201-094907-ladsgroup.json [09:49:10] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [09:49:43] (03CR) 10David Caro: [C: 03+2] wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [09:49:57] (03CR) 10David Caro: [C: 03+2] wmcs.create_instance_with_prefix: Add a sec group default (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [09:50:28] (03PS1) 10Filippo Giunchedi: hiera: depool graphite1004 for reads [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) [09:51:07] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T324185 (10Kizule) [09:51:10] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10Kizule) [09:51:38] (03PS2) 10Filippo Giunchedi: decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) [09:52:23] (03CR) 10Muehlenhoff: [C: 03+1] "Good riddance!" 
[puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063) (owner: 10Slyngshede) [09:52:38] (03PS7) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [09:52:40] (03PS6) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) [09:52:42] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) [09:52:44] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686) [09:53:03] (03CR) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [09:53:39] (03CR) 10Muehlenhoff: [C: 03+2] Set role_contacts for role analytics_cluster::coordinator::replica [puppet] - 10https://gerrit.wikimedia.org/r/860608 (owner: 10Muehlenhoff) [09:53:46] (03Merged) 10jenkins-bot: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [09:54:10] (03CR) 10David Caro: [C: 03+2] harbor: remove support for (03CR) 10David Caro: [C: 03+2] harbor: remove unused harbor::db module/role [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [09:54:15] (03CR) 10David Caro: [C: 03+2] toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [09:54:18] (03CR) 10David Caro: [C: 03+2] harbor: ensure 
that it's started [puppet] - 10https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [09:56:09] RECOVERY - DPKG on grafana1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:56:44] (03CR) 10Sergio Gimeno: [C: 03+1] [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: 10Gergő Tisza) [09:57:00] (03CR) 10Sergio Gimeno: [C: 03+1] GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [09:57:20] (03PS1) 10Muehlenhoff: Make ganeti5004 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/862841 [09:58:21] (03CR) 10Kosta Harlan: [C: 04-2] GrowthExperiments: Run refreshUserImpactData maintenance script in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [09:58:39] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti5004 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/862841 (owner: 10Muehlenhoff) [10:00:26] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Feel free to merge at will." 
[puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:01:19] (03PS2) 10David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 [10:01:39] (03PS2) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) [10:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42145 and previous config saved to /var/cache/conftool/dbconfig/20221201-100227-ladsgroup.json [10:04:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42146 and previous config saved to /var/cache/conftool/dbconfig/20221201-100413-ladsgroup.json [10:04:25] (03CR) 10CI reject: [V: 04-1] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [10:04:55] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [10:09:03] (03PS3) 10Filippo Giunchedi: prometheus: move traffic rules to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) [10:09:09] (03PS3) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) [10:10:22] (03PS6) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [10:11:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, there's a few more places where 1004 is referenced (such as service::catalog in Hiera), but those can happen 
in subsequent com" [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [10:12:17] (03PS2) 10Giuseppe Lavagetto: calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829 [10:12:19] (03PS1) 10Giuseppe Lavagetto: Remove common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/862842 [10:12:33] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [10:12:48] (03CR) 10CI reject: [V: 04-1] Remove common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/862842 (owner: 10Giuseppe Lavagetto) [10:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42147 and previous config saved to /var/cache/conftool/dbconfig/20221201-101356-ladsgroup.json [10:14:00] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:16:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [10:16:05] (03CR) 10Filippo Giunchedi: [C: 03+2] hiera: depool graphite1004 for reads [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [10:16:29] (03PS5) 10Stevemunene: Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) [10:17:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42148 and previous config saved to /var/cache/conftool/dbconfig/20221201-101733-ladsgroup.json [10:17:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:17:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:17:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:17:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T318605)', diff saved to https://phabricator.wikimedia.org/P42149 and previous config saved to /var/cache/conftool/dbconfig/20221201-101754-ladsgroup.json [10:18:23] (03CR) 10Stevemunene: [C: 03+2] Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42150 and previous config saved to /var/cache/conftool/dbconfig/20221201-101920-ladsgroup.json [10:20:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [10:23:18] 10SRE, 10Infrastructure-Foundations, 10netops, 
10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) @dcaro also advised me this can be set for the monitor traffic also, with 'mon_use_min_delay_socket': https://github.com/c... [10:23:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318605)', diff saved to https://phabricator.wikimedia.org/P42151 and previous config saved to /var/cache/conftool/dbconfig/20221201-102357-ladsgroup.json [10:24:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:28:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [10:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42152 and previous config saved to /var/cache/conftool/dbconfig/20221201-102903-ladsgroup.json [10:34:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [10:34:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [10:34:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42153 and previous config saved to /var/cache/conftool/dbconfig/20221201-103426-ladsgroup.json [10:34:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:34:30] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:34:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T323907)', diff 
saved to https://phabricator.wikimedia.org/P42154 and previous config saved to /var/cache/conftool/dbconfig/20221201-103448-ladsgroup.json [10:35:07] (03PS1) 10Filippo Giunchedi: hiera: set thanos-web service to production [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913) [10:35:44] (03CR) 10CI reject: [V: 04-1] hiera: set thanos-web service to production [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [10:36:08] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert) [10:36:57] (03PS2) 10Filippo Giunchedi: hiera: set thanos-web service to production [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913) [10:37:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [10:39:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42155 and previous config saved to /var/cache/conftool/dbconfig/20221201-103903-ladsgroup.json [10:44:00] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) For the record, we decided to start with option 3 and we're starting with rollout phase 1, specifically we'll move test2.wikipedia.org to kubernetes first. 
[10:44:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42156 and previous config saved to /var/cache/conftool/dbconfig/20221201-104409-ladsgroup.json [10:44:13] (03CR) 10Filippo Giunchedi: [C: 03+2] "Oncall has been notified" [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [10:45:19] (03PS1) 10Giuseppe Lavagetto: trafficserver: move test2wiki to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) [10:45:49] 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) [10:45:53] 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) p:05Triage→03High [10:45:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:48:07] 10SRE, 10SRE-Access-Requests: Turnilo access request for User:wfan - https://phabricator.wikimedia.org/T324057 (10Marostegui) Thanks Greg!. I will wait for the correct template and then proceed. 
[10:48:17] (03PS1) 10Filippo Giunchedi: Revert "Revert thanos-web discovery record" [dns] - 10https://gerrit.wikimedia.org/r/862356 [10:52:46] (03PS1) 10Michael Große: Fix broken search with vector-2022 on www.wikidata.org [extensions/Wikibase] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862357 (https://phabricator.wikimedia.org/T324148) [10:53:01] (03CR) 10Vgutierrez: [C: 03+1] prometheus: move traffic rules to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [10:53:07] jouncebot: nowandnext [10:53:07] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [10:53:07] In 0 hour(s) and 6 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1100) [10:53:49] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Revert thanos-web discovery record" [dns] - 10https://gerrit.wikimedia.org/r/862356 (owner: 10Filippo Giunchedi) [10:53:55] (03PS2) 10Filippo Giunchedi: Revert "Revert thanos-web discovery record" [dns] - 10https://gerrit.wikimedia.org/r/862356 [10:54:03] does anyone mind if I backport “fix broken search with vector-2022 on www.wikidata.org” (just above) now, without waiting for the afternoon backport window? 
[10:54:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42157 and previous config saved to /var/cache/conftool/dbconfig/20221201-105410-ladsgroup.json [10:55:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [10:55:45] <_joe_> Lucas_WMDE: go ahead [10:55:52] ok, thanks [10:55:59] (KubernetesAPILatency) resolved: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:56:19] * MichaelG_WMDE is here as well [10:56:33] !log deleted knative controller + net-istio controllers on ml-serve-eqiad to clear out some weird state (causing high latencies for the k8s api) [10:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862357 (https://phabricator.wikimedia.org/T324148) (owner: 10Michael Große) [10:57:11] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-web [10:57:21] CI will probably take 15 minutes or so, so there’s plenty of time for someone else to object ;) [10:59:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42158 and previous config saved to /var/cache/conftool/dbconfig/20221201-105916-ladsgroup.json [10:59:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:59:20] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:59:31] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:59:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T323907)', diff saved to https://phabricator.wikimedia.org/P42159 and previous config saved to /var/cache/conftool/dbconfig/20221201-105938-ladsgroup.json [11:00:04] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1100) [11:00:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [11:00:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [11:01:50] (03CR) 10Vgutierrez: [C: 04-1] trafficserver: move test2wiki to kubernetes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [11:02:41] (03CR) 10Hnowlan: [C: 03+2] APIGW/Liftwing: Fix missing part of path regexen [deployment-charts] - 10https://gerrit.wikimedia.org/r/862311 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [11:05:07] 10Puppet, 10SRE, 10Infrastructure-Foundations: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10jbond) >>! In T95377#8433934, @Dzahn wrote: > @jbond and all. I wonder what you would think about this now in 2022. > > Are the barew... 
[11:05:37] 10SRE-OnFire, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10JMeybohm) [11:07:27] (03Merged) 10jenkins-bot: APIGW/Liftwing: Fix missing part of path regexen [deployment-charts] - 10https://gerrit.wikimedia.org/r/862311 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [11:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318605)', diff saved to https://phabricator.wikimedia.org/P42160 and previous config saved to /var/cache/conftool/dbconfig/20221201-110916-ladsgroup.json [11:09:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:09:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:09:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T318605)', diff saved to https://phabricator.wikimedia.org/P42161 and previous config saved to /var/cache/conftool/dbconfig/20221201-110938-ladsgroup.json [11:10:41] (03Merged) 10jenkins-bot: Fix broken search with vector-2022 on www.wikidata.org [extensions/Wikibase] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862357 (https://phabricator.wikimedia.org/T324148) (owner: 10Michael Große) [11:11:07] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:862357|Fix broken search with vector-2022 on www.wikidata.org (T324148)]] [11:11:10] T324148: search box not working with Vector 2022 on Wikidata - https://phabricator.wikimedia.org/T324148 [11:12:15] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:862357|Fix 
broken search with vector-2022 on www.wikidata.org (T324148)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [11:12:22] MichaelG_WMDE: ^ [11:12:26] testing on mwdebug [11:12:51] * MichaelG_WMDE looks [11:12:53] seems to work fine as far as I can tell [11:13:01] (03CR) 10Vgutierrez: [C: 03+1] "looking good, see comments about x-wikimedia-debug" [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [11:13:34] @Lucas_WMDE same! [11:13:39] ok, syncing [11:14:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:15:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:15:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318605)', diff saved to https://phabricator.wikimedia.org/P42162 and previous config saved to /var/cache/conftool/dbconfig/20221201-111542-ladsgroup.json [11:15:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:15:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:16:05] (03CR) 10Ilias Sarantopoulos: "This change is ready for review." 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [11:16:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, one optional request" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [11:16:52] (03PS2) 10Ilias Sarantopoulos: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [11:17:29] (03PS4) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) [11:18:03] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:862357|Fix broken search with vector-2022 on www.wikidata.org (T324148)]] (duration: 06m 56s) [11:18:07] T324148: search box not working with Vector 2022 on Wikidata - https://phabricator.wikimedia.org/T324148 [11:18:35] looks like it’s working without mwdebug now [11:18:36] (03CR) 10Elukey: [C: 03+1] Rewrite as kubernetes operator/controller (035 comments) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [11:18:36] * Lucas_WMDE done [11:18:50] (03PS1) 10Jbond: do not merge: test CI [puppet] - 10https://gerrit.wikimedia.org/r/862848 (https://phabricator.wikimedia.org/T95377) [11:19:26] (03CR) 10CI reject: [V: 04-1] do not merge: test CI [puppet] - 10https://gerrit.wikimedia.org/r/862848 (https://phabricator.wikimedia.org/T95377) (owner: 10Jbond) [11:20:43] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [11:20:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-debug: apply [11:21:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:21:28] (03CR) 10Elukey: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [11:21:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:21:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10jbond) >>! In T95377#8435033, @jbond wrote: >> Or... does the fact that we haven't done it since 2015 just show... [11:25:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:25:57] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) I can confirm that they've added HSTS support and stopped serving traffic in port 80 and redirect it to port 443: ` $ curl -I l... 
[11:26:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kube-env: Move environments and services to config [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert) [11:26:28] (03PS5) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) [11:27:04] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kube-env: Move environments and services to config [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert) [11:30:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42163 and previous config saved to /var/cache/conftool/dbconfig/20221201-113049-ladsgroup.json [11:32:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [11:35:51] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:37:18] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10BTullis) @cmooney - This looks very useful. We can certainly look at using `osd_heartbeat_use_min_delay_socket=true` from the outset... 
[11:37:55] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:38:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:40:37] _joe_: I think this is known right? anything to do about it? ^^^ [11:41:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [11:42:29] (03PS1) 10Hnowlan: api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) [11:42:46] (03CR) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [11:43:30] <_joe_> volans: the latency? yes, you can blame jayme [11:43:37] yes [11:43:39] <_joe_> it's his software causing it [11:43:44] :D [11:44:08] (03CR) 10Vgutierrez: "dstat --varnish-hit currently works as expected. Ema took care of varnish-be missing in 6c89146a832b0290f00de9123e8531dd6e71b600. 
Do we ha" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [11:44:48] (03Abandoned) 10Jbond: do not merge: test CI [puppet] - 10https://gerrit.wikimedia.org/r/862848 (https://phabricator.wikimedia.org/T95377) (owner: 10Jbond) [11:45:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42164 and previous config saved to /var/cache/conftool/dbconfig/20221201-114555-ladsgroup.json [11:46:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [11:47:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [11:48:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:50:01] (03CR) 10Vgutierrez: "on the other hand, dstat --varnishstat is currently broken and it varnish-be references to be removed to fix it" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [11:50:05] 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) [11:50:45] (03CR) 10FNegri: WIP: idea for cloud cumin::target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [11:50:53] (03CR) 10Klausman: [C: 03+1] api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [11:51:29] (03CR) 10FNegri: [C: 03+1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [11:53:31] volans: You can ack it for another week if it's too noisy (the LIST services latency) [11:53:41] (03CR) 10FNegri: "Can you add an example of a use case that is fixed by this patch?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [11:53:47] Or I can do it idc [11:54:44] Heh, it's silenced but only for eqiad [11:54:46] Fixing [11:55:45] Done, I bumped the silence another 24h and removed the site matcher [11:55:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [11:56:13] (03CR) 10Vgutierrez: "and as a side effect if this CR gets merged, T277910 could be closed" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [11:57:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [11:57:34] (03PS1) 10Jbond: CI - puppet-lint: Add puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862855 (https://phabricator.wikimedia.org/T127797) [11:57:36] (03PS1) 10Jbond: do not merge: test puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862856 (https://phabricator.wikimedia.org/T127797) [11:57:38] (03PS1) 10Jbond: do not merge: CI should no longer complain [puppet] - 10https://gerrit.wikimedia.org/r/862857 [11:58:45] (03CR) 10CI reject: [V: 04-1] do not merge: test puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862856 (https://phabricator.wikimedia.org/T127797) (owner: 10Jbond) [11:59:12] (03CR) 10CI reject: [V: 04-1] do not merge: CI should no longer complain [puppet] - 10https://gerrit.wikimedia.org/r/862857 (owner: 10Jbond) [12:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318605)', diff saved to https://phabricator.wikimedia.org/P42165 and previous config 
saved to /var/cache/conftool/dbconfig/20221201-120102-ladsgroup.json [12:01:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:01:06] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:01:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:01:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond) I would be +1 for joe's suggestion. The above patches add this and show what the output looks like. > This way we don't force... [12:02:40] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10serviceops: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10LSobanski) [12:03:28] (03CR) 10Muehlenhoff: [C: 03+2] postgresql: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/862260 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:04:01] 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Nahid) Apologies for the trouble. I forgot that I needed two keys. Here's the new one: ` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC0YqFoY0FU966/s5yB2JSzP3II2kGfSH5sMLfa5xrt... 
[12:07:57] (03PS1) 10Marostegui: data.yaml: Replace nahidunlimited key [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) [12:09:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) a:05Nahid→03Marostegui [12:11:39] (03CR) 10Volans: [C: 03+1] "Syntax LGTM, I trust you got the key via a secure method ;)" [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) (owner: 10Marostegui) [12:11:56] (03CR) 10Clément Goubert: [C: 03+1] data.yaml: Replace nahidunlimited key [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) (owner: 10Marostegui) [12:13:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42166 and previous config saved to /var/cache/conftool/dbconfig/20221201-121301-ladsgroup.json [12:13:05] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:14:53] (03CR) 10Marostegui: [C: 03+2] "It is now verified" [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) (owner: 10Marostegui) [12:15:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) 05Open→03Resolved ssh key verified and replaced. Please allow 30-60 minutes for it to totally spread across the fleet. 
[12:24:57] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:31] ACKNOWLEDGEMENT - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Btullis T323783 - host being brought into service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42167 and previous config saved to /var/cache/conftool/dbconfig/20221201-122807-ladsgroup.json [12:29:44] (03PS5) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) [12:30:02] (03CR) 10Muehlenhoff: [C: 03+2] Add Hiera settings for second bookworm puppetdb pair [puppet] - 10https://gerrit.wikimedia.org/r/862256 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:30:15] (03CR) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [12:31:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [12:34:09] (03PS4) 10Jbond: cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) [12:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323907)', diff saved to https://phabricator.wikimedia.org/P42168 and previous config saved to 
/var/cache/conftool/dbconfig/20221201-123430-ladsgroup.json [12:34:35] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:34:48] (03PS5) 10Jbond: cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) [12:35:29] (03CR) 10Jbond: cumin::target: idea for cloud cumin::target (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [12:36:58] (03CR) 10CI reject: [V: 04-1] cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [12:37:29] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10MoritzMuehlenhoff) ganeti5004 has been added to the eqsin Ganeti cluster. [12:39:00] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [12:40:33] (03CR) 10Thiemo Kreuz (WMDE): "I'm not able to do an actual review on this, sorry. Just some questions. This is meant to be manually executed, right? 
How does one know w" [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos) [12:42:54] (03Abandoned) 10Hnowlan: thumbor: correct tinyrgb path [deployment-charts] - 10https://gerrit.wikimedia.org/r/860628 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan) [12:43:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42169 and previous config saved to /var/cache/conftool/dbconfig/20221201-124314-ladsgroup.json [12:43:42] !log installing glibc security updates on buster [12:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:19] (03Merged) 10jenkins-bot: api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [12:45:48] (03CR) 10Raymond Ndibe: [C: 03+2] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [12:46:56] PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604896 seconds, message: jmm test, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:47:39] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [12:47:57] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [12:48:16] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [12:48:45] (03CR) 10CI reject: [V: 04-1] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 
(https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [12:48:47] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [12:49:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42170 and previous config saved to /var/cache/conftool/dbconfig/20221201-124936-ladsgroup.json [12:49:40] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [12:50:01] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [12:50:22] (03CR) 10Filippo Giunchedi: "Very cool! Thank you for the quick action on this" [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert) [12:50:26] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [12:50:30] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [12:52:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:53:07] (03PS1) 10Muehlenhoff: superset: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862883 (https://phabricator.wikimedia.org/T135991) [12:55:09] (03PS2) 10Jbond: idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328 [12:55:47] (03PS1) 10Muehlenhoff: hue: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862885 (https://phabricator.wikimedia.org/T135991) [12:57:16] (03CR) 10CI reject: [V: 04-1] idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328 (owner: 10Jbond) [12:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1144:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42171 and previous config saved to /var/cache/conftool/dbconfig/20221201-125821-ladsgroup.json [12:58:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:58:25] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:58:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:00:47] (03PS1) 10Muehlenhoff: yarn: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862886 (https://phabricator.wikimedia.org/T135991) [13:04:19] (03CR) 10FNegri: "Thanks John, looks good to me! Are you gonna create the private key and store it in the private repo, or do you want me to do it?" [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [13:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42172 and previous config saved to /var/cache/conftool/dbconfig/20221201-130443-ladsgroup.json [13:06:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) >>! In T324101#8432262, @Ottomata wrote: >> Reason for access: need query search usage via jupyter for Structured Data pipelines > I'm... 
[13:06:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) [13:06:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) [13:07:19] (03PS3) 10Jbond: idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328 [13:09:11] (03CR) 10Jbond: cumin::target: idea for cloud cumin::target (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [13:09:17] (03CR) 10Jbond: [C: 04-1] cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [13:14:36] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 148 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:16:36] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Ottomata) Ah okay, approved for analytics-privatedata-users and ssh and kerberos then. You can use this ticket for the Kerberos access too. 
[13:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323907)', diff saved to https://phabricator.wikimedia.org/P42174 and previous config saved to /var/cache/conftool/dbconfig/20221201-131950-ladsgroup.json [13:19:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [13:19:54] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:19:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [13:20:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42175 and previous config saved to /var/cache/conftool/dbconfig/20221201-132000-ladsgroup.json [13:24:20] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10Ottomata) [13:28:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:28:54] (03PS1) 10Jaime Nuche: scap.cfg: enable K8s deployments in prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/862892 [13:28:57] (03Abandoned) 10Dzahn: scap: move firewall rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [13:30:37] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust DNS for LVS eqsin. 
- cmooney@cumin1001" [13:38:44] (03PS1) 10Volans: Revert "setup.py: add temporary upper limit for pylint" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 [13:39:57] (03CR) 10CI reject: [V: 04-1] Revert "setup.py: add temporary upper limit for pylint" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 (owner: 10Volans) [13:43:39] (03CR) 10Jbond: [C: 03+2] idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328 (owner: 10Jbond) [13:43:56] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [13:49:21] 10SRE-Access-Requests, 10DBA: mariadb: grant user 'phstats' additional select on phabricator_search db - https://phabricator.wikimedia.org/T324205 (10Aklapper) p:05Triage→03Low [13:50:16] (03PS1) 10Aklapper: mariadb: grant user 'phstats' additional select on phabricator_search db [puppet] - 10https://gerrit.wikimedia.org/r/862895 (https://phabricator.wikimedia.org/T324205) [13:53:26] (03PS3) 10Elukey: ml-services: enable multi-processing for ml-staging editquality-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [13:58:51] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging editquality-goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [13:58:58] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging editquality-goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [13:59:18] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) [13:59:27] (03PS8) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [13:59:33] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686) [13:59:36] (03PS7) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I ๏ฟฝ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1400). [14:00:04] Sohom_Datta and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust DNS for LVS eqsin. 
- cmooney@cumin1001" [14:00:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:21] hi [14:00:42] (03PS5) 10Eevans: Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis) [14:01:08] (03CR) 10Volans: "And ofc there is a new issue with the new release, opened https://github.com/PyCQA/prospector/issues/545 upstream" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 (owner: 10Volans) [14:01:17] I can self-serve my deploys [14:01:28] sure, go ahead [14:01:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862355 (https://phabricator.wikimedia.org/T324188) (owner: 10Kosta Harlan) [14:12:40] (03CR) 10Eevans: [C: 03+2] Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis) [14:12:50] (03PS2) 10BBlack: docker_registry_ha: remove unused cache::nodes ref [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) [14:12:53] (03PS1) 10BBlack: Remove some old utility files [puppet] - 10https://gerrit.wikimedia.org/r/862903 [14:13:19] (03PS2) 10BBlack: Remove some old utility files [puppet] - 10https://gerrit.wikimedia.org/r/862903 [14:13:32] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10EChetty) [14:14:54] (03CR) 10BBlack: [C: 03+2] Remove some old utility files [puppet] - 10https://gerrit.wikimedia.org/r/862903 (owner: 10BBlack) [14:15:53] (03Abandoned) 10BBlack: cipher_sim.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670985 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) 
[14:16:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) @Papaul they are R440 we are missing configG in netbox [14:18:17] (03PS1) 10Jbond: P:idp::standalone: update requirements and add second vhost [puppet] - 10https://gerrit.wikimedia.org/r/862927 [14:19:16] 10Puppet, 10SRE, 10Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10MoritzMuehlenhoff) [14:19:55] 10Puppet, 10SRE, 10Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10MoritzMuehlenhoff) [14:20:01] (03Merged) 10jenkins-bot: DatabaseUserImpactStore: Fix parameter style for upsert keys [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862355 (https://phabricator.wikimedia.org/T324188) (owner: 10Kosta Harlan) [14:20:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:20:29] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862355|DatabaseUserImpactStore: Fix parameter style for upsert keys (T324188)]] [14:20:32] T324188: Wikimedia\Rdbms\Platform\SQLPlatform::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays - https://phabricator.wikimedia.org/T324188 [14:20:37] (03PS6) 10Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 [14:21:38] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862355|DatabaseUserImpactStore: Fix parameter style for upsert keys (T324188)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:21:46] PROBLEM - 
mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:15] reviewing on mwdebug1002 [14:22:22] (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 (owner: 10Giuseppe Lavagetto) [14:22:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:23:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10EChetty) [14:23:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:23:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:24:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:24:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10EChetty) Merging the Kerberos request into this ticket. 
[14:25:08] (03CR) 10Jbond: [C: 03+2] P:idp::standalone: update requirements and add second vhost [puppet] - 10https://gerrit.wikimedia.org/r/862927 (owner: 10Jbond) [14:25:50] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [14:26:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:27:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:27:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:27:20] (03CR) 10JMeybohm: [C: 03+1] admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:27:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:27:32] (03PS7) 10Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 [14:27:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T323907)', diff saved to https://phabricator.wikimedia.org/P42176 and previous config saved to /var/cache/conftool/dbconfig/20221201-142735-ladsgroup.json [14:27:39] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [14:27:54] !log kharlan@deploy1002 Finished scap: Backport 
for [[gerrit:862355|DatabaseUserImpactStore: Fix parameter style for upsert keys (T324188)]] (duration: 07m 25s) [14:27:57] T324188: Wikimedia\Rdbms\Platform\SQLPlatform::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays - https://phabricator.wikimedia.org/T324188 [14:28:10] alright, on to the next one [14:28:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: 10Gergő Tisza) [14:28:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:28:51] (03PS2) 10Kosta Harlan: [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: 10Gergő Tisza) [14:28:55] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: 10Gergő Tisza) [14:29:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:29:28] (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 (owner: 10Giuseppe Lavagetto) [14:29:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:29:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [14:29:50] (03Merged) 10jenkins-bot: [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: 10Gergő Tisza) [14:29:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:29:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:29:58] (03CR) 10David Caro: [C: 03+2] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [14:30:13] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:861506|[no-op] GrowthExperiments: Enable D3 in production (T318854)]] [14:30:16] T318854: Application Security Review Request : d3.js - https://phabricator.wikimedia.org/T318854 [14:30:16] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:30:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:31:19] !log kharlan@deploy1002 kharlan and tgr: Backport for [[gerrit:861506|[no-op] GrowthExperiments: Enable D3 in production (T318854)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:32:07] (03PS3) 10Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) [14:32:37] syncing [14:32:44] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move traffic rules to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) (owner: 
10Filippo Giunchedi) [14:33:39] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [14:35:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:36:17] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:861506|[no-op] GrowthExperiments: Enable D3 in production (T318854)]] (duration: 06m 04s) [14:36:20] T318854: Application Security Review Request : d3.js - https://phabricator.wikimedia.org/T318854 [14:36:20] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:36:27] on to the last one [14:36:57] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991) [14:37:31] (03PS4) 10Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) [14:37:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [14:38:53] (03Merged) 10jenkins-bot: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [14:39:15] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862839|GrowthExperiments: Enable new impact module on testwiki (T323526)]] [14:39:19] T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526 [14:39:39] (03PS1) 10Filippo Giunchedi: kubernetes: output envs/services separated by newlines [puppet] - 10https://gerrit.wikimedia.org/r/862935 
(https://phabricator.wikimedia.org/T324091) [14:40:23] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862839|GrowthExperiments: Enable new impact module on testwiki (T323526)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:41:10] (03PS1) 10Filippo Giunchedi: hiera: replace thanos-sso with thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) [14:41:12] syncing [14:41:49] (03CR) 10CI reject: [V: 04-1] hiera: replace thanos-sso with thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [14:42:06] !log add BGP sessions to RIPE RIS in drmrs [14:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:34] (03PS2) 10Filippo Giunchedi: hiera: replace thanos-sso with thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) [14:42:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:42:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:42:48] (03PS1) 10Clément Goubert: kube-env: Add completion for alias [puppet] - 10https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091) [14:43:43] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: output envs/services separated by newlines [puppet] - 10https://gerrit.wikimedia.org/r/862935 (https://phabricator.wikimedia.org/T324091) (owner: 10Filippo Giunchedi) [14:44:16] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [14:44:27] (03PS1) 10Filippo Giunchedi: wmnet: remove thanos-sso [dns] - 10https://gerrit.wikimedia.org/r/862939 (https://phabricator.wikimedia.org/T323913) [14:44:33] (03CR) 10Filippo Giunchedi: 
[C: 03+2] kubernetes: output envs/services separated by newlines [puppet] - 10https://gerrit.wikimedia.org/r/862935 (https://phabricator.wikimedia.org/T324091) (owner: 10Filippo Giunchedi) [14:44:55] jbond: merged your changes to labs/private [14:45:02] * godog high fives claime [14:45:17] godog: thanks [14:45:28] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:862839|GrowthExperiments: Enable new impact module on testwiki (T323526)]] (duration: 06m 12s) [14:45:31] T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526 [14:45:44] (03PS2) 10Clément Goubert: kube-env: Add completion for alias [puppet] - 10https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091) [14:46:01] (03PS4) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [14:46:04] * claime high fives godog [14:46:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:46:20] (03CR) 10Filippo Giunchedi: [C: 03+1] kube-env: Add completion for alias [puppet] - 10https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert) [14:46:44] I don't think Sohom_Datta is around, so skipping their patch [14:46:51] unless Lucas_WMDE you think we should do it? [14:47:01] I haven’t looked at the patch yet [14:47:30] Lucas_WMDE: it has a +1 from Jdlrobson [14:48:17] hi Sohom_Datta, we were just discussing your patch [14:48:20] hi Sohom_Datta! 
[14:48:27] Hi [14:48:28] (03CR) 10Clément Goubert: [C: 03+2] kube-env: Add completion for alias [puppet] - 10https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert) [14:48:53] (03CR) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [14:49:01] kostajh: do you want to deploy it or should I? [14:49:07] Lucas_WMDE: are you able to take over? [14:49:12] I can, sure [14:49:16] (03PS1) 10Jbond: idp::standalon: Add OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999) [14:49:23] Lucas_WMDE: then, I am off to the kindergarten. danke! [14:49:26] ok! [14:50:12] (03PS2) 10Lucas Werkmeister (WMDE): Enable limited width on plwikisource MAIN namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: 10Sohom Datta) [14:50:16] (03PS1) 10Ssingh: lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) [14:50:24] !log installing krb5 security updates [14:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "diffConfig looks good to me (effectively removes ns0 from the setting)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: 10Sohom Datta) [14:50:44] (03PS34) 10Andrew Bogott: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:51:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply 
[14:51:22] (03PS1) 10Ssingh: sites.yaml: add lvs5004 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/862944 (https://phabricator.wikimedia.org/T322048) [14:51:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: 10Sohom Datta) [14:52:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:52:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:52:17] (03Merged) 10jenkins-bot: Enable limited width on plwikisource MAIN namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: 10Sohom Datta) [14:52:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:52:42] (03PS1) 10Volans: setup.py: temporary upper limit to prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/862945 [14:52:46] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:861431|Enable limited width on plwikisource MAIN namespace (T323185)]] [14:52:49] T323185: Enabled limited width preference disables limited width in Wikisource main namespace - https://phabricator.wikimedia.org/T323185 [14:53:12] (03PS1) 10Ssingh: lvs5001: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) [14:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42177 and previous config saved to /var/cache/conftool/dbconfig/20221201-145337-ladsgroup.json [14:53:40] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [14:53:53] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and soda: Backport for [[gerrit:861431|Enable 
limited width on plwikisource MAIN namespace (T323185)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:53:59] (03PS2) 10Ssingh: lvs5001: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) [14:54:31] Sohom_Datta: the change should be on mwdebug, can you test it? [14:54:35] (03CR) 10Volans: [C: 03+2] "unblocking CI" [cookbooks] - 10https://gerrit.wikimedia.org/r/862945 (owner: 10Volans) [14:54:45] yeah, checking :) [14:54:49] (03PS5) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [14:54:52] ok :) [14:54:53] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38549/console" [puppet] - 10https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [14:55:45] Yep works fine :) [14:56:16] (03Merged) 10jenkins-bot: setup.py: temporary upper limit to prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/862945 (owner: 10Volans) [14:56:19] ok \o/ [14:56:28] (03PS8) 10Volans: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 (owner: 10Giuseppe Lavagetto) [14:57:43] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) Sorry for lack of update. I did dig into the Gerrit cache which are backed up by H2 Database. Some... 
[14:57:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:58:08] (03PS2) 10Jbond: idp::standalon: Add OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999) [14:58:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:58:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:59:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 25): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38548/console" [puppet] - 10https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [14:59:24] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [14:59:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:00:25] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [15:00:53] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:861431|Enable limited width on plwikisource MAIN namespace (T323185)]] (duration: 08m 06s) [15:00:56] T323185: Enabled limited width preference disables limited width in Wikisource main namespace - https://phabricator.wikimedia.org/T323185 [15:01:03] (03Merged) 10jenkins-bot: sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [15:01:30] !log UTC afternoon backport+config window done [15:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:40] only a minute 
over time ;) [15:01:41] jouncebot: now [15:01:41] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [15:01:45] ok :) [15:02:36] (03CR) 10Jbond: [C: 03+2] idp::standalon: Add OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [15:02:46] 10SRE, 10Cassandra, 10RESTBase-Cassandra: setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346 (10LSobanski) [15:07:23] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [15:07:52] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:08:43] (03CR) 10Vgutierrez: [C: 03+1] lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:08:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42178 and previous config saved to /var/cache/conftool/dbconfig/20221201-150843-ladsgroup.json [15:09:22] (03CR) 10Vgutierrez: [C: 03+1] lvs5001: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [15:10:06] (03PS6) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [15:10:34] !log homer "cr*-eqsin*" commit "running homer for Gerrit: 862321" [15:10:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:41] (03PS1) 10Jbond: idp::standalon: use production value for oidc_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/862948 [15:11:56] !log [done] homer "cr*-eqsin*" commit "running homer for Gerrit: 862321" [15:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:26] !log php7.4 upgrade + apache upgrade + rolling restarts of app servers - T323358 [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:31] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) @Jelto thank you for the excellent analysis about monitoring. I will look at integrating that to th... [15:13:51] (03CR) 10Jbond: [C: 03+2] idp::standalon: use production value for oidc_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/862948 (owner: 10Jbond) [15:19:20] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42179 and previous config saved to /var/cache/conftool/dbconfig/20221201-152350-ladsgroup.json [15:24:07] (03PS7) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [15:25:10] (03PS1) 10Jbond: idp::standalone: correct name of local_settings [puppet] - 10https://gerrit.wikimedia.org/r/862950 [15:25:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp::standalone: correct name of local_settings [puppet] - 10https://gerrit.wikimedia.org/r/862950 (owner: 10Jbond) [15:26:45] (03CR) 10Ssingh: 
[C: 03+2] hiera: decommission dns5001 [puppet] - 10https://gerrit.wikimedia.org/r/862316 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [15:26:55] (03PS2) 10Ssingh: hiera: decommission dns5001 [puppet] - 10https://gerrit.wikimedia.org/r/862316 (https://phabricator.wikimedia.org/T323830) [15:27:04] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [15:28:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns5001.wikimedia.org [15:30:53] (03PS8) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [15:31:00] 10SRE, 10ops-codfw, 10decommission-hardware, 10User-fgiunchedi: decommission graphite2003.codfw.wmnet - https://phabricator.wikimedia.org/T323718 (10Papaul) 05Open→03Resolved a:03Papaul Complete [15:31:19] (03CR) 10Jdrewniak: [C: 03+1] mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) (owner: 10Krinkle) [15:31:23] (03PS3) 10Marostegui: Add mlitn to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: 10Matthias Mullie) [15:34:06] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:34:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Marostegui) @matthiasmullie do you want to also add yourself to `analytics-privatedata-users` in the gerrit patch? Once done I can merge and add the k... 
[15:35:11] (03CR) 10Marostegui: [C: 03+2] mariadb: grant user 'phstats' additional select on phabricator_search db [puppet] - 10https://gerrit.wikimedia.org/r/862895 (https://phabricator.wikimedia.org/T324205) (owner: 10Aklapper) [15:35:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:26] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [15:36:31] 10SRE, 10SRE-Access-Requests, 10DBA, 10Patch-For-Review: mariadb: grant user 'phstats' additional select on phabricator_search db - https://phabricator.wikimedia.org/T324205 (10Marostegui) 05Open→03Resolved a:03Marostegui Merged and applied the grants. Please test it and reopen if it is not working!... [15:37:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054'] [15:38:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [15:38:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns5001.wikimedia.org [15:38:21] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns5001.wikimedia.org` - dns5001.wikimedia.... 
[15:38:23] (03PS1) 10Jbond: idp::standalone: config is not a hash [puppet] - 10https://gerrit.wikimedia.org/r/862954 [15:38:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42180 and previous config saved to /var/cache/conftool/dbconfig/20221201-153856-ladsgroup.json [15:38:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:39:00] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:39:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:39:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T323907)', diff saved to https://phabricator.wikimedia.org/P42181 and previous config saved to /var/cache/conftool/dbconfig/20221201-153918-ladsgroup.json [15:39:21] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [15:40:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:11] (03CR) 10Jbond: [C: 03+2] idp::standalone: config is not a hash [puppet] - 10https://gerrit.wikimedia.org/r/862954 (owner: 10Jbond) [15:41:39] !log php7.4 upgrade + apache upgrade + rolling restarts of api servers - T323358 [15:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:25] (03CR) 10Vgutierrez: [C: 03+1] P:cache::haproxy: harden systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [15:45:44] !log 
pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1055'] [15:46:25] (03PS1) 10Jbond: idp::standlone: corrcet file name [puppet] - 10https://gerrit.wikimedia.org/r/862957 [15:46:56] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [15:47:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1054'] [15:48:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp::standlone: corrcet file name [puppet] - 10https://gerrit.wikimedia.org/r/862957 (owner: 10Jbond) [15:48:22] (03PS9) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [15:50:31] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054'] [15:51:26] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [15:52:09] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [15:52:38] (03CR) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [15:53:19] 10SRE, 
10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) I am already part of `analytics-privatedata-users` (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/862245/3/modules/admin/da... [15:53:39] (03PS10) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [15:54:49] (03PS1) 10Papaul: Add new cloudvirt node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/862958 (https://phabricator.wikimedia.org/T313983) [15:55:34] (03CR) 10Papaul: [C: 03+2] Add new cloudvirt node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/862958 (https://phabricator.wikimedia.org/T313983) (owner: 10Papaul) [15:56:13] (03CR) 10Marostegui: [C: 03+2] Add mlitn to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: 10Matthias Mullie) [15:57:16] !log php7.4 upgrade + apache upgrade + rolling restarts of jobrunners/videoscalers servers - T323358 [15:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Marostegui) [15:57:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1056'] [15:58:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1055'] [15:58:16] (03PS11) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [15:59:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1161 (T323907)', diff saved to https://phabricator.wikimedia.org/P42182 and previous config saved to /var/cache/conftool/dbconfig/20221201-155917-ladsgroup.json [15:59:22] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:59:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Marostegui) 05Open→03Resolved a:03Marostegui I have merged your patch. Also, you should've gotten an email about your kerberos principal. Please... [16:00:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1055'] [16:00:53] !log php7.4 upgrade + apache upgrade + rolling restarts of parsoid servers - T323358 [16:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1054'] [16:07:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye [16:07:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1054.eqiad.wmn... 
[16:08:17] (03PS12) 10Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [16:09:10] (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging model servers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [16:10:28] (03PS7) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:12:11] (03CR) 10CI reject: [V: 04-1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [16:13:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1056'] [16:13:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1055'] [16:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42183 and previous config saved to /var/cache/conftool/dbconfig/20221201-161424-ladsgroup.json [16:15:20] (03CR) 10Elukey: [C: 03+2] ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [16:17:26] (03PS1) 10Jbond: idp::standalone: correct permissions [puppet] - 10https://gerrit.wikimedia.org/r/862967 [16:19:28] (03CR) 10Papaul: [C: 03+2] Add partman rules for cloudvirt10[54-61] [puppet] - 10https://gerrit.wikimedia.org/r/862406 (https://phabricator.wikimedia.org/T313983) (owner: 10Andrew Bogott) [16:19:42] (03PS2) 10Papaul: Add partman rules for cloudvirt10[54-61] [puppet] - 
10https://gerrit.wikimedia.org/r/862406 (https://phabricator.wikimedia.org/T313983) (owner: 10Andrew Bogott) [16:20:45] (03CR) 10Jbond: [C: 03+2] idp::standalone: correct permissions [puppet] - 10https://gerrit.wikimedia.org/r/862967 (owner: 10Jbond) [16:21:21] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (10Krinkle) [16:22:18] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (10Krinkle) I've boldly updated the task description to suggest targetting instead... [16:25:08] (03PS8) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:26:04] (03CR) 10CI reject: [V: 04-1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [16:26:29] (03PS9) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:26:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1056'] [16:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323907)', diff saved to https://phabricator.wikimedia.org/P42184 and previous config saved to /var/cache/conftool/dbconfig/20221201-162815-ladsgroup.json [16:28:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bullseye [16:28:19] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [16:28:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet 
- https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1055.eqiad.wmn... [16:29:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42185 and previous config saved to /var/cache/conftool/dbconfig/20221201-162930-ladsgroup.json [16:30:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [16:31:40] (03PS10) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:32:37] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) a:05Papaul→03None [16:34:08] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1057'] [16:36:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [16:36:30] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [16:37:46] RECOVERY - mediawiki-installation DSH group on mw1307 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:56] 10SRE, 10SRE-Access-Requests, 10DBA: mariadb: grant user 'phstats' additional select on phabricator_search db - https://phabricator.wikimedia.org/T324205 (10Aklapper) Works. Thank you! 
[16:38:59] (03PS11) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:39:10] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Rewrite as kubernetes operator/controller [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [16:39:19] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] update vendor [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [16:39:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [16:40:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [16:41:48] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] helm-state-metrics: Update to v0.2.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/862173 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [16:42:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1056'] [16:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42187 and previous config saved to /var/cache/conftool/dbconfig/20221201-164322-ladsgroup.json [16:43:35] !log installing ini4j security updates [16:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [16:44:32] (03PS12) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:44:38] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1161 (T323907)', diff saved to https://phabricator.wikimedia.org/P42188 and previous config saved to /var/cache/conftool/dbconfig/20221201-164437-ladsgroup.json [16:44:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [16:44:44] (03CR) 10JMeybohm: [C: 03+2] helm-state-metrics: Update to v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/862174 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [16:44:45] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [16:45:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [16:45:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T323907)', diff saved to https://phabricator.wikimedia.org/P42189 and previous config saved to /var/cache/conftool/dbconfig/20221201-164509-ladsgroup.json [16:45:30] (03CR) 10Ssingh: [V: 03+1] "Thank you both for the review." [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [16:45:44] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:cache::haproxy: harden systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [16:46:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10aborrero) note, because {T319184} these hosts only use 1 single network interface. It should be the default in puppet. We need a particular sw... 
[16:46:41] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:47:42] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https:/ [16:47:42] h.wikimedia.org/wiki/Wikifeeds [16:48:37] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5004 fix - robh@cumin2002" [16:48:44] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:49:14] (03Merged) 10jenkins-bot: helm-state-metrics: Update to v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/862174 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [16:49:42] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5004 fix - robh@cumin2002" [16:49:42] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1057'] [16:50:31] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns5004 [16:50:56] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns5004 [16:51:24] 10Puppet, 10SRE, 10Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10MoritzMuehlenhoff) [16:52:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: 
Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:53:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bullseye [16:53:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1054.eqiad.wmnet with OS bullseye comple... [16:55:59] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:56:38] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:57:22] (03PS1) 10Vivian Rook: k8s [puppet] - 10https://gerrit.wikimedia.org/r/862992 [16:58:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42190 and previous config saved to /var/cache/conftool/dbconfig/20221201-165828-ladsgroup.json [16:58:42] (03Abandoned) 10Vivian Rook: k8s [puppet] - 10https://gerrit.wikimedia.org/r/862992 (owner: 10Vivian Rook) [16:58:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bullseye [16:59:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1055.eqiad.wmnet with OS bullseye comple... [16:59:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1057'] [17:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1700). [17:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:35] tgr_: hi, looking [17:01:08] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:01:29] * jbond steps away unless needed [17:01:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bullseye [17:02:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1056.eqiad.wmnet with OS bullseye [17:02:52] tgr_: looks straightforward, running pcc as a formality then I'll merge :) will you want me to kick off a test run of any of these? [17:02:54] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[17:03:53] rzl: no, thanks, we'll do a test run later today when wmf.12 is live [17:03:56] 👍 [17:04:06] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38550/console" [puppet] - 10https://gerrit.wikimedia.org/r/861964 (https://phabricator.wikimedia.org/T323958) (owner: 10Gergő Tisza) [17:04:39] (03CR) 10RLazarus: [V: 03+1 C: 03+2] growthexperiments: Use min edit limit for user impact refresh [puppet] - 10https://gerrit.wikimedia.org/r/861964 (https://phabricator.wikimedia.org/T323958) (owner: 10Gergő Tisza) [17:05:23] (03PS1) 10Vivian Rook: aptrepo: add thirdparty/kubeadm-k8s-1-2[34] [puppet] - 10https://gerrit.wikimedia.org/r/862994 [17:07:21] (03PS1) 10Ssingh: dns5004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/862996 (https://phabricator.wikimedia.org/T322048) [17:07:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1058'] [17:08:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1059'] [17:08:13] all set [17:08:52] (03PS1) 10Ssingh: sites.yaml: add dns5004 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/862998 (https://phabricator.wikimedia.org/T322048) [17:11:28] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 199 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:13:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:13:28] (03CR) 10Ssingh: [C: 03+2] dns5004: add 
Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/862996 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [17:13:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323907)', diff saved to https://phabricator.wikimedia.org/P42191 and previous config saved to /var/cache/conftool/dbconfig/20221201-171335-ladsgroup.json [17:13:40] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [17:14:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [17:14:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS buster [17:15:05] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster [17:16:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [17:18:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [17:21:47] (03PS1) 10Volans: setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 [17:21:49] (03PS1) 10Volans: spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) [17:22:33] (03PS1) 10Jbond: hiera: add oidc endpoint to apero_cas global [puppet] - 10https://gerrit.wikimedia.org/r/863005 [17:22:35] (03PS1) 10Jbond: apereo_cas: add OidcRegisteredService service support [puppet] - 10https://gerrit.wikimedia.org/r/863006 [17:23:38] (03PS2) 10Jbond: hiera: add oidc endpoint to apero_cas global 
[puppet] - 10https://gerrit.wikimedia.org/r/863005 [17:23:57] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/863005 (owner: 10Jbond) [17:24:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1059'] [17:24:32] (03CR) 10Jbond: [C: 03+2] hiera: add oidc endpoint to apero_cas global [puppet] - 10https://gerrit.wikimedia.org/r/863005 (owner: 10Jbond) [17:25:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1058'] [17:26:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1058'] [17:26:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323907)', diff saved to https://phabricator.wikimedia.org/P42192 and previous config saved to /var/cache/conftool/dbconfig/20221201-172634-ladsgroup.json [17:26:37] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [17:27:09] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1059'] [17:29:12] (03PS1) 10Sharvaniharan: Merge branch 'master' into HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863008 [17:29:24] (03CR) 10David Caro: quota_increase: Fix issue with dashed quota names (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [17:29:30] (03CR) 10CI reject: [V: 04-1] Merge branch 'master' into HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863008 (owner: 10Sharvaniharan) [17:29:32] (03Abandoned) 10Sharvaniharan: Merge branch 'master' into HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863008 (owner: 10Sharvaniharan) [17:29:40] (03PS1) 10Papaul: Add new sretest codfw node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/863009 (https://phabricator.wikimedia.org/T322578) 
[17:30:25] (03PS3) 10David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 [17:31:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1057'] [17:32:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1056.eqiad.wmnet with OS bullseye [17:32:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1056.eqiad.wmnet with OS bullseye comple... [17:32:44] (03CR) 10David Caro: Revert "setup.py: add temporary upper limit for pylint" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 (owner: 10Volans) [17:33:25] (03CR) 10Volans: "This is my proposal for the external modules injection and the possibility to inject additional accessors into the Spicerack instance." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:33:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1057'] [17:33:40] (03CR) 10CI reject: [V: 04-1] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [17:33:52] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add dns5004 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/862998 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [17:34:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060'] [17:36:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1057'] [17:37:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [17:38:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bullseye [17:38:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye [17:40:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1059'] [17:40:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1058'] [17:41:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42193 and previous config saved to 
/var/cache/conftool/dbconfig/20221201-174140-ladsgroup.json [17:42:13] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10AnnWF) [17:42:15] (03PS1) 10Sharvaniharan: New configs for android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 [17:42:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1058.eqiad.wmnet with OS bullseye [17:42:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1058.eqiad.wmnet with OS bullseye [17:43:57] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10AnnWF) [17:44:38] (03CR) 10Sharvaniharan: "Hi @Ottomata @Bearloga" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan) [17:44:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS bullseye [17:44:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1059.eqiad.wmnet with OS bullseye [17:44:58] (03CR) 10BCornwall: varnish: Remove unused dstat plugins (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [17:45:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1060'] [17:46:19] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060'] [17:47:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=99) upgrade firmware for hosts ['cloudvirt1060'] [17:47:30] (03CR) 10CI reject: [V: 04-1] setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans) [17:47:33] (03CR) 10CI reject: [V: 04-1] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:47:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [17:47:41] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10daniel) Note that we only need active purging if/when we emit cache control headers that tell the edge case to cache long-term. One k... [17:50:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060'] [17:50:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1060'] [17:51:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [17:51:25] (03PS2) 10BCornwall: varnish: Remove unused dstat plugins [puppet] - 10https://gerrit.wikimedia.org/r/862371 [17:53:48] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38551/console" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [17:55:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [17:55:50] (03PS2) 10Ssingh: lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) [17:56:14] (03CR) 10Ssingh: "rebased, 
no code change" [puppet] - 10https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [17:56:34] (03CR) 10Bking: snapshot: Parallelize cirrus dumps by db shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [17:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42194 and previous config saved to /var/cache/conftool/dbconfig/20221201-175647-ladsgroup.json [17:57:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [17:57:42] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:57:52] (03CR) 10Ebernhardson: "o" [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [17:58:05] recursive DNS is me, will be fixed shortly [17:58:17] (reimaging in progress) [17:58:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [17:59:34] (03CR) 10Ssingh: [C: 03+2] lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [18:00:04] bd808: How many deployers does it take to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1800). [18:01:29] I will be deploying a new developer-portal build today. We've got some new translations and also have updated some of the static site generation libraries. 
[18:01:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [18:01:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5004.eqsin.wmnet with OS buster [18:01:45] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5004.eqsin.wmnet with OS buster [18:01:51] (03PS1) 10Hnowlan: api-gateway: define restbase match correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) [18:02:10] (03CR) 10Ahmon Dancy: [C: 03+1] scap.cfg: enable K8s deployments in prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/862892 (owner: 10Jaime Nuche) [18:02:39] (03CR) 10Bking: [C: 03+2] snapshot: Apply minor cleanups to cirrus dump script [puppet] - 10https://gerrit.wikimedia.org/r/856653 (owner: 10Ebernhardson) [18:04:06] (03CR) 10Bking: snapshot: Parallelize cirrus dumps by db shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [18:04:31] (03CR) 10Bking: [C: 03+2] snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [18:04:47] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-12-01-121802-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/863014 [18:04:52] (03PS7) 10Bking: snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [18:10:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1058.eqiad.wmnet with OS bullseye [18:10:16] 10SRE, 
10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1058.eqiad.wmnet with OS bullseye comple... [18:10:37] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-12-01-121802-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/863014 (owner: 10BryanDavis) [18:11:31] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060'] [18:11:39] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICALError response or zero answers: https://wikitech.wikimedia.org/wiki/DNS [18:11:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1060'] [18:11:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323907)', diff saved to https://phabricator.wikimedia.org/P42195 and previous config saved to /var/cache/conftool/dbconfig/20221201-181153-ladsgroup.json [18:11:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:11:57] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:12:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:12:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T323907)', diff saved to https://phabricator.wikimedia.org/P42196 and previous config saved to /var/cache/conftool/dbconfig/20221201-181215-ladsgroup.json [18:12:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS bullseye [18:12:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): 
Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1060.eqiad.wmnet with OS bullseye [18:14:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061'] [18:14:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:15:38] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-12-01-121802-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/863014 (owner: 10BryanDavis) [18:16:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS bullseye [18:16:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1059.eqiad.wmnet with OS bullseye comple... 
[18:16:45] (03CR) 10BPirkle: [C: 03+1] api-gateway: define restbase match correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: 10Hnowlan) [18:16:55] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:17:22] (03CR) 10Hnowlan: [C: 03+2] api-gateway: define restbase match correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: 10Hnowlan) [18:17:31] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:17:38] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:19:25] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:19:41] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:19:42] (03CR) 10RLazarus: [C: 03+1] api-gateway: define restbase match correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: 10Hnowlan) [18:21:16] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:23:42] (03Merged) 10jenkins-bot: api-gateway: define restbase match correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: 10Hnowlan) [18:25:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:25:18] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [18:25:35] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [18:26:49] 
!log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [18:27:11] jouncebot: nowandnext [18:27:11] For the next 0 hour(s) and 32 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1800) [18:27:11] In 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1900) [18:27:22] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [18:27:29] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [18:27:38] starting to decom mw[1307-1348], they'll be fully depooled when the train deploy starts, so no conflict [18:27:57] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [18:29:52] * bd808 is done with the tech engagement deploy window [18:30:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [18:30:34] rzl: Would it be possible to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/862892 now ? [18:31:02] dancy: sure but give me a few minutes on this decom first :) [18:31:16] ok.. no rush. Thank you! 
[18:31:39] (03CR) 10Herron: "sketching out this approach with a simple panel layout initially, interested in your notes" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [18:33:23] 10Puppet, 10Infrastructure-Foundations: Fix autorestart and debclient dependency - https://phabricator.wikimedia.org/T324229 (10jbond) p:05Triage→03Medium [18:34:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1057.eqiad.wmnet with OS bullseye [18:34:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye execut... [18:35:57] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [18:36:36] jhathaway, brett: meant to highlight you as oncall, fyi decomming mw[1307-1348] [18:36:58] thanks! 
[18:37:00] !log rzl@cumin2002 conftool action : set/pooled=no; selector: name=mw13(0[7-9]|[1-3]\d|4[0-8])\..* [18:37:05] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 197 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:38:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1061'] [18:38:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bullseye [18:38:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye [18:38:27] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [18:39:31] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:43:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:43:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061'] [18:46:35] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:51:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [18:51:42] (03CR) 10Ottomata: [C: 03+1] New configs for android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan) [18:53:49] (03CR) 10Ottomata: Add a new production images for spark and spark-operator (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [18:54:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [18:57:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323907)', 
diff saved to https://phabricator.wikimedia.org/P42197 and previous config saved to /var/cache/conftool/dbconfig/20221201-185742-ladsgroup.json [18:57:46] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:00:04] dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1900). [19:00:43] Hello! I'm going to deploy a new release of scap before rolling the train [19:01:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 364 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:01:45] !log dancy@deploy1002 Installing scap version "4.30.0" for 601 hosts [19:01:46] 10SRE, 10Dependency-Tracking, 10Wikibase-Quality-Constraints, 10Wikidata, and 2 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10Eevans) [19:02:17] !log dancy@deploy1002 Installation of scap version "4.30.0" completed for 601 hosts [19:02:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [19:02:59] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:05:55] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863020 (https://phabricator.wikimedia.org/T320517) [19:05:57] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.12 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/863020 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [19:06:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [19:06:39] (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863020 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [19:06:47] (03PS1) 10Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) [19:06:53] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) [19:07:18] (03CR) 10Hnowlan: "Sample output for dev change here: https://phabricator.wikimedia.org/P42198" [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [19:07:27] dancy: sorry, had to step away for a moment -- am I too late for that scap.cfg change to be helpful? [19:08:01] yeah, I specified the option manually for this train operation. 
You're good to deploy after train finishes (~3 minutes) [19:08:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:08:27] ah, apologies -- let me know when, and I'll go ahead [19:08:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:08:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:08:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:08:42] (03PS2) 10Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) [19:08:56] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8078353, @Cmjohnson wrote: > @Eevans take your time, I just want to make sure that we're not falling behind on-site. Let me know whenever you're ready... [19:09:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1057.eqiad.wmnet with OS bullseye [19:09:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye comple... 
[19:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42199 and previous config saved to /var/cache/conftool/dbconfig/20221201-191248-ladsgroup.json [19:13:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:14:02] won't merge this until the train is finished: [19:14:09] (03PS1) 10RLazarus: scap: Replace proxies that are being decommed [puppet] - 10https://gerrit.wikimedia.org/r/863022 (https://phabricator.wikimedia.org/T306162) [19:15:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1061'] [19:16:37] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.12 refs T320517 [19:16:40] T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517 [19:16:45] rzl: done! [19:18:28] (03PS1) 10Ssingh: hiera: add dns5004.yaml [puppet] - 10https://gerrit.wikimedia.org/r/863024 [19:19:55] (03CR) 10Ssingh: [C: 03+2] hiera: add dns5004.yaml [puppet] - 10https://gerrit.wikimedia.org/r/863024 (owner: 10Ssingh) [19:20:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:24] (03CR) 10RLazarus: [C: 03+2] scap.cfg: enable K8s deployments in prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/862892 (owner: 10Jaime Nuche) [19:20:52] dancy: merged, want a manual run anywhere? 
[19:20:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:21:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:21:07] I'll run sync-world now [19:21:21] oh of course :) 👍 [19:21:37] (03CR) 10Ahmon Dancy: "FYI I ran "touch /var/lib/deploy-mwdebug/pause" on deploy1002 and left the file in place." [puppet] - 10https://gerrit.wikimedia.org/r/862892 (owner: 10Jaime Nuche) [19:21:51] !log dancy@deploy1002 Started scap: testing k8s deployment [19:22:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS bullseye [19:22:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1060.eqiad.wmnet with OS bullseye comple... [19:25:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:25:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061'] [19:26:22] (03CR) 10RLazarus: [C: 03+2] scap: Replace proxies that are being decommed [puppet] - 10https://gerrit.wikimedia.org/r/863022 (https://phabricator.wikimedia.org/T306162) (owner: 10RLazarus) [19:27:16] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs5004.eqsin.wmnet with OS buster [19:27:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1061'] [19:27:30] 10SRE, 10ops-eqiad, 10DC-Ops, 
10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [19:27:34] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5004.eqsin.wmnet with OS buster executed with errors: - lvs5004 (... [19:27:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:27:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42200 and previous config saved to /var/cache/conftool/dbconfig/20221201-192755-ladsgroup.json [19:28:08] !log dancy@deploy1002 Finished scap: testing k8s deployment (duration: 06m 17s) [19:28:59] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061'] [19:35:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1061'] [19:37:32] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:38:18] !log pt1979@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061'] [19:38:46] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:39:40] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns5004.wikimedia.org with OS buster [19:39:50] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster executed with errors: - dns5004... [19:40:10] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 202 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:40:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS buster [19:40:24] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster [19:41:22] (03PS1) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 [19:41:30] !log gitlab2002 (gitlab-replica) - upgrading gitlab-ce [19:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:20] (03PS2) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) 
[puppet] - 10https://gerrit.wikimedia.org/r/863026 [19:42:47] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 42 hosts with reason: decom [19:42:56] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [19:43:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323907)', diff saved to https://phabricator.wikimedia.org/P42201 and previous config saved to /var/cache/conftool/dbconfig/20221201-194301-ladsgroup.json [19:43:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:43:04] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:43:05] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:43:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:43:47] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 42 hosts with reason: decom [19:43:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6c20a9fc-5041-4ab7-bed4-f80a2643f954) set by rzl@cumin2002 for 1 day, 0:00:00 on 42 host(s) and their se... 
[19:44:34] !log rzl@cumin2002 conftool action : set/pooled=inactive; selector: name=mw13(0[7-9]|[1-3]\d|4[0-8])\..* [19:44:53] !log gitlab-runner1002 - upgrading gitlab-runner package [19:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:00] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 318 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:46:27] (03CR) 10Eevans: [C: 04-1] "Not yet ready." [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [19:47:40] PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: PING CRITICAL - Packet loss = 100% [19:47:40] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:49:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:49:50] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:50:40] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:50:50] RECOVERY - MD RAID on ganeti2013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:52:26] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:53:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1061'] [19:56:08] PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: CRITICAL - Destination Unreachable (2001:df2:e500:1:103:102:166:8) [19:56:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bullseye [19:56:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1061.eqiad.wmnet with OS bullseye [19:59:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:48] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new 
version https://phabricator.wikmiedia.org/T324195 [20:00:01] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version https://phabricator.wikmiedia.org/T324195 [20:02:37] (03PS3) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:04:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:00] (03CR) 10Vgutierrez: [C: 03+1] varnish: Remove unused dstat plugins [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [20:09:05] (03CR) 10Ssingh: [C: 03+1] "(Thanks to Valentin for fixing the remaining issues with this CR.)" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:09:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage [20:09:55] (03PS5) 10Vgutierrez: setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:11:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:12:23] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:12:23] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [20:12:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [20:12:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage [20:14:31] (03CR) 10Vgutierrez: [C: 03+2] setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:14:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:15:04] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article 
title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:16:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [20:16:28] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:17:07] (03PS1) 10Vgutierrez: Release 0.36 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/863028 (https://phabricator.wikimedia.org/T321309) [20:17:09] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/862908 (owner: 10Paladox) [20:17:11] (03PS4) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:17:13] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/862909 (owner: 10Paladox) [20:17:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:17:23] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:17:45] (03PS5) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:18:14] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:04] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:21:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:08] (03CR) 10Paladox: "This change is ready for review." [labs/private] - 10https://gerrit.wikimedia.org/r/862910 (owner: 10Paladox) [20:22:23] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:22:23] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [20:23:13] (03PS6) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:23:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:00] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:58] (03PS7) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:27:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1061.eqiad.wmnet with OS bullseye [20:27:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1061.eqiad.wmnet with OS bullseye comple... 
[20:28:08] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:28:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:03] (03PS8) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:33:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [20:34:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) 05Open→03Resolved @Andrew all yours [20:35:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:36:21] (03CR) 10Ottomata: Update the spark and spark-operator images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [20:36:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:37:23] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [20:37:31] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 320 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:38:51] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:41:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:41:47] (03Abandoned) 10Sharvaniharan: Stream configs for newly migrated android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783874 (https://phabricator.wikimedia.org/T306385) (owner: 10Sharvaniharan) [20:41:54] (03PS9) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [20:44:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:46:01] (03PS10) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 
(https://phabricator.wikimedia.org/T323064) [20:47:17] !log aokoth@cumin1001 START - Cookbook sre.hosts.remove-downtime for gitlab1004.wikimedia.org [20:47:18] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab1004.wikimedia.org [20:50:07] (03PS3) 10Gergő Tisza: GrowthExperiments: Run refreshUserImpactData maintenance script in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [20:50:23] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:51:50] (03PS4) 10Gergő Tisza: GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [20:52:13] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:52:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:54:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:55:30] (03CR) 10Gergő Tisza: GrowthExperiments: Enable user impact refresh script on pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [20:56:07] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:59:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [21:00:05] brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T2100). Please do the needful. [21:00:05] zabe, MatmaRex, and tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:35] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:42] hi [21:01:19] i'm requesting a few interesting operations today, so you might want to do everyone else first :) [21:01:33] o/ [21:02:23] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [21:02:44] hey [21:02:48] (03PS11) 10Ryan Kemper: wdqs: add grizzly dashboard for uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [21:02:58] brennen: fyi I'm still decomming mw1307-mw1348, but they're pooled=inactive so should be totally out of your hair [21:03:15] happy to wait if you'd prefer to avoid the noise though :) [21:03:20] o/ [21:04:00] (03CR) 10Ryan Kemper: "Okay, this is ready for official review. With the latest iteration the request_sl*_query formulas are working properly (albeit perf-wise s" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [21:04:14] rzl: no worries [21:04:21] 👍 [21:04:25] zabe, starting with yours since sharvani_ doesn't seem to be here [21:04:33] ok [21:05:12] (03PS2) 10Brennen Bearnes: Start writing to cul_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:05:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:06:09] (03Merged) 10jenkins-bot: Start writing to cul_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:06:23] !log brennen@deploy1002 Started scap: Backport for [[gerrit:861853|Start writing to cul_actor on test wikis (T233004)]] [21:06:27] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:07:09] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: 
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:08:08] !log brennen@deploy1002 brennen and zabe: Backport for [[gerrit:861853|Start writing to cul_actor on test wikis (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:08:45] Hi Brennen .. here for the config patch deployment! 👋 [21:08:57] hi sharvani_! welcome. just doing a patch for zabe then we'll get yours underway. [21:09:04] zabe: anything to test? [21:09:15] yes [21:09:22] Ty 🙂 [21:09:27] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:09:37] brennen, could you do a query for me? [21:10:17] zabe: yeah, will figure it out - pestering thcipriani. :) [21:10:24] basically 'select * from cu_log limit 1' on testwiki [21:10:33] !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1307-1326].eqiad.wmnet [21:10:41] feel free to post the result to a wmf-nda protected paste if you prefer [21:10:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:11:05] I would like to see if the field is correctly being written to [21:11:33] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:12:57] zabe: https://phabricator.wikimedia.org/P42202 [21:13:22] (03PS1) 10RLazarus: cumin: Replace the mw-jobrunner-canary which is about to be decommed [puppet] - 10https://gerrit.wikimedia.org/r/863032 (https://phabricator.wikimedia.org/T303162) [21:13:28] !log rzl@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts mw[1307-1326].eqiad.wmnet [21:13:43] bah sorry, forgot the order by, could you 'select * from cu_log order by cul_id desc limit 1;'? [21:14:29] zabe: paste updated, looks more like what you're expecting i'd think [21:15:01] brennen, yep thanks and the actor id is correct, so lgtm [21:15:11] cool, syncing [21:21:09] (03CR) 10RLazarus: [C: 03+2] cumin: Replace the mw-jobrunner-canary which is about to be decommed [puppet] - 10https://gerrit.wikimedia.org/r/863032 (https://phabricator.wikimedia.org/T303162) (owner: 10RLazarus) [21:21:20] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:861853|Start writing to cul_actor on test wikis (T233004)]] (duration: 14m 56s) [21:21:24] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:21:38] 10SRE, 10SRE-Access-Requests: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) Thanks, all! [21:21:41] sharvani_: you're up next [21:21:59] Ready! thank you! 
[21:22:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan) [21:22:38] (03PS2) 10Brennen Bearnes: New configs for android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan) [21:22:50] (03CR) 10TrainBranchBot: "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan) [21:22:50] !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1307-1326].eqiad.wmnet [21:22:59] (03PS7) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [21:23:13] thanks for your help :) [21:23:37] (03PS8) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [21:23:56] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [21:24:48] (03Merged) 10jenkins-bot: New configs for android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan) [21:25:03] !log brennen@deploy1002 Started scap: Backport for [[gerrit:863011|New configs for android schemas]] [21:25:45] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:25:58] !log saving an image of wikitech-static-ord (aka wikitech-static) before upgrading the host to Buster [21:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:09] party time [21:26:22] andrewbogott: I'm around for MW stuff if it's unhappy after the OS upgrade :) [21:26:33] thank you! [21:26:48] !log brennen@deploy1002 brennen and sharvaniharan: Backport for [[gerrit:863011|New configs for android schemas]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:26:58] sharvani_: on mwdebug servers - please test [21:27:10] testing now.. [21:28:26] seeing it! thank you! [21:28:30] cool, syncing [21:28:41] 🙌 [21:29:54] boy it is taking a surprisingly long time to make this backup [21:30:00] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [21:31:17] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [21:34:46] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:34:52] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:863011|New configs for android 
schemas]] (duration: 09m 49s) [21:35:56] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:36:05] (03PS5) 10Brennen Bearnes: GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [21:36:23] Thanks for deploying Brennen! [21:36:24] tgr: yrs next [21:36:27] sure thing sharvani_ [21:36:42] brennen: thanks! no need to test [21:36:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [21:37:00] tgr: cool [21:38:16] (03Merged) 10jenkins-bot: GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [21:38:29] !log brennen@deploy1002 Started scap: Backport for [[gerrit:859568|GrowthExperiments: Enable user impact refresh script on pilot wikis (T322541)]] [21:38:32] T322541: UserImpact: Set up maintenance script to run in betalabs and production - https://phabricator.wikimedia.org/T322541 [21:40:13] !log brennen@deploy1002 brennen and kharlan: Backport for [[gerrit:859568|GrowthExperiments: Enable user impact refresh script on pilot wikis (T322541)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:42:10] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:43:06] (ConfdResourceFailed) firing: (68) confd resource 
_var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:43:34] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:43:48] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:45:08] (03CR) 10RLazarus: [C: 03+2] "Too late to fix it now, but for the record the commit message should have referenced T306162, not 303162." [puppet] - 10https://gerrit.wikimedia.org/r/863032 (https://phabricator.wikimedia.org/T303162) (owner: 10RLazarus) [21:46:17] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:859568|GrowthExperiments: Enable user impact refresh script on pilot wikis (T322541)]] (duration: 07m 48s) [21:46:21] T322541: UserImpact: Set up maintenance script to run in betalabs and production - https://phabricator.wikimedia.org/T322541 [21:46:48] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:47:48] MatmaRex: couple questions - do you have the necessary access for your stuff in beta? and how long do you estimate for that maintenance script in prod? i've never run this. [21:48:25] hi [21:48:42] brennen: i don't have access to beta, as far as i know [21:48:56] do you want access to beta? 
[21:49:21] eeeeh not really [21:49:30] lol [21:49:34] c'mon [21:49:38] heh [21:49:42] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:49:48] i don't know how long the script will take, but on the order of days [21:50:05] i guess we're almost out of time in the window today, i can reschedule this. it's not urgent [21:50:14] sounds good. :) [21:50:15] (03PS6) 10Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) [21:51:13] (03CR) 10CI reject: [V: 04-1] admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse) [21:51:18] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [21:51:18] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:51:50] brennen: so how would i actually get access to the beta cluster? [21:52:27] MatmaRex: can add you as an admin in horizon and you'd have root shell access. seems like you know what you're about, so seems reasonable for you to do it. 
:) [21:53:44] my only worry is that i'll get pinged one day when beta goes down for no reason ;) [21:54:56] there are hundreds of folks with this access [21:55:09] (most of them studiously ignoring what happens in beta, if i'm any example) [22:01:27] !log end of utc late backport & config window [22:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:08] !log restart swift-proxy on thanos::frontend eqiad [22:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:13] (03PS3) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 [22:03:20] (03PS7) 10Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) [22:04:05] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [22:04:06] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 206 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:04:15] (03CR) 10CI reject: [V: 04-1] admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse) [22:06:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 203 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:07:20] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:07:28] !log rzl@cumin1001 START - Cookbook sre.dns.netbox [22:10:24] (03PS4) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 [22:14:26] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [22:15:07] (03CR) 10David Caro: "The only blocker here is creating the service on port 8000, and would be nice to address the other unresolved comments." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [22:16:34] PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 667 MB (3% inode=91%): /tmp 667 MB (3% inode=91%): /var/tmp 667 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [22:16:52] PROBLEM - gdnsd daemon runs exactly once on dns5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 497 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [22:17:04] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns5004 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [22:17:26] PROBLEM - AuthDNS-over-TLS Works on dns5004 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [22:17:34] PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [22:18:09] (03PS8) 10Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - 
10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) [22:18:44] ^ dns5004 issues are expected, bblack and I are debugging [22:18:47] please safely ignore for now [22:18:59] (03CR) 10Eevans: [C: 03+1] "PCC (no-op): https://puppet-compiler.wmflabs.org/output/863026/1486/" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [22:19:12] !log rzl@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1307-1326].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001" [22:20:28] sukhe: was about to ask, running the netbox cookbook (via the decom cookbook) also gave me some dns5004 errors but I assume those are similarly expected [22:20:43] I guess the real question is, am I okay to ignore them and continue, or should I not have started decomming in the first place :P [22:21:16] rzl: out of curiosity, which host were you running the decom cookbook on? [22:21:19] was it in eqsin? 
[22:21:27] no, mw1307-1326 [22:21:29] ok [22:21:32] thank you [22:21:35] (with more to follow but all eqiad appservers) [22:21:49] you can share the dns5004 errors with me [22:21:57] and I will let you know if it's fine to continue or not [22:21:58] PROBLEM - Check systemd state on dns5004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_gdnsd_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:46] sukhe: sure, partial output was this https://www.irccloud.com/pastebin/YcgWE4EA/ [22:23:53] lmk if that's not enough context [22:24:12] oh yeah, you can ignore this [22:24:22] rad, thanks [22:25:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:26:04] PROBLEM - gdnsd checkconf on dns5004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [22:26:25] papaul: my decom cookbook wants to sync some of your netbox-hiera changes for cloudvirt hosts at the same time -- are they okay to merge, or should I hold off? 
[22:27:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:30:08] PROBLEM - Wikitech and wt-static content in sync on cloudweb1003 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static https://wikitech.wikimedia.org/wiki/Wikitech-static [22:30:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49122 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:32:10] Reedy: I've upgraded things but it insists on running php7.3 even though [22:32:12] https://www.irccloud.com/pastebin/nbplO6wh/ [22:32:23] any guess which package I'm missing? [22:33:21] andrewbogott: is PHP7.3 still installed, and $webserver config is still pointing at 7.3? [22:34:10] Yeah, libapache2-mod-php7.3 is still installed [22:34:16] (along with other 7.3 stuffs) [22:34:26] yeah, if I remove it it breaks in other interesting ways. [22:34:44] (03PS1) 10Ssingh: hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048) [22:34:50] a2dismod php7.3 && a2enmod php7.4 && service apache2 restart [22:34:51] or something? [22:35:34] hmm, mods-enabled looks like it's 7.4 [22:35:42] RECOVERY - gdnsd daemon runs exactly once on dns5004 is OK: PROCS OK: 1 process with UID = 497 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [22:35:44] just restart apache? [22:35:57] a2enmod was the piece I was missing apparently [22:36:12] I guess installing libapache2-mod-php7.4 doesn't do that? 
[22:36:51] (03PS2) 10Ssingh: hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048) [22:36:56] Anyway it seems better now! thank you [22:36:56] I think it will if there's no other php module enabled... but if there is, probably decides not to clobber the existing [22:37:17] that makes sense... sort of [22:37:46] (03CR) 10BBlack: [C: 03+1] hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [22:37:52] (03CR) 10Ssingh: [C: 03+2] hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [22:38:06] (ConfdResourceFailed) resolved: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:38:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:42:26] !log upgradedwikitech-static-ord (aka wikitech-static) to Debian Buster, installed php7.4, upgraded MW to 1_39. Will delete the rackspace backup image in a few days. [22:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:28] jouncebot: nowandnext [22:48:28] No deployments scheduled for the next 9 hour(s) and 11 minute(s) [22:48:28] In 9 hour(s) and 11 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221202T0800) [22:48:36] (03PS6) 10Urbanecm: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [22:48:42] (03CR) 10Urbanecm: [C: 03+2] "cleanup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [22:49:06] !log urbanecm@deploy1002 backport aborted: (duration: 00m 03s) [22:49:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [22:50:11] (03Merged) 10jenkins-bot: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [22:50:28] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:856008|GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue]] [22:50:29] (03PS5) 10Urbanecm: GrowthExperiments: Remove unused GEAllowAccessToNewImpactModule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [22:50:39] (03CR) 10Urbanecm: [C: 03+2] "cleanup, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [22:53:08] RECOVERY - Check systemd state on dns5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:38] !log rzl@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1307-1326].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001" [22:54:38] !log rzl@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:54:38] !log rzl@cumin1001 END (FAIL) - 
Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[1307-1326].eqiad.wmnet [22:54:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by rzl@cumin1001 for hosts: `mw[1307-1326].eqiad.wmnet` - mw1307.eqiad.wmnet (**WARN**) - Downtimed host... [22:56:41] (03CR) 10BCornwall: [V: 03+1 C: 03+2] varnish: Remove unused dstat plugins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall) [22:56:58] !log rzl@puppetmaster1001:~$ sudo puppet node deactivate mw1312.eqiad.wmnet # T306162 [22:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:01] T306162: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 [22:57:04] !log rzl@puppetmaster1001:~$ sudo puppet node deactivate mw1320.eqiad.wmnet # T306162 [22:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:50] RECOVERY - gdnsd checkconf on dns5004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [22:57:56] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:856008|GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue]] (duration: 07m 28s) [22:59:39] (03PS1) 10BCornwall: Remove since-deleted dstat plugin dir [puppet] - 10https://gerrit.wikimedia.org/r/863050 [22:59:43] !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1327-1346].eqiad.wmnet [23:01:44] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38553/console" [puppet] - 10https://gerrit.wikimedia.org/r/863050 (owner: 10BCornwall) [23:03:08] 10Puppet, 10SRE-tools, 10Infrastructure-Foundations, 10Python3-Porting, 10User-jbond: Port dstat related scripts to Python 3 - https://phabricator.wikimedia.org/T277910 (10BCornwall) 05Open→03Invalid No 
longer necessary since the scripts have been removed (See https://gerrit.wikimedia.org/r/c/operati... [23:03:13] 10Puppet, 10SRE, 10SRE-tools, 10Infrastructure-Foundations, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10BCornwall) [23:18:17] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:23:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:31:40] !log rzl@cumin1001 START - Cookbook sre.dns.netbox [23:33:18] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:34:15] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [23:34:25] !log rzl@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1327-1346].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001" [23:34:27] RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [23:34:37] RECOVERY - Recursive DNS on 
2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [23:35:46] !log rzl@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1327-1346].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001" [23:35:46] !log rzl@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:35:47] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1327-1346].eqiad.wmnet [23:35:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by rzl@cumin1001 for hosts: `mw[1327-1346].eqiad.wmnet` - mw1327.eqiad.wmnet (**WARN**) - Downtimed host... [23:37:29] !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1347-1348].eqiad.wmnet [23:39:45] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:43:45] !log rzl@cumin1001 START - Cookbook sre.dns.netbox [23:44:09] RECOVERY - AuthDNS-over-TLS Works on dns5004 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [23:45:58] !log rzl@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1347-1348].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001" [23:47:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1347-1348].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001" 
[23:47:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:47:14] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1347-1348].eqiad.wmnet [23:48:57] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns5004 is OK: OK: UP (pid=26982) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [23:53:18] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:54:16] (03PS4) 10RLazarus: conftool: remove old mw servers [puppet] - 10https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto) [23:56:05] (03CR) 10RLazarus: [C: 03+2] conftool: remove old mw servers [puppet] - 10https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto) [23:56:18] (03PS4) 10RLazarus: site: remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto) [23:58:15] (03CR) 10RLazarus: [C: 03+2] site: remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto)