[00:04:11] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1159.eqiad.wmnet with OS bullseye completed: - an-worker1159 (**WA... [00:14:38] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking) I've created another 24-hour silence for this alert, UUID 59b5ca30-1aeb-4d06-b083-7023a373ccb3 . [00:24:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.37% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:27:55] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye completed: - an-worker1160 (**WA... [01:08:43] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with errors: - an-worke... 
[01:38:30] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye [01:38:36] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1163.eqiad.wmnet with OS bullseye [01:38:40] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [01:40:26] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1165.eqiad.wmnet with OS bullseye [01:40:30] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye [01:41:37] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1166.eqiad.wmnet with OS bullseye [01:42:13] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host 
an-worker1167.eqiad.wmnet with OS bullseye [01:43:13] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye [01:43:18] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1169.eqiad.wmnet with OS bullseye [01:43:55] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1170.eqiad.wmnet with OS bullseye [01:45:13] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye [01:45:20] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye [01:45:27] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1173.eqiad.wmnet with OS bullseye [01:46:25] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by jclark@cumin1001 for host an-worker1174.eqiad.wmnet with OS bullseye [01:46:29] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye [02:12:42] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [02:17:17] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1163.eqiad.wmnet with OS bullseye completed: - an-worker1163 (**WA... [02:18:24] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [02:19:42] (SystemdUnitFailed) firing: prometheus-ipmi-exporter.service Failed on an-worker1168:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:21:35] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1166.eqiad.wmnet with OS bullseye completed: - an-worker1166 (**WA... 
[02:24:33] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1169.eqiad.wmnet with OS bullseye completed: - an-worker1169 (**WA... [02:26:06] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1165.eqiad.wmnet with OS bullseye completed: - an-worker1165 (**WA... [02:27:22] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye completed: - an-worker1175 (**PA... [02:28:16] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye [02:28:23] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye [02:28:52] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1173.eqiad.wmnet with OS bullseye completed: - an-worker1173 (**WA... 
[02:30:35] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1167.eqiad.wmnet with OS bullseye completed: - an-worker1167 (**WA... [02:31:40] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1174.eqiad.wmnet with OS bullseye completed: - an-worker1174 (**WA... [02:34:03] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1170.eqiad.wmnet with OS bullseye completed: - an-worker1170 (**WA... [02:58:45] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [02:58:53] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [03:05:25] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye executed with errors: - an-worke... 
[03:05:33] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [03:08:38] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye completed: - an-worker1168 (**WA... [03:11:40] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye completed: - an-worker1164 (**WA... [03:13:02] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [03:13:10] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye [03:13:15] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye [03:13:28] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye [04:24:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.369% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:33:03] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [04:33:18] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [04:33:29] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [04:33:42] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [06:18:25] 10Data-Engineering (Sprint 5): [Data Quality] [Needs Grooming] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (10Ahoelzl) Discussion with Brian and Guillaume. 
[08:24:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.369% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:21:07] Hi brouberol! I've set up a time in our agenda to discuss the Spark history service https://phabricator.wikimedia.org/T330176 Feel free to move it to a more convenient time for you. Also, we may not need a full hour. [09:27:01] Thanks! I'll be there. It wasn't reflected in phabricator (yet), but we're making good progress in getting it deployed to k8s. We're currently fighting kerberos, but as it stands, we have the server running, talking to the kerberos server, and getting denied authorization. [09:38:24] btullis: let me know when is good to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/735029 (or feel free to merge yourself) [09:39:15] jbond: Thanks. I'll merge it later this morning then. [09:40:46] btullis: great thanks [09:57:33] 10Data-Engineering, 10Data Products (Data Product Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (10phuedx) >>! In T351909#9359501, @JAllemandou wrote: > … and implement the metric monitori... 
[09:59:25] 10Data-Platform-SRE, 10sre-alert-triage: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10BTullis) p:05Triage→03High [09:59:56] 10Data-Platform-SRE: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (10BTullis) p:05Triage→03Medium [10:00:14] 10Data-Platform-SRE: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (10BTullis) p:05Triage→03High [10:01:47] 10Data-Platform-SRE: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (10BTullis) I'm setting this to high priority because dbstore1003 is currently at 90% of capacity on `/srv` ` btullis@dbstore1003:~$ df -h /srv Filesystem Size Used Avail Use% Mount... [10:31:22] I am starting to work on the airflow instances now. I will pause all active DAGs with API calls. [10:33:17] (03PS1) 10Phuedx: product_metrics: Add performer.is_bot property to common fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/978487 (https://phabricator.wikimedia.org/T350883) [11:38:13] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) The airflow 2.7.3 package containing the statsd module is now installed on every airflow instance. When w... [11:50:47] hi all, who would be a good person to talk with about superset? [11:52:05] btullis: someone named you ;). [11:52:26] i have this patch which sets users with no ssh keys to have a shell of nologin https://gerrit.wikimedia.org/r/c/operations/puppet/+/666367 [11:53:02] this is basically the superset kerberos users. wondering if you can a) check the patch and b) be around to validate it doesn't break anything [11:56:42] Hiya. Sure, having a look now. 
I don't think that we have many users who fit in that category any more, but I'll check. [11:56:56] cheers [12:01:47] jbond: pcc fails on it. [12:02:31] ahh ok, i'll take a look at it then ping you again, sorry about that [12:02:59] No worries, happy to help. I'm checking the compute::user cr as well now. [12:03:09] great thanks [12:12:52] OK, the compute::user one is deployed. Thanks for that. [12:24:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:26:59] 10Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (10BTullis) 05Resolved→03Open Reopening, since we are seeing this alert again at 95% of capacity. {F41545855} [13:58:24] 10Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (10BTullis) Looking at this, we do have some spare capacity on the LVM volume group. ` btullis@an-web1001:~$ df -h /srv Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg0-srv 1.4T 1.3T... [14:00:58] 10Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (10BTullis) I have increased the size and the volume is now at 76% of capacity. ` btullis@an-web1001:~$ sudo lvresize -L +350G vg0/srv Size of logical volume vg0/srv changed from 1.38 TiB (362144 extents) t... 
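Editor's note: the "76% of capacity" figure reported after the resize can be sanity-checked with a little arithmetic. The sketch below is not part of the original log. It assumes the old volume size is the 1.38 TiB quoted by `lvresize`, that about 5.37% of it was free (the figure from the DiskSpace alert), and that `+350G` means 350 GiB. The function name is hypothetical, and filesystem overhead is ignored.

```python
def percent_used_after_grow(size_gib: float, free_fraction: float, grow_gib: float) -> float:
    """Return the percentage of the volume in use after growing it by grow_gib.

    The amount of data is assumed unchanged; only the total size grows.
    """
    used_gib = size_gib * (1.0 - free_fraction)
    return 100.0 * used_gib / (size_gib + grow_gib)


# Assumed inputs, taken from the log lines above.
old_size_gib = 1.38 * 1024   # 1.38 TiB, as printed by lvresize
free_fraction = 0.0537       # 5.37% free, from the DiskSpace alert
grow_gib = 350               # lvresize -L +350G

pct = percent_used_after_grow(old_size_gib, free_fraction, grow_gib)
print(f"{pct:.1f}% used after resize")
```

Under these assumptions the result rounds to 76%, which agrees with the figure in the Phabricator comment.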
[14:01:33] !log increased the size of the vg0/srv logical volume on an-web1001 by 350 GB for T349889 [14:01:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:01:36] T349889: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 [14:02:11] 10Data-Platform-SRE: /srv on an-web1001 is low on disk space - https://phabricator.wikimedia.org/T349889 (10BTullis) 05Open→03Resolved [14:04:00] (DiskSpace) resolved: Disk space an-web1001:9100:/srv 5.371% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:04:40] !log depooling schema1003 for reimage T349286 [14:04:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:04:43] T349286: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 [14:10:29] !log reimaging schema1003 to bookworm for T349286 [14:10:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:10:32] T349286: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 [14:10:50] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host schema1003.eqiad.wmnet with OS bookworm [14:38:09] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host schema1003.eqiad.wmnet with OS bookworm completed: - schema1003 (**PASS**)... 
[14:41:51] !log pooled schema1003 after upgrade to bookworm [14:41:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:43:02] !log depooling schema1004 for reimage T349286 [14:43:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:43:05] T349286: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 [14:43:58] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host schema1004.eqiad.wmnet with OS bookworm [14:44:09] !log reimaging schema1004 to bookworm for T349286 [14:44:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:02:17] (KafkaReplicationFactorTooLow) firing: SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.inuka.wiki_highlights_experiment&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [15:03:35] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye [15:04:01] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [15:04:03] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - 
https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye [15:04:06] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye [15:07:17] (KafkaReplicationFactorTooLow) resolved: SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.inuka.wiki_highlights_experiment&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [15:15:06] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host schema1004.eqiad.wmnet with OS bookworm completed: - schema1004 (**PASS**)... 
[15:23:29] 10Data-Engineering (Sprint 6): [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (10lbowmaker) [15:24:51] !log pooled schema1004 after upgrade to bookworm for T349286 [15:24:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:24:55] T349286: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 [15:30:11] !log depool schema2003 for upgrade to bookworm [15:30:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:30:46] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host schema2003.codfw.wmnet with OS bookworm [15:31:30] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 (10BTullis) [15:49:18] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [] remove unused secrets from kubernetes.yaml on private puppet [15:50:05] btullis: fyi i applied the admin nologin patch but in testing it on an-master1002 it updated way more people than i expected so i'm reverting [15:50:21] OK, thanks. [15:50:31] however it could mean that there are some access issues until the revert is fully applied (i.e. 30 mins) [15:50:57] OK, I'll be on the lookout, cheers. 
[15:51:02] thanks [16:05:08] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye completed: - an-worker1161 (**WA... [16:05:42] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye completed: - an-worker1162 (**WA... [16:07:35] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye completed: - an-worker1172 (**PA... [16:08:59] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye completed: - an-worker1171 (**WA... [16:16:42] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host schema2003.codfw.wmnet with OS bookworm completed: - schema2003 (**PASS**)... 
[16:21:33] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [16:54:32] eqi an-web1001 [16:54:35] woop [16:55:19] mwarf :-) [16:55:27] XD [16:59:36] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [16:59:47] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [17:07:24] !log pooled schema2003 after reimage to bookworm [17:07:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:10:28] !log depool schema2004 for reimage to bookworm for T349286 [17:10:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:10:31] T349286: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 [17:10:59] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host schema2004.codfw.wmnet with OS bookworm [17:35:21] PROBLEM - Host flink-zk2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:55] RECOVERY - Host flink-zk2001 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [17:38:13] PROBLEM - Host flink-zk2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:39] RECOVERY - Host flink-zk2003 is UP: PING OK - Packet loss = 0%, RTA = 72.94 ms [17:39:42] (SystemdUnitFailed) firing: ifup@ens13.service Failed on flink-zk2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:42] (SystemdUnitFailed) resolved: (2) ifup@ens13.service Failed on flink-zk2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:25] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host schema2004.codfw.wmnet with OS bookworm completed: - schema2004 (**PASS**)... [18:12:18] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 (10BTullis) [18:12:43] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [18:12:46] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bookworm - https://phabricator.wikimedia.org/T349286 (10BTullis) 05Open→03Resolved [18:15:24] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [18:18:22] 10Data-Engineering, 10Data-Platform-SRE, 10AQS2.0: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (10BTullis) 05Open→03Declined This is no longer necessary, since we have migrated all AQS endpoints to AQS 2.0. 
[21:28:33] 10Data-Engineering, 10Data-Platform-SRE, 10Movement-Insights, 10Product-Analytics: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (10OSefu-WMF) p:05High→03Low [21:42:41] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [21:47:58] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [21:57:43] (03CR) 10Clare Ming: [C: 03+2] product_metrics: Add performer.is_bot property to common fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/978487 (https://phabricator.wikimedia.org/T350883) (owner: 10Phuedx) [21:58:22] (03Merged) 10jenkins-bot: product_metrics: Add performer.is_bot property to common fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/978487 (https://phabricator.wikimedia.org/T350883) (owner: 10Phuedx) [22:01:10] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [22:03:37] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) I'm happy to say the flink operator migration is complete. Commons an... 
[22:05:00] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) 05Open→03Resolved [22:23:16] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [22:56:35] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2091.codfw.wmnet with OS bookworm [22:57:11] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking) We've silenced the alert for another 24 hours. The [[ https://grafana-rw.wikimedia.org/d/O0nHhdhnz/network-probes-ove... [23:34:54] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [23:45:53] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2091.codfw.wmnet with OS bookworm completed: - elastic2091 (**PASS**)... [23:46:11] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [23:47:00] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) 05Open→03Resolved @bking all your's