[00:07:04] 10Data-Engineering, 10Product-Analytics: Suspicious user pageview activity in India during June from Android mobile web browsers - https://phabricator.wikimedia.org/T315267 (10Mayakp.wiki) thanks @Anoop ! do you have data on any increase in **content** added to kn wikipedia during that time? so we know it corr... [00:27:39] 10Data-Engineering (Sprint 5), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization: We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10Ahoelzl) [00:28:33] 10Data-Engineering, 10Event-Platform: [Data Quality] [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10Ahoelzl) [00:30:43] (SystemdUnitFailed) firing: systemd-timedated.service Failed on an-worker1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:31:09] (03PS3) 10Kimberly Sarabia: Adds new readme [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) [00:32:09] (03CR) 10Kimberly Sarabia: Adds new readme (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia) [00:35:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on an-worker1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:34] 10Data-Engineering, 10Product-Analytics: Suspicious user pageview activity in India during June from Android mobile web browsers - https://phabricator.wikimedia.org/T315267 (10Anoop) >>! In T315267#9328592, @Mayakp.wiki wrote: > thanks @Anoop ! 
do you have data on any increase in **content** added to kn wikipe... [02:45:17] 10Data-Engineering: [Data Quality] root cause analysis, pipeline improvement analysis image suggestion pipeline failure - https://phabricator.wikimedia.org/T351167 (10Ahoelzl) [02:46:36] 10Data-Engineering (Sprint 5): [Data Quality] [Needs Grooming] Calculate and log comprehensive post processing metrics for webrequests - https://phabricator.wikimedia.org/T349456 (10Ahoelzl) [03:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [05:26:34] 10Analytics, 10Product-Analytics: Hive: create table statement failure - https://phabricator.wikimedia.org/T280168 (10awight) I'm running into the exact same error when attempting to CTAS. On stat1009, using either beeline or hive: ` create table awight.translations as select wiki_db, page_id, rev... [05:48:03] 10Analytics, 10Product-Analytics: Hive: create table statement failure - https://phabricator.wikimedia.org/T280168 (10awight) Same stack trace if I create the table from HiveQL and try to "insert into table ... select ..." [06:55:28] 10Analytics, 10Product-Analytics: Hive: create table statement failure - https://phabricator.wikimedia.org/T280168 (10awight) I was able to run the query successfully from spark3-sql. [07:48:15] (EventgateValidationErrors) firing: ... 
[07:48:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:58:55] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:27:11] (03PS24) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [08:27:44] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [08:29:50] (03PS25) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [08:30:19] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [08:33:28] (03PS26) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [08:47:57] (03CR) 10Cyndywikime: "Done" [schemas/event/secondary] 
- 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [09:44:28] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10fgiunchedi) >>! In T343232#9322650, @BTullis wrote: >> In other words prometheus analytics will be configured with... [10:01:47] headsup, skein certificates were just renewed on all 5 airflow hosts at 11:00 local time [10:01:55] aka now [10:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:47] https://usercontent.irccloud-cdn.com/file/HHHvNg5k/image.png [10:13:30] We have a webrequest refine job that has failed from last night, which is causing lots of SLA misses on subsequent jobs. [10:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:37] I have sent an email to the data-engineering-alerts list about my findings for this failed task. I haven't re-run it because I haven't got the right SQL smarts. I can remember some pointers, but not enough to be confident re-running the job. 
cc sfaci and joal [11:03:49] !log depool druid100[4-6] set pooled=inactive [11:03:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:32:49] btullis: heya - we were on it with Santi :) [11:32:56] btullis: we'll follow up shortly [11:33:39] joal: Great! I thought that you probably would be, just thought it worthy of flagging up. Best wishes. [11:33:47] cheers btullis :) [11:38:43] 10Data-Platform-SRE, 10sre-alert-triage: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10LSobanski) The alert has since recovered but looking at the names in the linked change I'm adding Data Platform SRE to rev... [11:48:31] (EventgateValidationErrors) firing: ... [11:48:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [11:58:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on kafka-jumbo1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:58:55] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:59:49] PROBLEM - Check systemd state on kafka-jumbo1013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:18] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) [12:15:45] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) The hosts druid100[4-6] have been depooled and have been added to the decommissioningNodes mode. ` stevemunene@druid1004:~$ sudo decommission Decommissioning all services on druid1004.eqiad.wmnet eqiad/dr... [12:17:56] I have just announced a maintenance window starting tomorrow at 11:00 which will briefly affect HDFS, Hive, Druid, Hue, Superset, and DataHub. It is in support of T284150 [12:17:57] T284150: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 [12:19:05] Please do let me know if you have any issues or concerns about the work, or this window in particular. I will put some more details of the implementation plan and roll-back plan in the ticket. [12:19:07] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) [12:19:28] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) 05Open→03Resolved [12:56:19] RECOVERY - Check systemd state on kafka-jumbo1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on kafka-jumbo1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:36] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) I have announced a maintenance window for tomorrow, November 15th at 
11:00 UTC. The implementation plan will be as follows: * 10:30 - Merge and deploy https://gerrit.wikimedia.... [13:41:39] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) I'd be glad of anyone being able to sanity check the plan. The intention is to minimise the chances of any data discrepancies, whilst minimising errors from attempted writes to... [13:42:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:42:54] I wrote an implementation plan for tomorrow's MariaDB migration here, in case anyone would like to scrutinise it or sanity check it: https://phabricator.wikimedia.org/T284150#9330525 [14:01:47] I'm going to reimage an-druid1004 [14:04:55] 10Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1004.eqiad.wmnet with OS bullseye [14:24:43] (03CR) 10Urbanecm: [C: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users (035 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [14:28:03] !log performing a rolling restart of the mariadb services on dbstore100[3,5,7] post this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968668 [14:28:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:30:12] 10Data-Platform-SRE, 10Patch-For-Review: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 
(10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet` - search-loader2001.codfw.wmn... [14:42:43] 10Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1004.eqiad.wmnet with OS bullseye completed: - an-druid1004 (**PASS**) - Downtimed on Icin... [14:46:53] an-druid1004 is back up, running bullseye, with all data [14:47:29] (03CR) 10Ottomata: [C: 03+2] Retry produceCanaryEvents [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965662 (https://phabricator.wikimedia.org/T326002) (owner: 10Aqu) [14:50:30] !log roll-restarting the presto cluster to pick up new puppet 7 CA settings [14:50:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:12:39] (03CR) 10Phuedx: [C: 03+2] Adds new readme [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia) [15:13:17] (03Merged) 10jenkins-bot: Adds new readme [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia) [15:17:24] (03PS27) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:17:31] (03CR) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users (034 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:17:52] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 
10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:23:53] (03PS28) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:24:28] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:28:37] (03PS29) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:29:07] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:30:17] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [15:30:49] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [15:36:26] 10Data-Platform-SRE: Decom search-loader VMs still using Buster - https://phabricator.wikimedia.org/T350078 (10bking) 05Open→03Invalid Duplicate of T351123 ... closing. 
[15:36:29] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) [15:37:12] 10Data-Platform-SRE: Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (10Stevemunene) a:03Stevemunene [15:41:23] 10Data-Platform-SRE, 10Patch-For-Review: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 (10bking) a:03bking [15:41:26] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm) [15:42:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:43:47] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) [15:44:07] (03PS30) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:44:35] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:44:39] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) [15:45:41] 10Data-Platform-SRE, 10Patch-For-Review: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 (10bking) 
p:05Triage→03Medium [15:47:07] (03PS31) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:48:31] (EventgateValidationErrors) firing: ... [15:48:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:50:27] 10Data-Platform-SRE, 10Discovery-Search (Current work): Update search-loader dashboard to reflect new search-loader hosts - https://phabricator.wikimedia.org/T351233 (10bking) [15:58:55] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:00:21] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [16:09:25] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:09:56] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:13:02] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm) [16:14:59] (PuppetFailure) firing: Puppet has failed on an-worker1151:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:15:41] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) [16:19:59] (PuppetFailure) firing: (2) Puppet has failed on an-worker1151:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:24:59] (PuppetFailure) resolved: (2) Puppet has failed on an-worker1151:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:26:21] 10Data-Engineering, 10Observability-Logging, 10Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Gehel) + #data-engineering for visibility [16:30:41] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:46:34] joal: do you know what gobblin version we're running, and where I can find the information myself? Thanks! [16:57:10] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) 05Open→03Resolved @sguebo_WMF thanks so much for your input. I've gone ahead and enabled the TagManager plugin, so I can m... [16:58:48] brouberol: Hi - we're running version 0.16 as of now [16:59:19] thanks! [17:03:03] and if I could, same question for ice [17:03:09] *iceberg? 🙏 [17:05:35] brouberol: the iceberg jar is pulled in during the conda-analytics build. Have a look at data-engineering/conda-analytics in gitlab. 
[17:06:03] thanks, that's 100% what I needed to know [17:12:51] 10Data-Engineering, 10Data-Platform-SRE, 10Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (10BTullis) This plugin and its functionality has been reviewed by the privacy team and their assessment is that it is **low risk**... [17:28:45] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) [17:32:42] 10Data-Engineering, 10Event-Platform: Refactor or create new revision visibility and user block event streams to better handle privacy in external state updates - https://phabricator.wikimedia.org/T349845 (10Ottomata) [17:32:46] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) p:05Triage→03High The external communications department has requested that we treat this as a high priority request, since they would like to be able... 
[17:38:41] 10Data-Engineering, 10Event-Platform: Refactor or create new revision visibility and user block event streams to better handle privacy in external state updates - https://phabricator.wikimedia.org/T349845 (10Ottomata) [17:40:46] 10Data-Engineering, 10Event-Platform: Refactor or create new revision visibility and user block event streams to better handle privacy in external state updates - https://phabricator.wikimedia.org/T349845 (10Ottomata) [17:40:59] 10Data-Engineering, 10Event-Platform: Refactor or create new revision visibility and user block event streams to better handle privacy in external state updates - https://phabricator.wikimedia.org/T349845 (10Ottomata) [17:41:23] 10Data-Engineering, 10Event-Platform: Refactor or create new revision visibility and user block event streams to better handle privacy in external state updates - https://phabricator.wikimedia.org/T349845 (10Ottomata) [17:46:08] 10Data-Engineering, 10Event-Platform, 10Security, 10Vuln-Infoleak: Refactor or create new revision visibility and user block event streams to better handle privacy in external state updates - https://phabricator.wikimedia.org/T349845 (10Ottomata) [17:46:16] ottomata: what changes do you expect us to deploy in refinery-source? it seems there is only retry canary-events [17:46:21] is that what you wish? [17:48:14] cause, your fix for events-schemas is in eventutilities, and it's not been released nor linked in refinery-source [17:48:42] I FORGOT ABOUT THAT [17:48:48] i need to release event utils first! [17:48:49] sorry! [17:49:01] doing now... if you need to do the train now go for it... 
[17:49:24] ottomata: we're going to wait - shouldn't be long [17:49:26] k [17:49:33] Starting build #29 for job wikimedia-event-utilities-maven-release-docker [17:50:30] (03PS1) 10Ottomata: update Changelog with producecanaryevents change [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974225 [17:50:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] update Changelog with producecanaryevents change [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974225 (owner: 10Ottomata) [17:53:03] Project wikimedia-event-utilities-maven-release-docker build #29: 09SUCCESS in 3 min 30 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/29/ [17:54:53] (03PS1) 10Ottomata: Bump eventutilities version to 1.3.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974246 (https://phabricator.wikimedia.org/T321854) [17:56:00] (03CR) 10Ottomata: [C: 03+2] Bump eventutilities version to 1.3.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974246 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [17:56:42] okay joal when https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/974246 merges you can proceed [17:56:51] Yup I've followed you :) [17:57:12] :) [17:57:18] ottomata: we're gonna change your patch on changelog, to make the retry-canary appear on the new version [17:57:22] ok? [17:57:34] please change as needed, thought i had the right one. yes please! [17:57:57] o yeah shoulda been 0.2.26 sorry [17:58:08] no worries, will do :) [17:58:17] (EventgateValidationErrors) resolved: ... 
[17:58:17] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:00:43] (03CR) 10CI reject: [V: 04-1] Bump eventutilities version to 1.3.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974246 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [18:03:48] ottomata: the latest patch broke CI :( [18:04:30] bah [18:04:37] looking [18:04:46] (EventgateValidationErrors) firing: ... [18:04:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:05:00] (EventgateValidationErrors) resolved: ... [18:05:01] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:05:04] ottomata: could it be that the released artifact has not landed on archiva? [18:05:18] looking... [18:05:43] it's there! [18:06:10] yeah and it downloaded? 
13:00:36 [INFO] Downloaded from wmf-releases: https://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities-spark/1.3.2/eventutilities-spark-1.3.2.jar (6.4 kB at 133 kB/s) [18:06:26] hm, [18:06:31] weird [18:06:41] i guess this one failed? [18:06:42] 13:00:36 [INFO] Downloading from wmf-releases: https://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities/1.3.2/eventutilities-1.3.2-shaded.jar [18:06:46] I'm gonna ask for a recheck [18:07:14] Ah, the shaded one... [18:07:31] yes... [18:07:39] did we rename it in newer eventutilities or something? [18:07:51] Classifier [18:07:54] jar-with-dependencies [18:07:55] ? [18:07:56] https://archiva.wikimedia.org/#artifact-details-download-content/org.wikimedia/eventutilities/1.3.2 [18:09:10] hm, I don't think so [18:09:25] https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/937944 [18:09:46] (EventgateValidationErrors) firing: ... [18:09:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:09:58] i think this was merged since the last version 1.2.11 [18:10:30] so maybe instead of eventutilities classifier: shaded [18:10:35] we just want to depend on eventutilities-shaded? [18:10:38] dcausse: ^? [18:10:59] I assume you're right ottomata [18:11:06] will try that... [18:12:56] do we still need the classifier: shaded? guess not? 
[18:13:01] I don't think so [18:13:55] (03PS2) 10Ottomata: Bump eventutilities version to 1.3.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974246 (https://phabricator.wikimedia.org/T321854) [18:14:24] oops need to change in job pom too [18:14:37] (03PS3) 10Ottomata: Bump eventutilities version to 1.3.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974246 (https://phabricator.wikimedia.org/T321854) [18:16:07] (03CR) 10Joal: [C: 03+2] Bump eventutilities version to 1.3.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/974246 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [18:16:32] ottomata: Let's see if that thing works [18:20:46] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) TL;DR: Spark 3.3.2 test successful on Production cluster. Longer: On `stat1007.eqiad.wmnet`: create conda env: ` export... [18:22:26] looks like it did! [18:24:46] \o/ [18:24:57] It seems I need to manually submit it - weird! [18:26:14] woops my bad :) [18:26:22] all good - deploying with that ottomata [18:26:35] ty joal [18:29:28] 10Analytics, 10Data-Engineering (Sprint 5), 10Event-Platform, 10User-notice: change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) [18:32:10] Starting build #131 for job analytics-refinery-maven-release-docker [18:33:05] 10Data-Engineering (Sprint 5), 10Event-Platform: change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10JJMC89) [18:34:30] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) TL;DR: Spark 3.4.1 test successful on Production cluster. Longer: Spark 3.4.1: On `stat1007.eqiad.wmnet`: create conda... 
[18:35:30] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) Thanks for all the work to make this one happen @BTullis ! This unblocks a path to production for the Dumps 2.0 effort! 🎉 [18:36:37] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1011.eqiad.wmnet with OS bullseye [18:36:52] ottomata: oops yet I had prepared https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/937954 but forgot to followup [18:46:05] Project analytics-refinery-maven-release-docker build #131: 09SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/131/ [18:46:51] dcausse: ahh, ty anyway! [18:52:25] Starting build #90 for job analytics-refinery-update-jars-docker [18:52:48] Project analytics-refinery-update-jars-docker build #90: 09SUCCESS in 23 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/90/ [19:06:46] 10Data-Engineering (Sprint 5), 10Event-Platform: change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) On investigation of the config [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates... 
[19:09:47] 10Data-Engineering (Sprint 5), 10Event-Platform: change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) Related: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/925852 [19:16:24] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1011.eqiad.wmnet with OS bullseye completed: - aqs1011 (**PASS**) - Downtimed on Icinga/Alertmanage... [19:17:57] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:21:13] 10Data-Engineering, 10Event-Platform: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ahoelzl) [19:22:39] 10Data-Platform-SRE, 10Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (10RKemper) a:03RKemper [19:23:12] 10Data-Engineering, 10Event-Platform: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) There seems to be an [[ https://github.com/search?q=repo%3Awikimedia%2Fmediawiki-services-change-propagation%20cases&type=code | undocumented `cases` r... [19:24:22] !log Deploying refinery using scap [19:24:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:29:06] 10Data-Engineering (Sprint 5): [Maintenance] Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (10Ottomata) Working notes doc: https://docs.google.com/document/d/1FnNIMpinLb3vKq5qwKckU4z4LtBKuRuWIpCx8P5mPkk/edit I may turn t... 
[19:55:07] !log Deployed refinery using scap, then deployed onto hdfs [19:55:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:58:55] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:04:13] (DiskSpace) firing: Disk space druid1011:9100:/srv 5.923% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=druid1011 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:09:13] (DiskSpace) firing: (2) Disk space druid1010:9100:/srv 5.909% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:14:13] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 5.874% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:19:42] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.service Failed on an-presto1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:38] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:46] (EventgateValidationErrors) resolved: ... 
[20:29:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:35:27] !log restarted Druid supervisors [20:35:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:35:47] !log recreated unique_devices iceberg tables [20:35:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:49:42] (SystemdUnitFailed) resolved: prometheus_puppet_agent_stats.service Failed on an-presto1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:54] 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) > it might be easier to build canary event filtering feature into change-propagation code itself https://gerrit.wikimedia.org/r/9... [21:35:29] 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) [21:42:39] 10Data-Platform-SRE, 10Discovery-Search (Current work): Update search-loader dashboard to reflect new search-loader hosts - https://phabricator.wikimedia.org/T351233 (10bking) 05Open→03Invalid Confirmed, we do not need to take further action. The decommissioned hosts have been removed automatically, and ot... 
[21:42:42] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) [21:46:21] (03CR) 10Urbanecm: [C: 03+1] "thanks for all the improvements here. +1'ing the patch for now, will be happy to merge together with the counterpart patch in WikimediaEve" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [22:07:26] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [22:19:35] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Downtimed on Icinga/... [22:33:27] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [22:53:08] 10Data-Engineering, 10Data Pipelines, 10SRE, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Isaac) Realizing I never linked any code for this in case folks wanted to work with the data but here's an example where I'm trying to grab both sources:... 
[22:56:53] PROBLEM - Disk space on druid1009 is CRITICAL: DISK CRITICAL - free space: /srv 47486 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1009&var-datasource=eqiad+prometheus/ops [22:58:19] PROBLEM - Disk space on druid1011 is CRITICAL: DISK CRITICAL - free space: /srv 51183 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1011&var-datasource=eqiad+prometheus/ops [23:01:11] PROBLEM - Disk space on druid1010 is CRITICAL: DISK CRITICAL - free space: /srv 48820 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1010&var-datasource=eqiad+prometheus/ops [23:03:26] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking) [[ https://github.com/wikimedia/operations-alerts/blob/master/team-sre/probes.yaml | team-sre/probes.yaml ]] in the alerts repo looks like... [23:04:15] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (10bking) a:03bking [23:06:51] 10Data-Engineering: Add "did edit" field to pageview_actor - https://phabricator.wikimedia.org/T277785 (10Isaac) Just noting because I never followed up on this task. I personally would like to just decline this task for a few reasons (my thinking has changed on it) and will if I don't hear anything against decl... 
[23:26:34] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... [23:37:23] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [23:58:55] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace