[00:08:32] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) resolved: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[00:08:32] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[01:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:26] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Nikki) Could https://qlever.cs.uni-freiburg.de/api/wikimedia-commons also be added? It looks like https:...
[05:38:59] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:57] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (CodeReviewBot) pfischer merged https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_req...
[08:38:05] Data-Engineering: Check home/HDFS leftovers of nickifeajika - https://phabricator.wikimedia.org/T354241 (MoritzMuehlenhoff)
[09:38:59] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:59] (PuppetFailure) firing: Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:08:42] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 05): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (SGupta-WMF) a:SGupta-WMF
[10:08:59] (PuppetFailure) firing: (2) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:13:59] (PuppetFailure) firing: (4) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:18:59] (PuppetFailure) firing: (6) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:23:59] (PuppetFailure) firing: (7) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:31:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:12] !log restarted the monitor_refine_event.service on an-launcher1002 to clear alert
[10:32:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:33:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:06] Data-Engineering (Sprint 6), Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (mfossati) In progress→Resolved ` mfossati@stat1008:~$ hdfs dfs -coun...
[10:51:41] Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (mfossati) Hi @VirginiaPoundstone , @JAllemandou : friendly note that snapshots are accumulating. For instance: ` mf...
[10:56:06] Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (mfossati)
[11:14:12] (SystemdUnitFailed) firing: prometheus-mysqld-exporter.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:07] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) Transferring the first database snapshot with: ` btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1004.eqiad.wmnet:/srv/backups/snapshots/latest...
[11:41:21] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) Transferring the second database snapshot with: ` btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1002.eqiad.wmnet:/srv/backups/snapshots/lates...
[11:54:23] Data-Engineering, Data-Platform-SRE: NEW BUG REPORT remove mysql databases from SQLLab - https://phabricator.wikimedia.org/T337056 (KCVelaga_WMF) I have requested for `wikishared` database to be enabled to setup a monitoring dashboard for The Wikipedia Library eligibility echo notifications being sent da...
[11:56:28] Data-Engineering, Data-Platform-SRE: NEW BUG REPORT remove mysql databases from SQLLab - https://phabricator.wikimedia.org/T337056 (BTullis) As per the request from @KCVelaga, I have re-enabled SQL Lab access for the wikishared database. {F41649561,width=60%}
[12:16:06] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I have created the two remaining grants backups from dbstore1003 with: ` root@dbstore1003:~# pt-show-grants -S /run/mysqld/mysqld.s5.so...
[12:19:55] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) Started the s6 service ` btullis@dbstore1009:/srv/sqldata.s6$ sudo systemctl start mariadb@s6 ` Obtained the gtid position ` btullis@dbstore1009:/srv/sqldata...
[12:33:02] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) @Marostegui - Please could you add the sections on dbstore1009 to the zarcillo database, so that they get picked up by Prometheus? They will replace those cu...
[12:36:46] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I used `transfer.py` to copy these files from the old to the new host, as shown. ` btullis@cumin1002:~$ sudo transfer.py dbstore1003.eq...
[13:11:44] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I have recreated the grants for the three sections. In each case, I first dropped all grants except those for `root@localhost` and `ma...
[13:13:24] Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (MoritzMuehlenhoff)
[13:14:02] Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (MoritzMuehlenhoff) There is one more Search service I had initially missed, the apifeatureusage* Logstash cluster. I've extended the task descr...
[13:15:28] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) Started the s8 service ` btullis@dbstore1009:/srv/sqldata.s8$ sudo systemctl start mariadb@s8 ` Obtained the gtid position ` btullis@dbstore1009:/srv/sqldata...
[13:17:51] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) Starting the transfer of the x1 section with: ` btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1004.eqiad.wmnet:/srv/backups/snapshots/latest/...
[13:26:03] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (brouberol) ` brouberol@kafka-test1010:~$ kafka topics --topic codfw.cirrussearch.update_pipeline.update.rc0 --alter --partitions 5 kafka-top...
[13:27:02] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (brouberol)
[13:29:12] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:30:32] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:18] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:34:12] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:12] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:58] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:39:12] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:59] (PuppetFailure) firing: (7) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:48:59] (PuppetFailure) firing: (7) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:51:15] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products: Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (lbowmaker)
[13:53:59] (PuppetFailure) firing: (7) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:56:02] Data-Engineering (Sprint 8), serviceops-radar, Data Products (Data Products Sprint 05), Patch-For-Review: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (CodeReviewBot) xcollazo merged https://gitlab.wikimedi...
[13:58:59] (PuppetFailure) firing: (7) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:59:49] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I've just noticed something odd about icinga, which I should probably look at before going any further. Icinga shows that it is not a s...
[14:03:59] (PuppetFailure) resolved: (6) Puppet has failed on stat1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:06:21] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) It looks like there is a grant missing. This is what happens when I run the command manually. ` nagios@dbstore1008:/etc/nagios/nrpe.d$...
[14:13:55] Data-Engineering, Data-Platform-SRE, Product-Analytics: Conda analytics environments breakage - conflicting dependencies between r-base and other - https://phabricator.wikimedia.org/T343823 (nettrom_WMF) I've tested creating new environments on two stat machines with different OSes, and both with and...
[14:21:31] Krinkle: thanks for review on the EventLoggingLegacyConverter. Bartoz advised me to keep it simple which was why I didn't use php curl or MediaWiki. But, it sounds like you prefer I use MediaWiki. Am very happy to! I think i see at least one example of doing this (w/favicon.php), but if I do this, I'm not sure how to wire mediawiki-config together with mediawiki code locally so I can test and run things.
[14:23:24] ottomata: I consider calling php-curl built ins in favour of legacy streams as part of keeping it simple. I'm neutral on using MW for this. Indeed, that would not be testable locally as-is, you'd need to temporarily replace the "require" and "multiversion" call with a require for your local path to WebStart instead. After that should be testable as-is.
[14:27:32] well, your logging argument makes me think mw would be better, although perhaps you are saying trigger_error would be sufficient?
[14:30:14] iiuc...php7-fatal-error.php is installed in /etc/php
[14:30:54] so...will trigger_error end up using that? ...reading...
[14:35:22] hm, this excimer log function looks good...i'll use that
[15:42:02] Data-Engineering, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add testing framework to wmfdata-python - https://phabricator.wikimedia.org/T349531 (mpopov)
[15:47:11] Data-Engineering, Wmfdata-Python: Remove Wmfdata code related to the conda-analytics migration - https://phabricator.wikimedia.org/T346707 (mpopov)
[15:47:16] Data-Engineering, Wmfdata-Python: Remove Wmfdata's custom update-notification code - https://phabricator.wikimedia.org/T346706 (mpopov)
[15:50:34] Data-Engineering, Wmfdata-Python: Remove Wmfdata's custom update-notification code - https://phabricator.wikimedia.org/T346706 (AndrewTavis_WMDE) Sharing a user experience on this: I did update the package because of the notification before we got the email saying that we should update. I don't necessari...
[16:25:28] Thanks Krinkle , responded to all comments and pushed new patch.
[16:25:59] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (Marostegui) >>! In T351924#9432319, @BTullis wrote: > @Marostegui - Please could you add the sections on dbstore1009 to the zarcillo database, so that they get picked...
[16:34:18] Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (VirginiaPoundstone) a ping to @lbowmaker for Data Engineering eyes.
[16:46:14] (PS4) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[16:47:49] Data-Platform-SRE: Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (bking) Per today's Weds mtg, we need to add links to other dashboards from our SUP dashboard. You can add a "Markdown cell" to the dashboard to include the other dashboards.
[16:49:13] (DiskSpace) firing: Disk space an-worker1127:9100:/var/lib/hadoop/data/j 5.724% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:54:12] (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:59:14] (DiskSpace) firing: (9) Disk space an-worker1127:9100:/var/lib/hadoop/data/b 5.747% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:03:25] Oh dear, these HDFS datanodes are starting to get quite full.
[17:04:13] (DiskSpace) firing: (12) Disk space an-worker1127:9100:/var/lib/hadoop/data/b 5.6% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:13:23] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): Capacity warnings on an-worker1127 - https://phabricator.wikimedia.org/T354291 (BTullis)
[17:13:49] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): Capacity warnings on an-worker1127 - https://phabricator.wikimedia.org/T354291 (BTullis) p:Triage→High
[17:36:39] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): Capacity warnings on an-worker1127 - https://phabricator.wikimedia.org/T354291 (xcollazo) This is the Dumps 2.0 backfill. From [[ https://wikimedia.slack.com/archives/C02291Z9YQY/p1704302496823159 | Slack thread ]]:
[18:31:08] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search: Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (bking)
[19:08:09] Data-Engineering, Wmfdata-Python: Remove Wmfdata's custom update-notification code - https://phabricator.wikimedia.org/T346706 (nshahquinn-wmf) >>! In T346706#9433027, @AndrewTavis_WMDE wrote: > Sharing a user experience on this: I did update the package because of the notification before we got the emai...
[19:24:13] (DiskSpace) firing: (13) Disk space an-worker1114:9100:/var/lib/hadoop/data/g 5.945% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:29:13] (DiskSpace) firing: (14) Disk space an-worker1114:9100:/var/lib/hadoop/data/g 5.832% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:44:13] (DiskSpace) firing: (17) Disk space an-worker1114:9100:/var/lib/hadoop/data/g 5.866% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:49:13] (DiskSpace) firing: (21) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.925% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:54:13] (DiskSpace) firing: (24) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.918% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:39:13] (DiskSpace) firing: (25) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:49:13] (DiskSpace) firing: (26) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.203% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:54:13] (SystemdUnitFailed) firing: user-runtime-dir@43623.service Failed on stat1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:54:13] (DiskSpace) firing: (28) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.123% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:59:14] (DiskSpace) firing: (30) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.948% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:04:13] (DiskSpace) firing: (32) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.904% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:09:13] (DiskSpace) firing: (36) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.795% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:14:13] (DiskSpace) firing: (36) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.667% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:16:35] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:43:54] Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2087.codfw.wmnet with OS bullseye
[21:49:13] (DiskSpace) firing: (37) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.28% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:54:13] (DiskSpace) firing: (37) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.228% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:19:14] (DiskSpace) firing: (37) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.16% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:20:33] Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2087.codfw.wmnet with OS bullseye completed: - elastic2087 (**PASS**) - Remov...
[22:24:05] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[22:24:13] (DiskSpace) firing: (38) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 4.026% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:26:00] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) a:pfischer
[22:31:20] Data-Engineering, MediaWiki-General, Event-Platform, Patch-For-Review: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Krinkle) A few infra-level questions: 1. PHP execution. Afaik PHP execution is limited f...
[22:34:13] (DiskSpace) firing: (41) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.697% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:37:18] Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (bking)
[22:39:13] (DiskSpace) firing: (42) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.514% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:44:13] (DiskSpace) firing: (46) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.475% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:54:37] (DiskSpace) firing: (46) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.445% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:54:38] (DiskSpace) firing: (47) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.424% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:59:13] (DiskSpace) firing: (50) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.398% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:04:14] (DiskSpace) firing: (55) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.36% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:09:13] (DiskSpace) firing: (60) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.316% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:19:13] (DiskSpace) firing: (60) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.207% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:20:35] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (Mayakp.wiki) Hi DPE team, Can you pls let me know the status of this request? I was not able to get any results for `Sec-Purpose: Pref...
[23:29:13] (DiskSpace) firing: (57) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.145% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:34:13] (DiskSpace) firing: (58) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.15% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:39:13] (DiskSpace) firing: (60) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.133% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:46:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:54:14] (DiskSpace) firing: (60) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 3.109% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace