[00:04:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:36] (CR) Brouberol: [C: +2] Remove kafka-jumbo100[1-6] brokers from bootstrap hosts [analytics/refinery] - https://gerrit.wikimedia.org/r/965166 (https://phabricator.wikimedia.org/T336044) (owner: Brouberol)
[05:22:41] (CR) Brouberol: [V: +2 C: +2] Remove kafka-jumbo100[1-6] brokers from bootstrap hosts [analytics/refinery] - https://gerrit.wikimedia.org/r/965166 (https://phabricator.wikimedia.org/T336044) (owner: Brouberol)
[06:37:16] Hi brouberol - I've added the patch you merged to our deploy train for next week - when you provide patches for refinery, it's a good idea to follow up when you merge them, so that we know they are to be deployed. For instance, in this specific case, some job restarts are needed in addition to the patch :)
[06:54:45] you mean add them to the deploy train etherpad?
[06:57:50] Yes absolutely brouberol
[06:58:23] oh, sorry, indeed. I realized I had forgotten to do so while I was out walking the dog, and was planning to add the MR ref to it this morning
[06:58:45] No big deal, I've done it - just a friendly reminder :)
[06:59:06] thanks!
[08:31:09] (PS1) Aqu: Retry produceCanaryEvents [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965662
[08:46:44] (PS2) Aqu: Retry produceCanaryEvents [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965662
[11:21:35] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (JAllemandou) Hi @BTullis - I'm sorry I missed the ping yesterday. I think we can go to prod with the new version. Let's do that on Monday :)
[11:36:08] (Abandoned) Aqu: Retry produceCanaryEvents [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965662 (owner: Aqu)
[11:38:24] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis) Great! Will do.
[12:22:50] (CR) Mforns: "LGTM! Thank you for putting this together!" [analytics/refinery] - https://gerrit.wikimedia.org/r/962657 (https://phabricator.wikimedia.org/T344277) (owner: MNeisler)
[12:47:48] Data-Engineering, Data-Platform-SRE, Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (BTullis) Resolved→Open I'm re-opening this ticket, as we have made significant advances on the use of the built-in Alluxio SDK cache: https://prestodb.io/doc...
[12:55:05] Data-Engineering, Data Engineering and Event Platform Team, Wikidata, Wikidata-Query-Service, and 2 others: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service - https://phabricator.wikimedia.org/T342593 (Gehel) Open→Resolved
[12:59:07] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel)
[12:59:09] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[12:59:13] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (Gehel) Open→Resolved
[12:59:40] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (ayounsi) Found that task when investigating something else. search-loader2002: `The last Puppet run was at Tue Sep 26 06:58:45 UTC 2023 (24831 minutes ago). ` It's not really recommended to hav...
[13:00:09] Data-Platform-SRE, SRE-OnFire, Discovery-Search (Current work), Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (Gehel) Open→Resolved a:Gehel Incident report is written, follow up tasks are created, let's close this.
[13:06:41] Hello mforns - Would you be nearby?
[13:44:46] hello joal!
[13:45:21] Data-Engineering, Data-Platform-SRE, Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (BTullis)
[13:45:40] Hi mforns!
[13:45:48] Would you give me 15min of your time?
[13:45:53] batcave!
[13:46:58] OMW!
[13:52:23] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking) @ayounsi sorry for the trouble; this is not a production host. I will delete it shortly.
[13:55:52] Data-Platform-SRE: Prometheus unable to scrape search-loader[12]002 - https://phabricator.wikimedia.org/T348222 (bking) @fgiunchedi Sorry for the trouble. These hosts are part of a (failed) experiment to update the search-loader application to Bullseye. This project has been deprioritized and at this point i...
[14:01:47] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (dr0ptp4kt) @bking just wanted to express my gratitude for the support on this ticket and its friends {T344905} and {T347647}. FWIW I do think it...
[14:02:57] btullis just curious, do any of your conda deb pkgs include older versions of Python? Specifically looking for 3.7 on Bullseye or later
[14:05:09] Data-Engineering, Data-Platform-SRE, Data Products, Wikidata, Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (dr0ptp4kt) > I think the amount of time taken to decompress the JNL file should also be taken into consideration on...
[14:05:52] Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (BTullis) I've scheduled some time to work on this on Monday 16th of October at approximately 09:00 UTC. Is that OK with you @MoritzMuehlenhoff? My plan is to: - log in SAL an...
[14:08:18] inflatador: We've not done anything with old versions of python, but I'm aware of a few people having tried /different/ versions of python with conda.
[14:08:30] Have a look at this thread: https://wikimedia.slack.com/archives/CLKDS4MG9/p1584539766214000
[14:10:02] I'm trying this now, to see if it works:
[14:10:07] https://www.irccloud.com/pastebin/Zi16hKvi/
[14:11:25] btullis, thanks! for context I'm looking at https://phabricator.wikimedia.org/T346039 . I still think we need to move to k8s, but this might be a stopgap if we can get it going
[14:12:54] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008....
[14:12:59] Seems to work for me anyway.
[14:13:03] https://www.irccloud.com/pastebin/kssp4avt/
[14:13:44] thanks, will pass along to our SWEs to see if this would be useful to them
[14:14:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/965748 is for moving our bullseye hosts back to insetup if you have time to take a look...just to stop noisy alerts
[14:14:33] inflatador: ack on the previous message. It's a bit heavyweight to make a whole deb of this: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics
[14:18:52] btullis ACK, luckily someone else did the hard work already ;P
[14:19:28] anyway, will talk it over w/Erik when he gets in re: whether we could/should use Conda or just creep along on Buster until the k8s stuff is ready
[14:20:13] Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (MoritzMuehlenhoff) >>! In T346135#9249993, @BTullis wrote: > I've scheduled some time to work on this on Monday 16th of October at approximately 09:00 UTC. > Is that OK with y...
[14:23:15] Yep, cool. Happy to help wherever I can. We have some other examples of where we have created conda based debs. e.g. https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags uses similar techniques. k8s way is definitely preferable though :-)
[14:24:21] Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (elukey) Deployed ES to production, so far all good! I observed some higher latency only for co...
[14:25:55] Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (BTullis) >>! In T346135#9250046, @MoritzMuehlenhoff wrote: >>>! In T346135#9249993, @BTullis wrote: >> I've scheduled some time to work on this on Monday 16th of October at ap...
[15:19:44] Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (BTullis) @fgiunchedi - I wonder if you might be able to advise here, please? We have an x509 certificate on-disk, but it's not exposed via a TCP service. We would like to check its expiry via Alertmanager...
[15:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:19] (PS1) Aqu: Use canonical_data.countries when generating referer table [analytics/refinery] - https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504)
[15:56:10] (PS2) Aqu: Use canonical_data.countries when populating the referer tables [analytics/refinery] - https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504)
[15:57:41] (PS2) MNeisler: Add the wikifunctions_ui metrics platform schema to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/962657 (https://phabricator.wikimedia.org/T344277)
[15:59:42] (CR) MNeisler: Add the wikifunctions_ui metrics platform schema to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/962657 (https://phabricator.wikimedia.org/T344277) (owner: MNeisler)
[16:01:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:12] (CR) Mforns: [V: +2 C: +2] "LGTM! Thanks" [analytics/refinery] - https://gerrit.wikimedia.org/r/962657 (https://phabricator.wikimedia.org/T344277) (owner: MNeisler)
[16:11:06] (PS3) Aqu: Use canonical_data.countries when populating the referer tables [analytics/refinery] - https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504)
[16:29:50] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki...
[16:34:59] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Jclark-ctr) a:VRiley-WMF→Jclark-ctr
[16:35:46] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Jclark-ctr) In progress→Resolved
[16:35:52] Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (Jclark-ctr)
[17:50:18] Data-Engineering, API Platform: Media views is returning "file not found" for many files - https://phabricator.wikimedia.org/T348889 (Ladsgroup)
[17:54:43] Data-Engineering, API Platform: Media views is returning "file not found" for many files - https://phabricator.wikimedia.org/T348889 (Ladsgroup) Similarly for https://commons.wikimedia.org/wiki/File:Bundestagswahl_erkl%C3%A4rt_Erst-_und_Zweitstimme_von_Tagesschau.webm (https://pageviews.wmcloud.org/media...
[19:13:48] (PS1) Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767)
[19:16:38] (CR) Milimetric: [C: -1] "I am -1-ing because I have more tests to run and a couple more bugs to track down, but the bulk of the changes should be here." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: Milimetric)
[19:22:23] (CR) CI reject: [V: -1] Improve fidelity of dumps import [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: Milimetric)
[20:51:07] (CR) Xcollazo: "Some more comments below." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/963836 (https://phabricator.wikimedia.org/T348761) (owner: Milimetric)
[20:59:12] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (CodeReviewBot) ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/7 Update python to 3.10
[20:59:27] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (EBernhardson) Patch looks to work, but we will also want to check the model outputs after deploying the update.