[00:17:47] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:55] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:29] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:47] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:27:47] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:22:47] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:32:47] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:36:14] * brouberol waves good morning [07:37:47] (SystemdUnitFailed) firing: (3) 
monitor_refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:13] Morning all. [08:36:00] Data-Platform-SRE, Discovery-Search, serviceops-radar, Epic, Kubernetes: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (JMeybohm) [08:39:25] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) Great! Thanks so much @xcollazo - It was the assembly file that I had overlooked in my brief testing. I'll mark this ticket... [08:52:48] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:16] (EventgateValidationErrors) firing: ... [09:00:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:24:58] Heads-up: we are going to attempt a reimage of kafka-jumbo1007.eqiad.wmnet. 
A couple of seconds of disturbance might be expected, until kafka reassigns leadership to other brokers. I'll signal when we start. [09:43:37] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (Addshore) @bking Would it be possible to get me access to an R2 bucket that is paid for by the WMF in some way? I'll happily continue my manual... [09:47:41] ^ silence set in alertmanager for under-replication alerts [09:53:14] Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye [09:54:21] kafka-jumbo1007.eqiad.wmnet is down [09:58:58] Cookbook didn't work. It didn't boot into PXE properly. Will check settings and maybe upgrade firmware. [10:02:03] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook) [10:02:28] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook) [10:02:34] Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (rook) [10:08:35] kafka-jumbo1007 booted to PXE manually, cookbook proceeding [10:14:34] IIRC manual PXE doesn't work now, DHCP needs a special temporary setting added by the cookbook [10:24:12] Data-Engineering, EventStreams, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (elukey) @Ottomata I checked some Event Gate's analytics [[ https://grafana.wikimedi... [11:02:40] (DruidNetflowSupervisor) firing: Zero wmf_netflow events received by druid_analytics over the last 30 minutes. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Netlow_Supervisor - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidNetflowSupervisor [11:04:49] elukey: Understood. We were using the DHCP automation, it's just that iDRAC requesting boot to PXE didn't work. I had to press F12 manually. [11:06:25] I'm still wrestling with a reuse-parts-test recipe for kafka-jumbo1007 which looks like it should work, but isn't. [11:42:23] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10REsquito-WMF) @Ottomata WE have created ticket https://phabricator... [11:42:47] btullis: yes yes sorry I mentioned since I was puzzled when I tried the first time after the new automations :D I was so used to force PXE manually :D [11:42:56] if you need help for the reuse recipe lemme know! [11:43:34] elukey: Thanks so much. I think we might be getting to the point where we need your help. [11:44:04] I'm just trying an install with this patch, but I don't really expect it to fix things: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968637 [11:44:12] me and Eric struggled a little for aqs1010, not sure if he managed to make it work, maybe there is something with the reuse recipe script that doesn't work anymore [11:45:04] Yeah, I'm surprised to find that there is no `/tmp/reuse-parts` directory on the server. I thought it logged what it was doing there. [11:45:28] IIRC there should be some info of what the reuse script does in the d-i console [11:45:42] there is a specific event log that one can inspect [11:45:52] it is not easy to parse but it may contain some clue [11:45:55] https://usercontent.irccloud-cdn.com/file/I8YoyMQX/image.png [11:45:58] what is the issue? 
[11:46:07] No, it's still not working. [11:46:43] The /dev/mapper/vg0-root doesn't get mounted to / and the /dev/vg1-srv doesn't get mounted to /srv/ [11:46:59] this is the same thing that me and Eric struggled with [11:47:01] Although the reuse recipe looks good to my eye and it gets downloaded. [11:47:12] bookworm right? [11:47:19] No, bullseye. [11:47:26] ah weird, okok [11:47:39] can you check the d-i console? [11:47:44] if there is anything logged in there [11:48:15] I have to step away for a bit to eat. I'll disconnect from the IPMI console and the `install_console` - Feel free to jump on, and brouberol is around to help too. [11:48:24] okok super [11:48:59] I'm out. [11:54:13] o/ [11:55:01] elukey: I'm available to pair, if that helps [12:13:38] brouberol: back! Sorry I was commuting to the office [12:14:32] so I had the same issue with Eric, for aqs1010: the mountpoints were not present in the debian partition menu [12:14:53] and the recipe seems good [12:15:25] so what may help is to check in the debian install event log (should be available as console IIRC) if anything is logged in the partman logfile [12:15:38] like "cannot use mountpoint because XYZ" [12:20:01] I'm not seeing anything out of the ordinary, but again, I'm not well versed w/ partman [12:21:22] do you mean in the log? [12:21:34] for example `~ # grep /srv /var/log/partman` yields nothing, when I would expect it to at least mention that /dev/mapper/vg1-root would be mounted under /srv [12:21:36] yes [12:22:10] if you grep for lvm or mapper or similar? [12:23:23] do you see anything under /tmp/reuse-parts ? [12:23:47] * btullis back [12:24:12] That was my concern. There is no `/tmp/reuse-parts` created on the server. 
[12:24:27] nope [12:24:31] ~ # grep reuse /var/log/partman [12:24:31] ~ # [12:24:47] brouberol: nono I mean ls /tmp/reuse-parts [12:25:25] ah sorry [12:25:39] ls: /tmp/reuse-parts: No such file or directory [12:26:07] btullis: yeah I expected nothing on the server itself, IIRC d-i uses a separate minimal fs for its work [12:27:05] Yeah, there is no /tmp/reuse-parts even in this in-memory file system. [12:29:20] what jumbo node are we trying to reimage? [12:29:27] kafka-jumbo1007 [12:29:28] kafka-jumbo1007 [12:29:32] :-) [12:30:47] netboot looks good, same thing for the recipe (doing basic checks) [12:32:23] do you mind if I attach to the console? [12:32:32] brouberol: --^ [12:32:33] Please, be my guest [12:32:45] surwe [12:32:51] do I need to disconnect? [12:32:54] *sure [12:33:00] yeah exactly [12:33:04] done [12:33:09] all yours [12:34:53] Data-Engineering, EventStreams, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (Ottomata) Sounds good, let's pause rollout. eventgate-analytics is under more heavy... [12:37:31] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) Okay, we can delay timelines as needed. We've delayed sin... 
[12:48:39] I'm stepping out for about 1h [12:52:48] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:24] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10REsquito-WMF) Thanks @Ottomata :) We'll update here once we have t... [12:57:33] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) Looking at yesterday's downloads with a rudimentary grep we're not far from 1K downloads, and that's just for the //latest-all// ones... [13:02:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) In the MR review @gmodena wrote: > I thin... [13:07:48] (SystemdUnitFailed) firing: (2) cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:04] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) ^ Update. 
[13:14:05] brouberol, btullis - my soul is deeply sad https://gerrit.wikimedia.org/r/c/operations/puppet/+/968659 [13:14:55] totally my fault, I forgot the "\" when fixing the script with Eric [13:16:34] haha, I was watching the onboarding chat and they mentioned this possible issue [13:16:35] running puppet on the install servers, after that you can retry the reimage [13:16:37] elukey: Don't be sad, this is a win \o/ Thanks so much for finding out what it was :-) [13:16:46] thank you so much! [13:16:47] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10dcausse) >>! In T266798#9277763, @Ottomata wrote: > cc @dcausse for... [13:17:18] cc: urandom: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968659 is why we had issues with aqs1010 sigh [13:18:27] Ha! [13:18:42] elukey: good catch! [13:18:47] brouberol: you can retry the reimage, lemme know how it goes [13:19:14] urandom: I ended up checking the perms of the file installed on the d-i console, and the next minute was silence and tears [13:19:32] :) [13:21:44] (03PS1) 10Peter Fischer: cirrussearch/update_pipeline/update add cirrussearch_fetched_at timestamp [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/968661 [13:24:03] Heads-up, there will be a little downtime for superset.wikimedia.org as the server is going to be moved. Apologies for the short notice. 
[13:35:16] hnowlan: o/ [13:35:35] testing the rest-gateway, and I see something odd [13:35:47] https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23 works fine [13:35:50] but [13:35:57] elukey@stat1004:~$ curl https://rest-gateway.discovery.wmnet:4113/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23 [13:36:00] {"httpCode":404,"httpReason":"Not Found"} [13:36:13] do we have a different URI scheme if we hit the endpoint internally? [13:36:23] Cc: kevinbazira: [13:36:38] Perfect. Thanks again elukey <3 https://usercontent.irccloud-cdn.com/file/dAyML510/image.png [13:36:42] (03PS1) 10Kimberly Sarabia: Adds skin field in mobilewebuiactions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T346106) [13:36:47] btullis: \o/ [13:37:05] urandom: confirmed that you can proceed with aqs reimaged [13:37:08] *reimages [13:42:41] btullis elukey: I’ll be home in 10 minutes. I’ll try rerunning the reimage then [13:43:50] brouberol: Sorry, I have already set it going :-) [13:44:19] Even better [13:47:34] 10Data-Platform-SRE, 10Data-Catalog, 10Patch-For-Review: Create Airflow Pipeline for Ingesting/Updating Superset Data - https://phabricator.wikimedia.org/T309622 (10Ottomata) Should revisit this task? Should be easy enough to do now? [13:53:11] 10Data-Engineering, 10Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (10Ottomata) Bump? 
:) [14:00:04] (03PS2) 10Kimberly Sarabia: Adds skin field in mobilewebuiactions [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968674 (https://phabricator.wikimedia.org/T346106) [14:00:12] elukey: cool; I'm blocked by something else currently, but this is one less thing to worry about :) [14:02:48] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:56] (03PS1) 10Brouberol: Rely on multiple kafka bootstrap servers in different racks [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 [14:09:26] (03CR) 10Ottomata: Rely on multiple kafka bootstrap servers in different racks (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [14:10:34] (03CR) 10Brouberol: Rely on multiple kafka bootstrap servers in different racks (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [14:11:49] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1007.eqiad.wmnet with OS bullseye completed: - kafka-jumbo1007 (**PASS**) - Downtimed on I... [14:11:59] (03CR) 10Ottomata: Rely on multiple kafka bootstrap servers in different racks (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [14:12:41] (DruidNetflowSupervisor) resolved: Zero wmf_netflow events received by druid_analytics over the last 30 minutes. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Netlow_Supervisor - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidNetflowSupervisor [14:13:19] (03PS2) 10Brouberol: Rely on multiple kafka bootstrap servers in different racks [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 [14:13:45] btullis: ^ as we discussed, removal of a kafka configuration spof [14:15:42] (03CR) 10Brouberol: Rely on multiple kafka bootstrap servers in different racks (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [14:16:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:48] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:12] (03PS2) 10Peter Fischer: cirrussearch/update_pipeline/update add cirrussearch_fetched_at timestamp [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/968661 [14:18:52] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) I also see https://grafana.wikimedia.org/d/000000264/wikidata-dump-downloads?orgId=1&refresh=5m&from=now-2y&to=now which I noticed fr... [14:20:57] (03CR) 10Brouberol: Rely on multiple kafka bootstrap servers in different racks (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [14:27:08] (03CR) 10Btullis: [C: 03+1] "Looks good to me. 
I also think that a kafka discovery address would be a good thing, but this is helpful in the meantime." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [14:47:46] (EventgateValidationErrors) resolved: ... [14:47:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [14:50:28] 10Data-Engineering, 10serviceops, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) Profiled changeprop on nodejs 10 in staging as well: https://phabricator.wikimedia.org/P53054 One thing that I noticed is this: node 18: ` ticks total nonl... [15:01:03] 10Data-Platform-SRE, 10Data-Catalog, 10Patch-For-Review: Create Airflow Pipeline for Ingesting/Updating Superset Data - https://phabricator.wikimedia.org/T309622 (10BTullis) >>! In T309622#9280046, @Ottomata wrote: > Should revisit this task? Should be easy enough to do now? Yes, it would be great to get t... 
[15:03:48] Data-Engineering, Machine-Learning-Team, serviceops: URI to use when hitting the Pageviews API on rest-gateway - https://phabricator.wikimedia.org/T349722 (elukey) [15:03:49] hnowlan: opened https://phabricator.wikimedia.org/T349722 [15:04:54] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) [15:22:38] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) @xcollazo I'm happy to be guided by you as to when this should be deployed to production. I believe that this patch is rea... [15:27:44] Data-Engineering, Data-Platform-SRE, AQS2.0: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) I think we can probably decline this ticket now, given that we are so close to sunsetting AQS 1.0. @VirginiaPoundstone - would you agree? 
[15:29:17] Data-Engineering, Data-Platform-SRE: Write a design document relating to superset on dse-k8s - https://phabricator.wikimedia.org/T349396 (BTullis) p:Triage→Medium [15:31:46] (CR) Ebernhardson: [C: +2] cirrussearch/update_pipeline/update add cirrussearch_fetched_at timestamp [schemas/event/primary] - https://gerrit.wikimedia.org/r/968661 (owner: Peter Fischer) [15:32:48] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:16] Data-Engineering, Machine-Learning-Team, serviceops: URI to use when hitting the Pageviews API on rest-gateway - https://phabricator.wikimedia.org/T349722 (hnowlan) Documentation fail on my part - this endpoint requires the host header of "wikimedia.org" be set. This is to force clients at the edge t... [15:44:12] Data-Engineering, Machine-Learning-Team, serviceops: URI to use when hitting the Pageviews API on rest-gateway - https://phabricator.wikimedia.org/T349722 (elukey) Open→Resolved a:elukey Right this works! ` curl https://rest-gateway.discovery.wmnet:4113/wikimedia.org/v1/metrics/pageviews... 
[15:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:09] (03Merged) 10jenkins-bot: cirrussearch/update_pipeline/update add cirrussearch_fetched_at timestamp [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/968661 (owner: 10Peter Fischer) [15:47:48] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:03] (03PS1) 10Kimberly Sarabia: Adds new readme [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) [16:02:48] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [16:15:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:05] 10Data-Engineering, 10Data-Platform-SRE: Write a design document relating to superset on dse-k8s - https://phabricator.wikimedia.org/T349396 (10BTullis) It's also worth bearing in mind this ticket: {T309622} which is about how we intend to ingest metadata from Superset to DataHub. Our current superset instanc... [16:45:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:29] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - 
https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [17:30:08] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) Okay, the 404s are all to '/' AKA `path: "root"` https://grafana.wikimedi... [17:30:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:21] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) > Latency and CPU usage increased At least memory usage went down? I... 
[17:56:09] Data-Engineering: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (nshahquinn-wmf) [17:56:53] Data-Engineering: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (nshahquinn-wmf) [18:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:50] Data-Engineering, Data-Engineering-Wikistats: Automate creation of sqoop list of wikis to import data for from sitematrix - https://phabricator.wikimedia.org/T190700 (nshahquinn-wmf) This would only work for the snapshots, but a simple solution would be to just pull the sqoop list from `canonical_data.wi... [18:08:34] Data-Engineering: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (nshahquinn-wmf) [18:10:59] Data-Engineering, Data Engineering and Event Platform Team (Sprint 4): [Data Quality] List out data path options for Prometheus vs. 
Hive as a metrics backend - https://phabricator.wikimedia.org/T349744 (Ahoelzl)
[19:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:34:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:06] Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (Ottomata)
[19:43:40] (PS3) Ottomata: Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854)
[20:00:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:48] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:15:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:46] Data-Engineering, Tool-Pageviews, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (Ladsgroup) Thanks. From my contacts, this seems to be fixed now.
[21:19:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:06:01] Data-Platform-SRE, Discovery-Search (Current work): Create dashboards/alerts for new Search Update Pipeline - https://phabricator.wikimedia.org/T349772 (bking)
[22:15:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:19:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:46] Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (RobH)
[23:02:33] Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (RobH)
[23:04:12] Data-Platform-SRE, DC-Ops, ops-codfw: Q1:rack/setup/install elastic2087-2092 - https://phabricator.wikimedia.org/T349778 (RobH)
[23:04:41] Data-Platform-SRE, DC-Ops, ops-codfw: Q1:rack/setup/install elastic2087-2092 - https://phabricator.wikimedia.org/T349778 (RobH)
[23:05:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:11] Data-Platform-SRE, DC-Ops, ops-codfw: Q2:rack/setup/install elastic2087-2092 - https://phabricator.wikimedia.org/T349778 (RobH)
[23:06:22] Data-Platform-SRE, DC-Ops, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (RobH)
[23:08:41] Data-Platform-SRE, DC-Ops, ops-codfw: Q2:rack/setup/install elastic2093-2110 - https://phabricator.wikimedia.org/T349780 (RobH)
[23:09:09] Data-Platform-SRE, DC-Ops, ops-codfw: Q2:rack/setup/install elastic2093-2110 - https://phabricator.wikimedia.org/T349780 (RobH)
[23:15:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:19:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:28] Data-Engineering, Tool-Pageviews, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (KAP_Jasa) thank you all :-)
[23:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state