[00:31:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:16] (SystemdUnitFailed) firing: (6) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:16] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:02] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:16] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi)
[07:09:07] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MoritzMuehlenhoff)
[07:24:11] 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#8733992, @cmooney wrote: > Must be a race condition of some kind I'm guessing but not sure what it might be. Pro...
[07:24:21] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Ladsgroup) MW section masters: - db1100: s5 - db1131: s6 - db1181: s7 Need to downtime the whole sections for these. I'll do it a b...
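The SystemdUnitFailed alerts above point at failed monitor_refine_* units on an-launcher1002; the 07:34 rerun below is the remediation. For context, a minimal triage sketch using standard systemd commands (assuming shell access to an-launcher1002; the unit name is the one from the alert):

    # list every unit currently in the failed state
    systemctl list-units --state=failed
    # inspect the failing unit and its recent logs
    systemctl status monitor_refine_event.service
    sudo journalctl -u monitor_refine_event.service --since today
    # after remediation, clear the failed state so the check recovers
    sudo systemctl reset-failed monitor_refine_event.service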
[07:34:28] !log Rerun refine_event with "sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --ignore_failure_flag=true --table_include_regex='mediawiki_visual_editor_feature_use|mediawiki_edit_attempt|mediawiki_web_ui_interactions' --since='2023-04-02T18:00:00.000Z' --until='2023-04-03T19:00:00.000Z'"
[07:34:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:01:17] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:02] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:47] 10Data-Engineering, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10dcausse) @Ottomata yes the job running with the flink-operator on the dse is using checkpoints, it can be used to experiment with zookeeper, w...
[08:25:02] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:02] (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:31] 10Data-Engineering, 10DBA, 10Data-Services: Prepare and check storage layer for ckbwiktionary - https://phabricator.wikimedia.org/T331834 (10Ladsgroup) a:05Ladsgroup→03None I created the database and gave the rights to labsdbuser, it's now data engineering's turn to run their scripts.
[09:32:48] 10Data-Engineering, 10Data-Engineering-Wikistats: Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10Radim.kubacki)
[09:46:40] (03PS1) 10Barakat Ajadi: Navtiming: Add longtask task and longtask duration before FCP [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/905589 (https://phabricator.wikimedia.org/T327477)
[09:51:46] (03CR) 10Phuedx: [C: 03+1] sanitization: Remove some NavigationTiming retentions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/904660 (owner: 10Krinkle)
[09:52:37] 10Data-Engineering, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10elukey) Zookeeper is probably going to be supported for a long time at the WMF, it is mostly Kafka related but migrating away from it means:...
[10:14:54] (03CR) 10Phedenskog: [C: 03+2] "Looks good!" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/905589 (https://phabricator.wikimedia.org/T327477) (owner: 10Barakat Ajadi)
[10:15:47] (03Merged) 10jenkins-bot: Navtiming: Add longtask task and longtask duration before FCP [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/905589 (https://phabricator.wikimedia.org/T327477) (owner: 10Barakat Ajadi)
[10:20:31] steve_munene: o/
[10:20:56] do you need any review/support/etc.. for today's row C maintenance?
[10:32:28] afaics from https://phabricator.wikimedia.org/T331882 there is quite a bit of work to do, starting in an hour
[10:33:41] joal: around by any chance?
[10:41:10] Sent the email to analytics-announce for the matomo/superset/turnilo downtime
[10:41:56] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10elukey)
[10:45:49] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/905596 for yarn and gobblin
[10:46:22] DE folks - is there anybody who can assist me when I stop yarn queues and gobblin timers?
[10:46:30] (going out for a quick lunch, back in a bit)
[10:48:44] hi elukey, yes I do - sending out a notification for hadoop and yarn
[10:49:23] steve_munene: ack perfect, I sent one for Turnilo/Superset/Matomo
[10:49:44] there is also the code review out, I didn't see you online so I went ahead and created one
[10:50:02] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:50:13] going out for lunch, will be back in a bit. Lemme know if you need help from me :)
[11:05:02] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:45] a-team
[11:27:17] jynus: o/ I think that Steve is out for lunch, anything that you need?
[11:27:53] yes, we need someone with general pageviews knowledge
[11:28:28] it's an ongoing incident - not an emergency, but relatively time sensitive
[11:29:34] I can maybe ping Steve after he comes back
[11:30:22] but is it related to the infra or to the Pageview content?
[11:30:27] I can try to help if needed
[11:30:30] both
[11:30:50] you can read the backlog on a channel you are in
[11:32:30] search for the "data engineering" mention, but we may need someone with specific pageviews knowledge
[11:32:43] yep I think so
[11:33:05] elukey: do you know at least if they have a general contact email?
[11:33:48] there is a public one IIRC, not sure about any internal ones, probably they all use slack
[11:33:59] I'd suggest following up with either mforns or milimetric
[11:34:08] yep, I tried
[11:34:59] will try a bit later
[11:35:05] thank you, elukey
[11:40:58] hi jynus, what channel?
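For readability, the refine rerun logged at 07:34 above, broken out one flag per line (command and arguments are verbatim from the log entry; only the layout is new):

    # re-attempt hours previously marked as failed, limited to the three
    # affected tables and the failed time window
    sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event \
        --ignore_failure_flag=true \
        --table_include_regex='mediawiki_visual_editor_feature_use|mediawiki_edit_attempt|mediawiki_web_ui_interactions' \
        --since='2023-04-02T18:00:00.000Z' \
        --until='2023-04-03T19:00:00.000Z'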
[11:41:26] milimetric: I PMed you
[11:41:52] (btw our team email is data-engineering-team)
[11:42:47] !log stop puppet on an-launcher1002 and manually stop .timer units
[11:42:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:23] steve_munene: not sure what the procedure is, but to avoid stopping ongoing jobs on an-launcher1002 I just stopped the relevant .timer systemd units
[11:44:32] and disabled puppet of course
[11:44:55] so the .service units, if any were running, would keep going, but they wouldn't be rescheduled
[11:45:17] (it is an alternative to "absenting" all timers)
[11:46:30] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) >>! In T330693#8701662, @gmodena wrote: [...] >> @MatthewVernon brou...
[11:47:11] updated the code review just to stop Yarn queues
[11:48:28] lemme know when you are back so we can stop yarn
[11:49:04] Were there multiple running jobs?
[11:49:09] I am back
[11:49:35] I didn't check yarn yet
[11:49:45] or do you mean on an-launcher?
[11:49:45] steve_munene: I contacted mili with all the info, please coordinate to see who is the right person to help us
[11:51:01] Meant on an-launcher, that would require us to avoid stopping the timers
[11:51:11] thanks jynus, reaching out
[11:51:58] steve_munene: so the .timer units (the config that systemd uses to periodically schedule .service units, like gobblin etc..) can be seen via `systemctl list-timers`
[11:51:59] sorry for the urgency - it is not "wikis are down" levels of emergency, but I thought it was important to ask for your help, thank you
[11:52:35] steve_munene: if you stop (manually via `systemctl stop blabla.timer`) only the timer unit, the .service one will keep going, but it will not be rescheduled
[11:52:51] with the puppet "absent" way, we remove all configs for all the timers from the node
[11:52:56] that is a bit more brutal
[11:53:22] this is why I opted for simply stopping the relevant .timer jobs manually, less invasive (at least, that's what I used to do at the time)
[11:53:41] for Yarn we'll need to deploy the patch and call the special refresh queue command
[11:54:05] and I think we are about on time, the task suggests doing it half an hour before maintenance starts
[11:54:17] thanks for the explanation.
[11:55:26] Sure, good timing. Hop on a call?
[11:56:13] if you don't mind let's sync in here, I am finishing one thing and I'd need to prep for the maintenance as well :)
[11:57:44] so the maintenance is in ~1 hour
[11:57:58] and I am reading the DE section of https://phabricator.wikimedia.org/T331882
[11:58:08] That's cool
[11:58:17] stopping the Yarn queues + HDFS safe mode can be done in a bit (30 mins before)
[11:58:26] so we can focus on the depool actions
[11:58:47] do you know how to depool a node? Otherwise I'll give you some info
[11:59:40] Haven't done one yet
[12:00:28] ack, so there are two ways
[12:00:43] 1) you ssh to the node and execute `sudo -i depool`
[12:00:54] 2) you use conftool from puppetmaster1001 (see https://wikitech.wikimedia.org/wiki/Conftool)
[12:01:37] the good thing about 2) is that you get a log entry in the SRE IRC SAL automatically (https://sal.toolforge.org/production)
[12:01:50] but 1) is fine as well, especially if you haven't done 2) before
[12:02:25] and then, after the depool, you can check whether the backend is pooled or not at https://config-master.wikimedia.org/pybal/eqiad
[12:02:34] (there are dedicated pages for every service)
[12:04:07] and in our case, we need to depool some nodes in "aqs" and some nodes for "datahub"
[12:04:23] (and after the maintenance, we need to repool them)
[12:07:42] choose the path that you prefer, I can give you any info :)
[12:07:57] after that, we'll start the procedure for Yarn
[12:08:29] Thanks, checking on 2 to get the right syntax.
[12:09:57] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ssingh)
[12:12:13] Going with 2, should I get started on datahubsearch?
[12:13:16] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:16:17] steve_munene: let's sync on the command to execute first, can you paste it in here?
[12:17:00] (conftool is very powerful and the first few times it is best to double-check, to avoid depooling too many things by mistake etc..)
[12:18:21] sure, from puppetmaster1001: confctl depool --hostname datahubsearch1003.eqiad.wmnet
[12:19:40] ack, remember to use sudo -i in front
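A consolidated sketch of the pre-maintenance steps discussed above: stop puppet and the relevant .timer units on an-launcher1002, then depool the affected backends. The timer name below is a hypothetical placeholder (real names come from list-timers), and the disable-puppet wrapper is the usual WMF convention, assumed here; the depool commands are the ones quoted in the log:

    # on an-launcher1002: keep puppet from restarting the timers
    sudo disable-puppet 'eqiad row C maintenance - T331882'
    # list the scheduled .timer units, then stop the relevant ones; a .service
    # that is already running keeps going, it just won't be rescheduled
    systemctl list-timers
    sudo systemctl stop example-gobblin-job.timer   # hypothetical unit name

    # depool, either locally on each node...
    sudo -i depool
    # ...or via conftool from puppetmaster1001 (also logs to the SRE SAL)
    sudo -i confctl depool --hostname datahubsearch1003.eqiad.wmnet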
[12:21:37] Ack, same for the aqs servers?
[12:21:43] yep exactly
[12:23:58] cool, getting started
[12:24:25] ack, we are about on time to stop yarn queues
[12:29:16] Confirmed depool
[12:30:04] nice :)
[12:30:11] next step is yarn
[12:30:22] so IIRC the procedure is the following:
[12:30:25] 1) merge the puppet change
[12:30:42] 2) run puppet on an-master100[12], so that the yarn config gets updated
[12:30:44] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene)
[12:30:51] 3) run the command to refresh the queues
[12:31:44] not familiar with the queue refresh
[12:32:14] I was looking into https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration but I didn't find it
[12:32:39] so you can restart the yarn resource managers on an-master100[12], or just run the refreshQueues command on a single node (it will reload the queue config)
[12:32:42] lemme find it
[12:33:09] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:33:27] should be something like `sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues`
[12:34:43] (03PS3) 10Lgaulia: Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012)
[12:34:55] ack, getting started
[12:34:58] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:37:41] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:39:08] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon)
[12:39:48] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10achou) > We could def put them in the same event stream, as long as they share the same...
[12:41:27] (03PS4) 10Lgaulia: Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012)
[12:42:01] (03CR) 10CI reject: [V: 04-1] Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012) (owner: 10Lgaulia)
[12:44:14] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon)
[12:44:49] Done with the yarn queues and refreshed, waiting to put HDFS into safe mode in a few
[12:44:59] ack, do you know how?
[12:45:13] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad...
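A sketch of the Yarn queue-stop procedure elukey outlines above, once the puppet change is merged. The rmadmin command is as quoted in the log; running puppet via cumin and the run-puppet-agent wrapper is an assumed convenience, equivalent to running puppet on each an-master by hand:

    # from a cumin host: apply the merged puppet change on both resource managers
    sudo cumin 'an-master100[1-2].eqiad.wmnet' 'run-puppet-agent'
    # then, on one an-master node, have Yarn reload the queue configuration
    sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues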
[12:45:27] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:45:45] (if you want to check how the queues are doing, you can inspect https://yarn.wikimedia.org/cluster/scheduler?openQueues=Queue:%20root#Queue:%20root#Queue:%20default)
[12:46:31] steve_munene: I am not sure what the current jobs in the running state are doing, those could probably be idle spark sessions
[12:46:54] we can avoid killing them, but when you enter safemode those jobs will start to fail (if they are writing to hdfs)
[12:47:03] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:47:55] dcausse: o/ we are about to enter HDFS safemode, the Flink job may not like it - https://yarn.wikimedia.org/cluster/app/application_1678266962370_104769
[12:48:15] elukey: thanks, stopping it
[12:48:41] thanks :)
[12:49:07] steve_munene: and once you are done, you can write in #wikimedia-sre that the DE part is good (and update the task's description as well)
[12:51:53] cool, thanks elukey
[12:52:29] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:57:57] !log putting hdfs into safe mode as part of T331882
[12:58:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:58:01] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[12:59:15] 10Data-Engineering-Planning, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 11): Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF)
[12:59:34] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene)
[13:00:02] 10Data-Engineering-Planning, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 11): Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF) a:03Antoine_Quhen
[13:00:44] 10Data-Engineering, 10Data-Engineering-Wikistats: Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF)
[13:01:19] 10Data-Engineering, 10Data-Engineering-Wikistats: Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF) a:05Antoine_Quhen→03None
[13:02:11] steve_munene: nice!
[13:02:20] so to rollback when everything is done:
[13:02:42] sure, I shall reach out
[13:02:57] - ssh to an-launcher, re-enable puppet and run it (should be sufficient to restore the state).
[13:03:13] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80a32cef-9700-4047-8185-415ffca1aaa2) set by ayounsi@cumin1001 for 2:0...
[13:03:16] err, before that, remove safe mode
[13:03:41] then revert the yarn queue patch and refresh its queues
[13:03:51] and finally, repool all nodes via conftool
[13:04:02] I'll be available if needed!
[13:04:43] ack, thanks.
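elukey's rollback checklist above (13:02-13:03), as a sketch. hdfs dfsadmin -safemode is the standard Hadoop command; wrapping it in kerberos-run-command as the hdfs user, the enable-puppet/run-puppet-agent wrappers, and the symmetric confctl pool form (mirroring the depool used earlier) are assumptions in line with the commands seen elsewhere in this log:

    # 1) on an an-master node: leave HDFS safe mode
    sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
    # 2) on an-launcher1002: re-enable puppet and run it to restore the timers
    sudo enable-puppet 'eqiad row C maintenance - T331882'
    sudo run-puppet-agent
    # 3) revert the yarn queue patch, rerun puppet on an-master100[12], then:
    sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues
    # 4) repool every node that was depooled, e.g.:
    sudo -i confctl pool --hostname datahubsearch1003.eqiad.wmnet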
[13:06:00] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad...
[13:15:33] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10hnowlan)
[13:15:50] PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:52] PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:54] PROBLEM - aqs endpoints health on aqs1020 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:54] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:58] PROBLEM - aqs endpoints health on aqs1021 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:16:18] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:16:48] PROBLEM - aqs endpoints health on aqs1016 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:16:48] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:17:18] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:19:35] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[13:20:33] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (10Ottomata)
[13:20:49] all the above alerts are related to the network maintenance
[13:23:45] (SystemdUnitFailed) resolved: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:56] RECOVERY - aqs endpoints health on aqs1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:23:57] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:28] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:50] RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:50] RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:52] RECOVERY - aqs endpoints health on aqs1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:53] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:58] RECOVERY - aqs endpoints health on aqs1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:25:14] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:25:29] (SystemdUnitFailed) firing: jupyter-appledora-singleuser.service Failed on stat1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:45] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[13:27:00] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:51] (HdfsMissingBlocks) firing: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[13:33:06] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 6 failed LD(s) (Offline, Offline, Offline, Offline, Offline, Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T333960 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:33:25] elukey: getting started on the reverse
[13:34:51] (HdfsMissingBlocks) resolved: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[13:35:14] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (10Ottomata)
[13:36:38] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) 05Open→03Resolved a:03ayounsi Closing the task as the upgrade is done. It went extremely smoothly, thank you everybody!...
[13:39:03] !log leave hdfs safemode T331882
[13:39:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:39:06] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[13:48:58] 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) Okay, so it sounds like we are back to our preferred choice: one prediction pe...
[13:49:50] (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:02:15] 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Peachey88)
[14:07:30] Hello steve_munene, are those two types of alerts ("Last successful gobblin run", "HDFS missing blocks") temporary problems generated by the switch upgrade, or should we investigate?
[14:17:18] Hi aqu, they seem to have recovered, but yes, they were due to the switch upgrade
[14:21:45] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene)
[14:43:40] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C...
[14:44:50] (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:57:42] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: apt-daily-upgrade.service,apt-daily.service,clean_puppet_client_bucket.service,confd_prometheus_metrics.service,export_smart_data_dump.service,hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service,ipmiseld.service,lldpd.service,logrotate.service,man-db.service,prometheus-debian-version-textfile.service,prometheus-ipmi-export
[14:58:16] ce,prometheus-nic-firmware-textfile.service,prometheus-node-exporter-apt.service,prometheus-node-exporter.service,prometheus_intel_microcode.service,prometheus_puppet_agent_stats.service,rsyslog.service,syslog.socket,systemd-journald-audit.socket,systemd-journald-dev-log.socket,systemd-journald.service,systemd-journald.socket,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@116.service,wmf_auto_restart_cron.servic
[14:58:16] to_restart_exim4.service,wmf_auto_restart_lldpd.service,wmf_auto_restart_nagios-nrpe-server.service,wmf_auto_restart_nic-saturation-exporter.service,wmf_auto_restart_prometheus-ipmi-e https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:26] PROBLEM - puppet last run on an-worker1132 is CRITICAL: CRITICAL: Puppet last ran 9 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:59:10] PROBLEM - Hadoop DataNode on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[15:00:23] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C...
[15:10:02] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[15:19:50] (GobblinLastSuccessfulRunTooLongAgo) resolved: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[15:21:30] steve_munene: one nit - do you mind downtiming an-worker1132 for some days?
[15:24:54] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Here are the [[ https://docs.google.com/document/d/1T9vcUvbyWSDOFlj...
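elukey's 15:21 request to downtime an-worker1132 would typically go through the host-downtime cookbook; a sketch, run from a cumin host, assuming the sre.hosts.downtime cookbook and these flags (duration and wording are illustrative):

    sudo cookbook sre.hosts.downtime --days 3 \
        -r 'an-worker1132 down with failed logical drives' \
        -t T333960 \
        'an-worker1132.eqiad.wmnet'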
[15:29:17] 10Data-Engineering, 10Event-Platform Value Stream, 10EventStreams, 10Patch-For-Review: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (10Ottomata) Cool, thanks for the patch. Let's involve some other users of this stream in a discussion before we decided to do...
[15:42:50] On it elukey, I agree it is quite noisy.
[15:43:53] elukey: think we can safely say the services are up with no major issues after the maintenance?
[15:53:42] Would you recommend we exclude it from HDFS and Yarn as had been done here: https://phabricator.wikimedia.org/T330979
[16:01:48] steve_munene: ah interesting! I thought that the node wasn't in service
[16:02:01] re: services - yes all good I think!
[16:03:52] ah wow I see in the tty (from mgmt console):
[16:03:53] print_req_error: I/O error, dev sda, sector 109836976
[16:04:03] so something is broken on an-worker1132
[16:04:45] trying to powercycle it
[16:05:21] It was brought back up a short while ago, there's also a ticket raised with Dell
[16:05:54] ahhh okok https://phabricator.wikimedia.org/T333960
[16:06:37] wow LD down
[16:07:04] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:16] okok so the node is completely down, we can exclude it from hdfs probably
[16:08:20] until it is up and running
[16:08:29] we can sync tomorrow about it if you want
[16:08:46] pasted the wrong link, here is the ticket with the LD details: https://phabricator.wikimedia.org/T333091
[16:09:59] ack, sending a patch to exclude/put it on standby sometime today
[16:19:12] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:12] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:12] 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10rook)
[16:46:00] 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10fnegri) +1
[16:56:36] 10Quarry, 10cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (10nskaggs)
[16:56:40] 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10nskaggs)
[16:57:20] 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10nskaggs) +1
[17:20:22] 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10rook) ` openstack project create --description 'superset' superset --domain default openstack role add --project superset --user rook member openstack role add --project superset --user rook reader `
[17:20:28] 10Quarry, 10cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (10rook)
[17:21:04] 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10rook) 05Open→03Resolved a:03rook
[17:53:26] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Jelto)
[18:37:18] (03PS1) 10Aqu: Use a disallow list to filter top articles sent to Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/905701 (https://phabricator.wikimedia.org/T333940)
[18:38:06] 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Answering some specific questions from Eric: > Will disparate WMF...
[19:50:36] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite)
[20:34:40] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:36] PROBLEM - Webrequests Varnishkafka log producer on cp3060 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:00:38] PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:01:43] RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:08:24] RECOVERY - Webrequests Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:45:38] 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Jclark-ctr) Open ticket with dell Confirmed: Service Request 165628610 was successfully submitted.
[22:47:20] 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Jclark-ctr) 05Open→03Resolved T333091 duplicate ticket
[22:48:10] 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) Submitted 2nd ticket Open ticket with dell Confirmed: Service Request 165628610 was successfully submitted. They have not responded to 1st ticket except for asking for address a...
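To check the failed logical drives on an-worker1132 flagged above (see the MegaCli page linked from the 13:33 alert), the usual MegaCli queries apply; a sketch, assuming the megacli binary is installed on the host:

    # state of every logical drive on all adapters (failed LDs show State: Offline)
    sudo megacli -LDInfo -Lall -aALL
    # per-physical-disk detail, to identify the failed members
    sudo megacli -PDList -aALL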