[00:16:37] (03PS1) 10Xcollazo: Modify geoeditor SQL scripts to play nice with Spark3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831639 (https://phabricator.wikimedia.org/T305846) [00:21:06] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:34] (03PS2) 10Xcollazo: Modify geoeditor SQL scripts to play nice with Spark3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831639 (https://phabricator.wikimedia.org/T305846) [00:28:31] (03CR) 10Xcollazo: "Something that I wanted to do was to move these files out of the oozie/ folder and into hql/ to make it clear that these are not oozie job" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831639 (https://phabricator.wikimedia.org/T305846) (owner: 10Xcollazo) [00:30:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:44] 10Data-Engineering, 10Observability-Alerting: Migrate eventlogging check_prometheus checks to alertmanager - https://phabricator.wikimedia.org/T309007 (10lmata) [01:17:12] 10Data-Engineering, 10Observability-Alerting: Migrate eventgate check_prometheus checks to alertmanager - https://phabricator.wikimedia.org/T309009 (10lmata) [01:17:42] 10Data-Engineering, 10Observability-Alerting, 10Patch-For-Review: Migrate Kafka prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309010 (10lmata) [01:18:38] 10Data-Engineering, 10Observability-Alerting, 10Patch-For-Review: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (10lmata) [02:29:43] 10Analytics, 10API Platform (Product Roadmap), 10Code-Health-Objective, 10Epic, and 3 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10BPirkle) [07:12:16] (03CR) 10Joal: "One nit and two questions (across all files)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: 10Mforns) [07:22:11] (03CR) 10Joal: Modify geoeditor SQL scripts to play nice with Spark3 (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831639 (https://phabricator.wikimedia.org/T305846) (owner: 10Xcollazo) [08:57:04] a-team: I would like to carry out a managed restart of the hadoop namenodes today, if possible. It's in support of: T311210 [08:57:04] T311210: Add an-worker11[42-48] to the Hadoop cluster - https://phabricator.wikimedia.org/T311210 [08:57:50] ack btullis - let's try it at a time of expected low activity (xx:45 for instance) [08:59:07] Great, that's just what I was about to say. So far it's only ever been the *fail-back* operation that has been an issue, but maybe that was just luck. I'll try to coordinate both the fail-over and fail-back at quiet periods. [08:59:30] thanks btullis :) [09:00:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:00:55] ^ I will ack this. 
I believe that it is relating to work on the core routers at codfw - from #wikimedia-sre [09:05:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:30:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2036 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:35:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2036 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:40:25] Proceeding to run the `sre.hadoop.roll-restart-masters` cookbook. [09:42:26] !log roll-restarting the hadoop masters via the cookbook [09:42:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:49:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [09:50:24] ^^ Not expected, but I believe we have seen this before. Hopefully transitory. Looking now. [09:59:06] btullis: yeah it is very annoying [10:00:54] 10Data-Engineering, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) [10:06:21] 10Data-Engineering, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) > @Milimetric Can you verify that this really is what you want? Going forward, it sounds like perhaps amending the documentation to be a little... [10:14:16] btullis: this graph makes me nervous - https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=28&from=now-1y&to=now [10:15:25] joal, me too. The only solution we have on the table at the moment is iceberg though, isn't it? [10:15:59] btullis: iceberg will solve part of the problem for event data, I think there must be something else [10:16:29] btullis: something related to https://phabricator.wikimedia.org/T317126 [10:16:37] I'm gonna pick that right now [10:17:37] joal: Great! That would be really good if we could identify some more files to drop. [10:18:08] joal: lol 90M? [10:29:05] * joal looks away from elukey :S [10:33:39] It's still stuck at 444 corrupt blocks, but at 33 minutes past now, we're right in the middle of the busy time for the cluster. Hopefully in 10 minutes time or so, it will be calm enough and I can try the failback operation.
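(For reference, the kind of check behind the "444 corrupt blocks" figure above is roughly the following. This is a sketch only: the kerberos-run-command wrapper and the an-master service IDs are assumptions based on names that appear later in the log, not commands copied from this session.)

    # Confirm which NameNode is active and which is standby before attempting the failback.
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

    # List the blocks the NameNode currently flags as corrupt; an empty list after a
    # failover suggests the alert is a stale JMX counter rather than real data loss.
    sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks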
[10:35:11] ack btullis [10:36:09] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:43:01] heya joal! :] thanks for the review! about the coalescing, good catch, do you think it is good to pass the coalescing number as a parameter for those queries as well, or better hardcode it as a constant? [10:43:34] Hey mforns - sorry for not having caught it on the previous review :) [10:43:44] thinking about your question [10:44:40] mforns: Since the queries are about loading data in tables, I like having the value as a parameter [10:45:42] mforns: for archiving jobs, having them hard-coded to 1 is very legit, but for table jobs I prefer parameters (for consistency with jobs having to coalesce to stuff bigger than 1)! [10:47:24] joal: I agree that coalesce=1 is part of the logic of the archiving job, and would be best hardcoded. I will change that if you don't mind (since the oozie jobs had that parametrized). I will add a parameter for the partitions of the calculation queries :] [10:47:44] Many thanks for this mforns :) [10:48:24] mforns: it was Xabriel who made me think of the hard-coded value earlier on today, as he did it naturally on his CR for geoeditors :) [10:48:47] Xabriel rocks! [10:51:37] !log attempting failback operation on hadoop namenodes [10:51:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:52:08] https://www.irccloud.com/pastebin/oiFzEwhw/ [11:00:20] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [11:06:35] I assume this error resolution --^ is a sign of a successful master change btullis ? [11:08:04] joal: Yes, the failback operation succeeded. I don't know exactly how it relates to the corrupt blocks alert, but the timing would strongly indicate that this is the case. [11:08:23] awesome :) [11:09:18] my impression in the past was that the JMX metric counter got in a weird state when failing over, because every time a fsck showed no sign of issue [11:09:21] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:10:01] (03CR) 10Mforns: [V: 03+2] "Good catches! Using '\t' instead of the naked tab char is possible! Changed all your suggestions." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: 10Mforns) [11:10:12] (03PS3) 10Mforns: Migrate unique devices queries to SparkSql and move to /hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) [11:12:35] elukey: Am I safe to ignore this PCC error? https://puppet-compiler.wmflabs.org/pcc-worker1001/37243/an-worker1148.eqiad.wmnet/change.an-worker1148.eqiad.wmnet.err [11:12:35] I suppose it is because `facter` is returning the number of mounts on the pcc-worker, not the actual target. Is that right?
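(To illustrate the coalesce convention agreed above: archive queries keep a hard-coded single output file, while table-loading queries take the partition count as a substitution variable. A sketch only; the variable name, hint and file name below are illustrative and not taken from the actual refinery patches.)

    # Illustrative spark3-sql invocation: the HQL would reference ${coalesce_partitions},
    # for example through a Spark 3 hint such as /*+ COALESCE(${coalesce_partitions}) */.
    spark3-sql --master yarn \
        -f unique_devices_per_domain_daily.hql \
        -d destination_table=wmf.unique_devices_per_domain_daily \
        -d coalesce_partitions=1 \
        -d year=2022 -d month=9 -d day=13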
[11:13:11] (03CR) 10Mforns: [V: 03+2] Migrate unique devices queries to SparkSql and move to /hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: 10Mforns) [11:15:36] btullis: ah interesting! Yes I think so, the mountpoints are coming directly from facter indeed [11:16:41] (03CR) 10Mforns: [V: 03+2] Migrate unique devices queries to SparkSql and move to /hql (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: 10Mforns) [11:24:53] elukey: Great, thanks. All being well, I'll merge and deploy this after lunch then. https://gerrit.wikimedia.org/r/c/operations/puppet/+/831841 [11:28:09] PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7281 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:33:10] ^^ I think this should also fix itself. It's related to the failover again. [11:33:29] (03PS1) 10Joal: Add missing tables to drop-mediawiki-snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831866 (https://phabricator.wikimedia.org/T317126) [11:33:58] btullis: we're gonna drop A LOT of data after that patch --^ :) [11:34:29] mforns: if you have a minute, could you review that patch please --^ ? [11:43:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [11:48:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [12:02:50] RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 76 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:14:13] joal: That's excellent! 
[12:17:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [12:22:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [12:49:58] 10Data-Engineering, 10Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (10ntsako) @JAnstee_WMF We can always filter out `nulls` with a `not null` clause, it serves the same purpose. I have modified the `YES/FALSE` values to be `YES/NO` values on the file I loaded so that it is... [12:50:12] 10Data-Engineering, 10Equity-Landscape: Load country data - https://phabricator.wikimedia.org/T310712 (10ntsako) a:05ntsako→03JAnstee_WMF [13:00:25] PROBLEM - Check systemd state on an-worker1144 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:05] ^^ don't worry about this. [13:02:27] RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:11] (03CR) 10Joal: [C: 03+2] "LGTM! Merging :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: 10Mforns) [13:08:57] thanks joal! Looking at https://gerrit.wikimedia.org/r/c/analytics/refinery/+/831866/ [13:10:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [13:11:56] (03CR) 10Mforns: [C: 03+2] "Oh, wow! There were *a lot* of missing tables... LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831866 (https://phabricator.wikimedia.org/T317126) (owner: 10Joal) [13:13:53] btullis: I got a +2 from mforns on the patch --^ Is it ok for you if I launch a manual run of it now, even if not deployed? [13:15:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [13:15:18] joal: Fine by me. In a meeting right now, but still around if need be. 
[13:15:25] ack btullis - doing that [13:19:05] (03PS2) 10Joal: Add missing tables to drop-mediawiki-snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831866 (https://phabricator.wikimedia.org/T317126) [13:20:04] (03PS3) 10Joal: Add missing tables to drop-mediawiki-snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831866 (https://phabricator.wikimedia.org/T317126) [13:21:01] !log Manual launch of refinery-drop-mediawiki-snapshots with new tables in patch https://gerrit.wikimedia.org/r/831866 [13:21:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:42:12] (03CR) 10Xcollazo: Modify geoeditor SQL scripts to play nice with Spark3 (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831639 (https://phabricator.wikimedia.org/T305846) (owner: 10Xcollazo) [14:13:53] (03PS3) 10Xcollazo: Modify geoeditor SQL scripts to play nice with Spark3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831639 (https://phabricator.wikimedia.org/T305846) [14:23:04] a-team: I'm about to push out version 2.10.2 of the hadoop packages over 2.10.1 to A:hadoop-all - we've tested them in the test cluster and with the 6 new hadoop nodes. It will need a rolling restart to pick up the version change. It's related to this: https://phabricator.wikimedia.org/T311807 [14:24:50] I was reminded because of this message in the HDFS datanode web UI. [14:24:51] https://usercontent.irccloud-cdn.com/file/pEWMzUfa/image.png [14:31:13] PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [14:31:35] ^^ Uh oh, that's not good. Looking now. [14:32:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:33:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:33:25] RECOVERY - At least one Hadoop HDFS NameNode is active on an-master1001 is OK: Hadoop Active NameNode OKAY: an-master1001-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [14:33:55] Oh, the namenode process got restarted. I thought that it didn't do that. [14:37:12] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:38:05] Well that was a big mistake on my part, sorry about that. Everything has restarted cleanly, but I wonder what jobs in flight might have been affected. 
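(For context on the rollout above: upgrading the packages on A:hadoop-all does not by itself restart the daemons, which is why a rolling restart is needed. A sketch of the kind of check involved, run from a cumin host; the roll-restart-workers cookbook name is an assumption modelled on the masters cookbook used earlier, not something taken from this log.)

    # Confirm which hadoop-common version each node actually has installed.
    sudo cumin 'A:hadoop-all' "dpkg-query -W -f='\${Version}\n' hadoop-common"

    # Daemons keep the old jars loaded until restarted, hence the rolling restart.
    sudo cookbook sre.hadoop.roll-restart-workers analytics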
[14:38:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:39:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: analytics-reportupdater-logs-rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:20] ^^ This is one. That was an hdfs-rsync job that failed. I will investigate. [14:42:27] !log sudo systemctl restart analytics-reportupdater-logs-rsync.service on an-launcher1002 [14:42:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:43:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:16] joal, what do you think, should we have 4 different Airflow DAGs for unique_devices? Or just 2: a daily one and a monthly one, both handling per_domain as well as per_project_family? [14:49:07] cc aqu ^^ [14:50:16] not sure whats up, but as of ~25 minutes ago hadoop stuff on an-airflow1001 is failing with java.io.FileNotFoundException: /usr/lib/hadoop/hadoop-common-2.10.1.jar (No such file or directory) [14:50:29] running `beeline` reproduces it (along with spark jobs and whatnot) [14:51:14] oddly the file is there, so its a bit mysterious :) [14:51:15] ebernhardson: That's my fault. Sorry. I've been rolling out new hadoop packages to version 2.10.2 [14:51:34] btullis: ahh, if its expected then thats ok. I can retry the failing things in an hour or two [14:51:48] I think that the error is coming from the hadoop workers, rather than from the airflow machine. [14:52:26] The path will need to be updated. I wasn't expecting this fall-out, unfortunately. Do you know where the 2.10.1.jar reference is? [14:52:54] btullis: not sure, it feels like system-level config since it happens in totally unrelated things like spark-submit and beeline [14:55:36] !log rolling out updated hadoop packages to analytics-airflow (cumin alias) hosts [14:55:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:00:20] ebernhardson: Could you retry one of the failing things now please? I've updated the hadoop packages on an-airflow1001. I'm not confident that it will fix it though, I suspect that somewhere 2.10.1 is defined in code. [15:03:11] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:04:19] btullis: same error :( but the files on disk are now 2.10.2 [15:04:37] btullis: seems plausible then its coming from a worker node, [15:06:10] Interestingly, `hive` doesn't show the error, but `beeline` does. [15:06:40] maybe something with metastore?
I suppose i'm also getting errors from airflow jobs that talk to metastore [15:06:58] but hive command should also talk to that, hmm [15:07:07] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:07:23] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:08:09] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:08:55] PROBLEM - Check unit status of refine_event_sanitized_main_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:09:11] Oh, it looks like hive isn't working. It loads, but gets a transport error. [15:09:14] https://www.irccloud.com/pastebin/vxXWG1qo/ [15:10:17] PROBLEM - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:11:12] !log restart hive-server2 and hive-metastore service on an-coord1002 to pick up new version of hadoop [15:11:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:17:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,eventlogging_to_druid_prefupdate_hourly.service,refine_event_sanitized_analytics_immediate.service,refine_event_sanitize [15:17:49] immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:10] !log restarted yarn service on an-master1002 to make the active host an-master1001 again. [15:20:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:21:55] !log failed over hive to an-coord1002 via DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/831906 [15:21:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:24:19] it's not complaining anymore, at least invoking beeline :) [15:26:07] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:27:08] ebernhardson: Oh that's good. A step in the right direction, anyway. 
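(A rough sketch of how the stale 2.10.1 reference can be narrowed down; the hostnames and service names follow the log, but the commands are a generic diagnostic sketch rather than what was actually run.)

    # On an affected host: which hadoop-common jars exist on disk now?
    ls -l /usr/lib/hadoop/hadoop-common-*.jar

    # Are long-running JVMs (hive-server2, metastore) still holding the deleted
    # 2.10.1 jar open? That would explain beeline failing while the files look fine.
    sudo lsof -nP 2>/dev/null | grep hadoop-common-2.10.1.jar

    # Restarting the Hive daemons makes them pick up the upgraded jars,
    # which is what the 15:11 restart on an-coord1002 did.
    sudo systemctl restart hive-metastore.service hive-server2.service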
[15:27:51] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:39:25] !log Going to downgrade hadoop on all hadoop-worker nodes to 2.10.1 [15:39:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:44:54] !log cancel that last message. Upgrading hadoop packages on an-launcher instead. They were inadvertently omitted last time. [15:44:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:46:21] !log restarting eventlogging_to_druid_editattemptstep_hourly.service on an-launcher1002 [15:46:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:49:29] That succeeded. [15:49:37] !log restarting eventlogging_to_druid_navigationtiming_hourly.service on an-launcher1002 [15:49:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:51:30] RECOVERY - Check unit status of refine_event_sanitized_main_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:51:49] !log restarting eventlogging_to_druid_network_flows_internal_hourly.service eventlogging_to_druid_prefupdate_hourly.service refine_event_sanitized_analytics_immediate.service refine_event_sanitized_main_immediate.service [15:51:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:52:02] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:52:40] RECOVERY - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:54:36] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:55:43] !log rolling out upgraded hadoop client packages to stat servers. [15:55:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:56:34] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:57:02] !log rolling out updated hadoop packages to an-airflow1003 [15:57:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:10:09] !log Rerun failed oozie webrequest jobs [16:10:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:11:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:30] joal: Would you have a moment to check with me if things are all back to normal re hadoop please? [16:13:35] yes [16:13:40] batcave? [16:13:47] Sure thing.
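(The timer recoveries that follow boil down to a simple loop on an-launcher1002; a sketch using one of the unit names from the alerts above.)

    # See which timer-driven services are currently failed.
    systemctl list-units --state=failed --no-legend

    # Re-run one of them once Hive/HDFS are healthy again, then check its log.
    sudo systemctl restart eventlogging_to_druid_editattemptstep_hourly.service
    sudo journalctl -u eventlogging_to_druid_editattemptstep_hourly.service -n 50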
[16:15:58] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:18:28] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:29:19] !log restarting oozie on an-coord1001 [16:29:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:30:16] !log restarting hive-server2 and hive-metastore on an-coord1001 (currently standby) [16:30:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:33:31] btullis: The nice thing is we gained 6 hosts on the cluster :) [16:34:02] btullis: oozie jobs now kicking in successfully [16:34:09] !log rerun failed webrequest oozie jobs [16:34:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:34:16] Great. Thanks joal. [16:39:11] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:45:05] !log Kill-rerun suspended oozie jobs (virtual-pageview and predictions-actor) [16:45:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:50:21] btullis: how are you in terms of refine rerun? [16:51:16] I have only restarted the systemd services as per the SAL above. I haven't done any manual re-runs as a result of the emails. [16:51:52] ok - we need to do that [16:53:30] !log Rerun refine_eventlogging_analytics [16:53:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:59:35] joal, to confirm. I haven't yet run any manual refine jobs. Happy to do so later, or happy for you to do if you have time. Once again, apologies for my mistakes today. [17:00:35] btullis: I'm doing manual reruns of refine, and will send emails accordingly - No big deal, mistakes happen :) [17:10:55] Thanks so much.
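(For the refine reruns mentioned here: these jobs run as systemd units on an-launcher1002, so a manual rerun is typically just a matter of starting the unit again and letting Refine re-process the hours that were never marked done. A sketch, assuming the refine_event unit name implied by the !log entries; the done-flag behaviour is described from memory rather than from this log.)

    # Kick a refine job again; Refine skips hours already flagged as done and
    # re-processes the ones that failed during the restarts.
    sudo systemctl start refine_event.service
    sudo journalctl -u refine_event.service -f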
[17:13:23] btullis: I got confirmation [17:13:33] btullis: I got confirmation oozie jobs are now succeeding [17:14:36] !log rerun refine_netflow [17:14:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:14:42] !log rerun refine_event [17:14:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:22:34] !log rerun refine_eventlogging_legacy [17:22:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:23:59] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:24:06] (03PS4) 10Joal: Add missing tables to drop-mediawiki-snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/831866 (https://phabricator.wikimedia.org/T317126) [17:24:59] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:25:43] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:26:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:42] 10Analytics, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10apaskulin) [17:28:13] 10Analytics, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10apaskulin) [17:31:03] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:34:15] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:48:50] btullis: I'm waiting for an oozie druid indexation job to either succeed or fail, but I think we're gonna need a druid upgrade + restart [17:51:37] 10Data-Engineering, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) @Milimetric Thanks for the reply and for expanding the docs. I think that Wikitech is a more appropriate place for documentation of the group th...
[17:57:39] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:55] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:05:08] hm [18:08:09] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:09:09] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:10:09] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:10:51] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:20:36] joal: I see all the crazy, how can I help? [18:25:06] hey milimetric - we're back on track as far as I can see - waiting for a successful druid indexation on oozie, then stopping for tonight [18:27:10] 10Data-Engineering, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) >>! In T317545#8233696, @BCornwall wrote: > I think that Wikitech is a more appropriate place for documentation of the group than the codebase:... [18:27:27] ok jo, if anything else goes wrong, I'm out of meetings, I can take over [18:28:24] thanks for offering milimetric :) [18:29:08] milimetric: druid UI says latest indexations worked so I'm not worried anymore - just triple checking :) [18:29:57] 10Data-Engineering, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) [18:31:35] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:33:54] 10Data-Engineering, 10SRE, 10SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) @ottomata or @elukey I'm under the impression that one of you would be the best person to handle the Kerberos access. If that's true, would you be kind enough to prov...
[19:03:49] Gone now :) [19:08:03] 10Data-Engineering, 10SRE, 10SRE-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BTullis) >>! In T317545#8233883, @BCornwall wrote: > @ottomata or @elukey I'm under the impression that one of you would be the best person to handle the Kerberos access. If tha... [22:06:33] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:17:34] 10Data-Engineering, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: Create k8s deployment of AQS 2.0 - https://phabricator.wikimedia.org/T288661 (10VirginiaPoundstone) [22:31:58] 10Data-Engineering, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: Obtain a security review of AQS 2.0 - https://phabricator.wikimedia.org/T288663 (10VirginiaPoundstone) [22:34:16] 10Data-Engineering, 10API Platform, 10Code-Health-Objective, 10Platform Engineering Roadmap, 10User-Eevans: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (10VirginiaPoundstone) [22:35:27] 10Data-Engineering, 10API Platform, 10Code-Health-Objective, 10Epic, and 3 others: Problem details for HTTP APIs (rfc7807) - https://phabricator.wikimedia.org/T302536 (10VirginiaPoundstone)