[00:24:40] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:07:24] PROBLEM - Check unit status of refinery-import-siteinfo-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:53:58] legoktm: re: confusing names, we agree :)
[03:14:36] PROBLEM - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:47:06] PROBLEM - Check unit status of refinery-import-page-current-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:00:27] btullis: o/ - not sure if I am missing something for the hive2 metastore/server restarts, but with the analytics-hive.eqiad.wmnet CNAME they can be done anytime without causing issues (for example, CNAME -> an-coord1002, wait for TTL to expire, restart on 1001, check, failback, etc..)
[06:23:02] (CR) Gergő Tisza: [C: +2] Add a link: Update schema to support edit mode and link inspector toggles; add client_ip [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[06:23:59] (Merged) jenkins-bot: Add a link: Update schema to support edit mode and link inspector toggles; add client_ip [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[07:51:01] (PS1) Awight: Review access change [analytics/reportupdater-queries] (refs/meta/config) - https://gerrit.wikimedia.org/r/709646
[07:51:58] ^ If anyone has a moment, we're missing one minor bit to be able to merge our team's work in this repo.
[08:18:41] Thanks elukey. I will read more about that. I guess I thought that the metastore was more stateful than that.
[08:20:00] btullis: I added some info at the time to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator#Hive
[08:20:41] the metastores on both coordinators point to the same db (currently the one on an-coord1001; having a more dynamic config is one of the next steps for the failover config)
[08:21:12] so simply switching analytics-hive.eqiad.wmnet (plus time for the TTL to expire etc..) works well, clients fail over transparently
[08:21:29] so we can operate on hive daemons freely
[08:21:46] the problem now is if we have to, say, reboot an-coord1001 - that requires a more invasive approach
[08:23:40] (I can clarify the docs more if they are not clear)
[08:25:05] Ah great. Ok, I'll send out an email later cancelling the maintenance window then.
[08:39:48] btullis: I think it is fine if you keep it, you can just reply that everything is finished sooner :)
[08:41:39] we added the hive CNAME alias at the time since we had an-coord1001 hardcoded everywhere, especially in refinery, and a change to it would have required a roll restart of all oozie jobs, etc..
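A minimal sketch of the failover flow elukey describes above, for illustration only: the analytics-hive.eqiad.wmnet name and the an-coord hosts come from the log, but the actual CNAME change goes through the DNS repo (not shown), and the Hive systemd unit names and the TTL value here are assumptions, not confirmed in the log.

    # Check which coordinator analytics-hive.eqiad.wmnet currently points to,
    # and how long cached answers can live (the TTL is in the answer).
    dig +noall +answer analytics-hive.eqiad.wmnet CNAME

    # 1. Point the CNAME at an-coord1002 (done in the DNS repo, not shown here).
    # 2. Wait at least one TTL so clients have re-resolved the name.
    sleep 300   # assuming a 5-minute TTL; check the real TTL in the dig output above

    # 3. Restart the Hive daemons on the now-idle coordinator (an-coord1001).
    #    Unit names are assumed; verify first with: systemctl list-units 'hive*'
    sudo systemctl restart hive-metastore hive-server2
    sudo systemctl status hive-metastore hive-server2

    # 4. Once healthy, switch the CNAME back (failback) the same way.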
[08:42:07] (try to git grep analytics-hive on refinery to see the amount of repetition :D)
[08:42:32] ideally the actual config could work also in active/active
[08:42:50] but we don't have support for LVS within the analytics vlan
[08:47:21] (I believe this is due to the fact that the lvs hosts don't have any interface on the analytics vlans like it happens with the prod ones, so L2 forwarding for direct return doesn't work)
[08:48:03] (could we add support for it? Maybe - I recall that SRE, when we asked, was not very fond of the idea)
[08:51:06] (asked in #traffic)
[09:00:04] Valentin is going to ask at the Traffic team's meeting and then we'll know
[09:00:20] if there is a way forward it could be really interesting for a couple of use cases
[09:00:38] 1) druid analytics, since we currently hardcode druid hostnames in turnilo/superset
[09:01:02] (only one for each tool, so the queries are not distributed among all nodes)
[09:01:07] 2) analytics-hive
[09:01:19] (active/active, with confctl to pool/depool)
[10:09:50] Right, thanks for all this. I will look into it.
[11:08:41] hellooo
[12:22:32] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (klausman) I did an analysis of the ATS and Varnish Kafka topics as reported for `cp3050.esams.wmnet` (the only host that currently feeds...
[13:01:25] (PS1) GoranSMilovanovic: T282563 [analytics/wmde/WD/WikidataAdHocAnalytics] - https://gerrit.wikimedia.org/r/709690
[13:01:42] (CR) GoranSMilovanovic: [V: +2 C: +2] T282563 [analytics/wmde/WD/WikidataAdHocAnalytics] - https://gerrit.wikimedia.org/r/709690 (owner: GoranSMilovanovic)
[13:14:11] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (Ottomata) 10-20 seconds / 0.02% missing seems acceptable to me. Perhaps this is enough verification to proceed?
[13:16:06] Analytics, Analytics-Kanban, EventStreams, User-Addshore: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (Addshore) Open→Resolved a: Addshore Looks good to me And now I also see the `mediawiki.pa...
[13:27:59] milimetric: o/ morning, if you are around i could use some help with some of these alarms
[13:29:51] ottomata: I'll be ready to go in 30 min
[13:31:13] k
[13:45:24] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (BTullis) I'm trying to get my head around what the implications of these two statements are: > usually, there are 0.02% of events that ar...
[13:51:35] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (Ottomata) I think that's right!
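For context on the T254317 thread above, a rough illustration of one side of the kind of volume cross-check being discussed, assuming the standard wmf.webrequest layout (hostname field plus webrequest_source/year/month/day/hour partitions); the real analysis in the task was done against the raw atskafka and varnishkafka Kafka topics, so this is only an approximation and the partition values are placeholders.

    # Hourly request counts for the cache host in question, as seen in the
    # production webrequest table. Where the atskafka-produced data lands is not
    # shown in this log, so the other side of the comparison is left out here.
    hive -e "
      SELECT hour, COUNT(*) AS requests
      FROM wmf.webrequest
      WHERE webrequest_source = 'text'
        AND hostname = 'cp3050.esams.wmnet'
        AND year = 2021 AND month = 7 AND day = 30
      GROUP BY hour
      ORDER BY hour;
    "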
[14:01:12] k ottomata, how can I help
[14:10:56] milimetric: i'm looking into the failed siteinfo dumps thing
[14:10:59] i think i know why it's broken
[14:11:01] but
[14:11:10] there are quite a few false false_positives
[14:11:18] from the webrequest upload data loss alert
[14:11:23] Data Loss Warning - Workflow webrequest-load-check_sequence_statistics-wf-upload-2021-8-3-1
[14:11:46] ah shoot lost the output, gotta rerun
[14:12:01] not sure what to do about that one
[14:12:41] ottomata: I can rerun it and take a closer look
[14:15:36] milimetric: https://gist.github.com/ottomata/e59118b3f242dcc03882df5b5cd1dc12
[14:15:48] it seems to be from all hosts
[14:16:10] OH WAIT
[14:16:12] min seq is 0
[14:16:30] oh, no
[14:16:35] sorry can't read column alignment
[14:16:49] nm
[14:17:22] :) loading in spreadsheet and looking
[14:19:32] Analytics, SRE: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (elukey)
[14:20:42] ottomata: it's 1.3% loss, so not huge, but yeah, weird since it never happens. Was there some event that affected all the hosts?
[14:24:04] i dunno not that i know of
[14:24:07] lemme ask
[14:24:59] ottomata: wait, it's saying all false positives for me:
[14:25:02] https://www.irccloud.com/pastebin/JZlr3xer/
[14:25:33] same min/max sequence numbers for each host, but the actual count with outliers is exactly the same in this case, unlike in your run
[14:25:44] this is how I ran it: sudo -u analytics kerberos-run-command analytics spark2-sql --master yarn -S --jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar -f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql -d table_name=wmf_raw.webrequest -d webrequest_source=upload -d year=2021 -d month=8 -d day=3 -d hour=1
[14:26:18] so between when you ran it and when I ran it, new data landed with those sequence numbers? Is that gobblin doing something weird?
[14:26:25] lemme run again
[14:27:24] milimetric: apparently that's ascii art according to wmopbot
[14:27:51] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (BTullis) I have created the kerberos principal and the keytab entry for an-coord1001, as per the same operation on the test clust...
[14:29:51] :) art is in the eye of the beholder
[14:32:02] milimetric: i still get all alarm_is_false_positive fase
[14:32:03] false
[14:32:16] oh
[14:32:18] i'm checking webrequest
[14:32:20] not webrequest raw
[14:32:22] is that wrong?
[14:32:34] oh! uh... lemme think
[14:32:43] ops week task says to use wmf.webrequest if the alert was a warning
[14:32:45] which it was
[14:34:57] I don't see how, it should've been an error since it's 1.3%, over the threshold, and the check kind of confirms that it wasn't refined
[14:35:13] but wait... it was partially refined
[14:35:17] I don't understand
[14:35:32] maybe the % loss is calculated differently than I did it
[14:36:07] partially refined?
[14:36:18] so wmf_raw has all the rows, with the right sequence numbers
[14:36:35] if it was a warning, it should've refined everything and wmf.webrequest should have all the rows too
[14:36:50] oh weird
[14:36:51] i see
[14:36:58] no loss at all
[14:37:00] in raw
[14:37:10] but maybe it was more dynamic, like higher sequence numbers came in, making the data loss > 1.3% later, after refine happened
[14:37:42] hm
[14:37:46] ok i'll rerun that hour?
[14:38:23] yeah, I think so, I'm just not sure what to use for threshold, maybe do 2%?
[14:38:34] well, shouldn't i just rerun as is?
[14:38:39] if the data is all in webrequest raw
[14:38:46] there shouldn't be any loss after the refine
[14:38:48] right?
[14:38:48] the min/max sequence numbers suggest loss > 1%
[14:39:01] and the false alarm check verifies they're all false positives
[14:39:13] but the refine job might just fail since it doesn't do a false positive check
[14:39:26] but
[14:39:46] wait
[14:39:56] expected count is the same for all?
[14:40:17] newly_computed_rows_loss is 0
[14:40:29] milimetric: why does your query on raw have any rows at all?
[14:40:40] the first host for example: max 16886836343 and min 16893081259
[14:40:48] so there's a difference between those two
[14:40:58] and that's the expected
[14:41:13] oh yea
[14:41:29] if the actual is equal to that... maybe the sequence numbers don't match up or something?
[14:42:41] (I'm looking closer at the query)
[14:44:24] ok, so that seems to be the case. The sequence numbers are different than expected, but the count is the same as expected
[14:44:36] we have COUNT(DISTINCT(sequence)) on one side, and (MAX(max_seq) - MIN(min_seq) + 1) on the other
[14:49:02] ah
[14:49:15] oh weird
[14:49:15] so
[14:49:27] wait how is that possible
[14:50:31] I'm looking at https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
[14:50:32] milimetric: that is per host, right?
[14:50:35] and I'm looking at hive -e "select * from wmf_raw.webrequest_sequence_stats where year=2021 and month=8 and day=3 and hour=1"
[14:50:44] which is showing what the computed loss was at the time of refine
[14:50:52] right, per host
[14:51:15] why do we do MAX(max_seq) ?
[14:51:26] this is what we recorded at original refine time:
[14:51:39] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (BTullis) Verified that the updated keytab is present on an-coord1001 ` btullis@an-coord1001:~$ sudo puppet agent -tv Info:...
[14:51:40] https://www.irccloud.com/pastebin/2OS7NmEE/
[14:54:11] right milimetric and that matches the output from the check false positives
[14:54:13] ok, ottomata, the only thing that makes sense to me right now is that the actual count was different at refine time than it is now. The actual count recorded in the webrequest_sequence_stats table should be the same as the COUNT(DISTINCT(sequence)) that the false positive check runs
[14:54:36] and it is, for wmf.webrequest, but not for wmf_raw.webrequest
[14:54:40] meaning more data landed since refine ran
[14:55:26] OH I SEE
[14:55:35] your actual_count_with_outliers is 0 on webrequest raw
[14:55:38] ok, so i'll rerun?
[14:55:39] I don't understand why the false positive check on wmf_raw still returns records, I checked and it doesn't know about the original data at all, it's just looking at current data. So that's a gap in my knowledge
[14:55:48] hm
[14:55:59] yeah, try rerunning
[14:56:37] hm
[14:57:10] !log rerunning webrequest refine for upload 08-03T01:00 - 0042643-210701181527401-oozie-oozi-W
[14:57:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:47] (PS1) Ottomata: import-mediawiki-dump - move json.load into try block [analytics/refinery] - https://gerrit.wikimedia.org/r/709734
[15:02:29] RECOVERY - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:02:34] (if that fails, I'd say just up the threshold to 2%)
[15:02:53] if the underlying data has changed and is now complete, i'd expect it to succeed
[15:03:43] (CR) Milimetric: [V: +2 C: +2] import-mediawiki-dump - move json.load into try block [analytics/refinery] - https://gerrit.wikimedia.org/r/709734 (owner: Ottomata)
[15:10:05] (CR) Milimetric: Refine - replace default formatters with gobblin convention (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:11:05] the underlying data has changed, but I don't 100% understand how it computes loss and why it still pulls up those false positives in the data loss false positive checker
[15:11:30] right
[15:16:03] (PS4) Ottomata: Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232)
[15:16:06] (CR) Ottomata: Refine - replace default formatters with gobblin convention (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:16:40] Analytics, SRE, Traffic, Patch-For-Review: Compare logs produced by atskafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (elukey) Something that I noticed, that may be totally off: ` scala> spark.sql("SELECT count(*) FROM wmf.webrequest where webrequest_sourc...
[15:19:19] oh milimetric one more camus review for ya
[15:19:20] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/708816
[15:22:37] ottomata: are you planning on cleaning up the other refinery camus stuff in a different patch, or should I list what I find in here?
[15:22:52] (like tests like these: https://github.com/wikimedia/analytics-refinery/blob/1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3/python/tests/test_refinery/test_hive.py#L44)
[15:24:24] (PS3) Ottomata: Remove refinery-camus module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232)
[15:24:26] (CR) Milimetric: "Changes look good, also found this test:" [analytics/refinery] - https://gerrit.wikimedia.org/r/708816 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:24:44] milimetric: i think those ones can stay for now
[15:24:50] those are about working with camus formatted directories
[15:24:53] ok, cool, then +2ing
[15:24:56] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (BTullis) Oh I see it's more complicated than I was thinking. I've added a patch to add the presto keytabs to an-coord1002 but th...
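To make the earlier comparison concrete: a sketch of the per-host arithmetic the sequence-statistics check is doing, i.e. the number of distinct sequence numbers actually present versus the count implied by the min/max sequence span. The formula and the partition values (upload, 2021-08-03 hour 1) come from the discussion above; the hostname and sequence column names are assumed to match the raw webrequest schema, and the real logic lives in generate_sequence_statistics.hql and check_dataloss_false_positives.sparksql, so this is an illustration, not the production query.

    # Per-host data-loss arithmetic: if the distinct sequence count equals
    # (max - min + 1), nothing is missing for that host in this hour.
    hive -e "
      SELECT
        hostname,
        COUNT(DISTINCT sequence)                                        AS actual_count,
        MAX(sequence) - MIN(sequence) + 1                               AS expected_count,
        (MAX(sequence) - MIN(sequence) + 1) - COUNT(DISTINCT sequence)  AS missing_rows
      FROM wmf_raw.webrequest
      WHERE webrequest_source = 'upload'
        AND year = 2021 AND month = 8 AND day = 3 AND hour = 1
      GROUP BY hostname;
    "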
[15:24:57] which might be ok to keep around for now
[15:25:02] oh ok makes sense
[15:25:08] if we ever have yyyy/mm/dd/hh directories anywhere
[15:25:09] that will work
[15:25:22] (CR) Milimetric: [V: +2 C: +2] "leaving those for later" [analytics/refinery] - https://gerrit.wikimedia.org/r/708816 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:26:25] (CR) Ottomata: "> * Can probably remove /wmf/camus here: https://github.com/wikimedia/analytics-refinery-source/blob/354b8a45ac0081d6d17397376c46f623e522f" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:28:05] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (Ottomata) Oh, I didn't quite realize that either!
[15:31:22] milimetric: updated the other two refinery source camus reviews
[15:31:29] if we merge those i can start deploy
[15:41:15] (CR) Milimetric: [C: +2] "My bad, I got confused with the pagination. Looks good!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:42:15] (CR) Milimetric: [C: +2] Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:42:24] (CR) jerkins-bot: [V: -1] Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:42:40] ottomata: +2-ed that chain but there's a merge conflict I'll let you sort out
[15:44:54] Analytics-Kanban, Data-Engineering, Patch-For-Review: Add a spark job loading Cassandra 3 - https://phabricator.wikimedia.org/T280649 (Ottomata)
[15:44:56] Analytics: Finalize Gobblin Migration - https://phabricator.wikimedia.org/T287889 (Ottomata)
[15:44:58] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata)
[15:45:09] ok
[15:45:09] ty
[15:46:34] (PS5) Ottomata: Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232)
[15:47:31] PROBLEM - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:51:16] Analytics: Use corosync and pacemaker for presto coordinator active/standby configuration - https://phabricator.wikimedia.org/T287967 (BTullis)
[16:00:21] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) We met today and this is the plan forward: 1) use `topicmappr` to create a list o...
[16:01:55] Analytics: Use corosync and pacemaker for presto coordinator active/standby configuration - https://phabricator.wikimedia.org/T287967 (Ottomata) Before going too far with some new techs, we might have some ready to use at WMF. See https://wikitech.wikimedia.org/wiki/Conftool https://wikitech.wikimedia.org/...
[16:07:58] Analytics, Analytics-Kanban, Event-Platform: Enable canary events for streams by default - https://phabricator.wikimedia.org/T287789 (Ottomata) a: Ottomata
[16:08:50] Analytics, Analytics-Kanban, Patch-For-Review: Write a job entirely in Airflow with spark and/or sparkSQL - https://phabricator.wikimedia.org/T285692 (Ottomata)
[16:08:52] Analytics, Analytics-Kanban: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (Ottomata)
[16:22:14] Analytics, Data-Engineering: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (BTullis) Gobblin mentions emitting metrics via Kafka here: https://gobblin.apache.org/docs/metrics/Metrics-for-Gobblin-ETL/ Is there native support...
[16:31:24] Analytics, Data-Engineering: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (Ottomata) No, Joseph was going to have to add it.
[17:29:21] anyone around to take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/709530/?
[17:32:19] also I had a question about the process for `rsync`ing from `thorium` to the new `an-web1001`. Will I just be manually running `/usr/local/bin/published-sync` on a host that has `statistics::rsync::published`? (i.e. one of `an-launcher1002.eqiad.wmnet,stat[1004-1008].eqiad.wmnet`)
[17:52:02] ryan looking
[17:52:48] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (gmodena) Hey @Ottomata, Many thanks for this! Just wanted to give an ack that login on the host worked....
[17:53:06] ryankemper: we should probably first rsync everything in /srv on thorium over
[17:54:05] so, migration would be like
[17:54:17] setup an-web1001 without affecting anything in prod (including rsync jobs and web traffic)
[17:54:26] rsync everything thorium -> an-web1001
[17:54:43] test things like stats.wm.org there (by tunneling to an-web webserver?)
[17:54:56] we can help with the testing part once that is ready
[17:55:08] then, once things look good, rsync thorium -> an-web1001
[17:55:14] cut over prod stuff (jobs and traffic)
[17:55:26] and then just in case do one more thorium -> an-web1001 rsync
[17:55:30] or, something like that
[18:02:58] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Ottomata) OO yup.
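A rough sketch of the migration sequence ottomata outlines above, for orientation only: the host names and /srv path come from the log, but the exact transfer mechanism, rsync flags, and directory names under /srv are assumptions here and would need to be confirmed against the puppet code and access policy.

    # On an-web1001, pull the existing content over from thorium.
    # A plain rsync over ssh is assumed purely for illustration; the real
    # transfer may use a different mechanism (e.g. the published-sync wrapper).
    sudo rsync -aHv thorium.eqiad.wmnet:/srv/ /srv/

    # Sanity-check the copy before any cutover.
    df -h /srv
    ls -ld /srv/stats.wikimedia.org 2>/dev/null

    # Test the webserver without touching prod traffic, e.g. via a tunnel:
    #   ssh -L 8080:localhost:80 an-web1001.eqiad.wmnet
    # then spot-check stats.wikimedia.org content at http://localhost:8080/

    # After cutover, run one final rsync to pick up anything published in between.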
[18:07:27] (PS1) Ottomata: Update changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709784
[18:07:53] RECOVERY - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:07:55] (CR) Ottomata: [C: +2] Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[18:08:07] (CR) Ottomata: [V: +2 C: +2] Update changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709784 (owner: Ottomata)
[18:08:25] Starting build #93 for job analytics-refinery-maven-release-docker
[18:19:04] ottomata: wrt
[18:19:05] > setup an-web1001 without affecting anything in prod (including rsync jobs and web traffic)
[18:19:11] ya
[18:19:12] would setting `an-web1001` to `role(analytics_cluster::webserver)` affect prod traffic? like should I save the `site.pp` change until after the rsync
[18:19:23] i think it won't.
[18:19:35] the traffic is directed by the lvs routing stuff
[18:19:44] lemme verify...
[18:20:12] okay that was my thinking as well, will take a closer look at the puppet code to vierfy
[18:20:14] verify*
[18:20:46] yeah looks ok
[18:20:58] and oh just saw the comment on the patch, okay so [assuming we don't find out that it does affect prod traffic] we'll merge just the site.pp change, then rsync, then test, then change the `published.pp` at time of cutover?
[18:21:08] all that should do is set up the webserver and other various things, it shouldn't affect anything other than stuff on an-web1001
[18:21:13] I was thinking we needed the `published.pp` change in place in order to do the rsync
[18:21:20] But is that actually just for automating the rsync or something
[18:21:29] looking again..
[18:22:08] Project analytics-refinery-maven-release-docker build #93: SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/93/
[18:22:09] ok yeah
[18:22:16] statistics::rsync::published is included on client nodes (stat boxes)
[18:22:43] statistics::published
[18:22:47] is included on thorium
[18:22:51] and will be on an-web1001
[18:23:02] statistics::published sets up the rsync server (and some other stuff there)
[18:23:17] statistics::rsync::published sets up the cron jobs that push data
[18:29:28] Starting build #51 for job analytics-refinery-update-jars-docker
[18:30:05] (PS1) Maven-release-user: Add refinery-source jars for v0.1.16 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/709808
[18:30:07] Project analytics-refinery-update-jars-docker build #51: SUCCESS in 38 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/51/
[19:02:27] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Add ability to compare wikis - https://phabricator.wikimedia.org/T283251 (odimitrijevic)
[19:02:45] !log Deployed refinery using scap, then deployed onto hdfs
[19:02:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:02:54] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Epic: Add ability to compare wikis - https://phabricator.wikimedia.org/T283251 (odimitrijevic)
[19:02:57] PROBLEM - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:23:18] !log bump Refine to refinery version 0.1.16 to pick up normalized_host transform - now all event tables will have a new normalized_host field - T251320
[19:23:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:23:23] T251320: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320
[19:23:54] ottomata: back from 1:1 now. so given `statistics::rsync::published` is for the cronjob, I assume we'll want to leave that for the actual cutover, meaning change just the `site.pp` in the initial patch?
[19:24:09] if so then this is ready for a rubberstamp: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709530/
[19:24:12] yes i think that's right
[19:24:25] +1
[19:24:26] :)
[19:24:31] ty :P
[19:32:41] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:34:39] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (RKemper) Seeing this error upon running puppet on `an-web1001`: ` Notice: /Stage[main]/Statistics::Sites::Stats/File[/etc/apache2/htpasswd.stats]/ensure: de...
[19:34:47] PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:35:31] RECOVERY - Check unit status of refinery-import-page-history-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:35:36] Seeing the following error when running puppet on `an-web1001`: https://phabricator.wikimedia.org/T285355#7256778 it seems like the puppet code might depend implicitly on files on /srv/ being there that need to be rsync'd, so I'm going to try rsyncing now
[19:38:20] First I need to read the puppet code and get a better understanding of `statistics::published` vs `statistics::rsync::published` :D
[19:39:35] Analytics, Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (Ottomata)
[19:39:48] Some of the comments in `statistics::published` imply that `statistics::rsync::published` is doing the actual work, but that's the file we intentionally didn't change
[19:40:09] But I think maybe I might want to manually trigger the `/usr/local/bin/hardsync` which does belong to `statistics::published`
[19:42:14] (CR) Ottomata: [V: +2 C: +2] Add refinery-source jars for v0.1.16 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/709808 (owner: Maven-release-user)
[19:45:29] hm
[19:45:59] oh interesting. ryankemper it may have been a LONG time since this puppet stuff was applied brand new
[19:46:02] perhaps we can fix some of that?
[19:46:31] that would be ideal :)
[19:47:01] like, maybe add a file { $wikistats_web_directory: ensure => directory ... } in there?
[19:47:25] I think the puppet run should be able to succeed without the data files.
[19:48:19] makes sense to me
[19:49:31] I think maybe we want to add that to `statistics::sites::stats`?
[19:49:49] ya exactly
[19:53:01] `/srv/stats.wikimedia.org/` on `thorium` is owned by `ezachte`. should I instead have this owned by root or keep it as that user (for an-web1001)?
[19:54:34] Analytics, Data-Engineering, Data-Engineering-Kanban: Gobblin Monitoring - https://phabricator.wikimedia.org/T287991 (odimitrijevic)
[19:55:35] PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:59:04] Analytics, Data-Engineering: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (odimitrijevic)
[19:59:09] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (odimitrijevic)
[19:59:23] Analytics, Analytics-Kanban, Data-Engineering: When gobblin fails, we should know about it - https://phabricator.wikimedia.org/T286559 (odimitrijevic)
[19:59:25] took a swing at it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709817
[19:59:25] Analytics, Data-Engineering, Data-Engineering-Kanban: Gobblin Monitoring - https://phabricator.wikimedia.org/T287991 (odimitrijevic)
[20:00:07] Analytics, Analytics-Kanban, Data-Engineering: When gobblin fails, we should know about it - https://phabricator.wikimedia.org/T286559 (odimitrijevic)
[20:00:11] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (odimitrijevic)
[20:01:07] Analytics-Kanban, Patch-For-Review: Add a spark job loading Cassandra 3 - https://phabricator.wikimedia.org/T280649 (odimitrijevic)
[20:01:29] Analytics, Data-Engineering, Data-Engineering-Kanban, Epic: Gobblin Monitoring - https://phabricator.wikimedia.org/T287991 (odimitrijevic)
[20:10:44] ! ryankemper yeah let's change that
[20:10:49] lemme see
[20:11:03] (sorry ping me if i'm unresponsive, will look at IRC more then :) )
[20:11:21] 11 minutes is way too slow!
[20:11:23] * ryankemper grabs pitchfork
[20:11:50] :)
[20:11:52] (jk obv, will ping if I ever get super blocked)
[20:11:56] I've got the owner as root but still got it as 755 for the chmod, not sure if that's what we'd want
[20:12:00] ryankemper: based on other files there, i think root:www-data looks right
[20:12:27] ryankemper: a LONG time ago wikistats (v1) was 100% maintained by ezachte, and he manually rsynced html files
[20:12:28] like
[20:12:43] https://stats.wikimedia.org/index-v1.html
[20:12:52] i don't think those are updated anymore (right fdans?)
[20:13:00] so, we can use root:www-data 755
[20:13:31] static site, no facebook tracking like button, monocolor background...it really is from back when the internet was perfect :P
[20:13:35] reading...
[20:14:12] right, nothing is updated in wikistats 1, it's just archival stuff
[20:14:57] okay so maybe `root:www-data` and `0755`?
[20:15:06] (that's what wikistats-v2 has anyway)
[20:15:32] or maybe the fact that it is updated / is not archived means we want more restrictive than `0755`?
is archived / is not updated*
[20:16:53] RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:17:11] RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:21:46] ottomata: okay how's https://gerrit.wikimedia.org/r/c/operations/puppet/+/709817 looking
[20:24:55] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:27:50] ryankemper: i think 755 is good
[20:28:01] unless...does www-data need to write there?
[20:28:02] hm.
[20:28:35] let's leave it 775
[20:28:35] ok
[20:28:37] looks good
[20:28:46] +1
[20:28:51] cool, merging
[20:31:33] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (RKemper) After applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/709817: ` ryankemper@an-web1001:~$ sudo run-puppet-agent Info: Using configured environment 'producti...
[20:33:19] looks like we need the htdocs dir in there too?
[20:33:39] $wikistats_source_directory/htdocs ryankemper?
[20:33:53] ottomata: yup that's my thinking as well, just taking a glance at what the code's doing
[20:37:12] ottomata: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709822
[20:37:24] I believe puppet doesn't recursively create directories, thus why we need two diff blocks
[20:37:37] er two different resources to use the correct terminology
[20:53:25] yup, you can put those on the same line if you want ryankemper
[20:53:39] file { [$wikistats_web_directory, "${wikistats_web_directory}/htdocs"]: ...
[20:53:52] since they have the same params
[20:54:02] man you pinged me and i still lagged in response!
[21:37:51] :P
[21:38:01] (just got back from lunch)
[21:45:28] ottomata: it's probably outside your work hours so can totally wait till tmrw, but https://gerrit.wikimedia.org/r/c/operations/puppet/+/709822 is ready whenever
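Once the two puppet patches above are merged, a quick way to confirm the outcome being discussed (directories created by puppet with root:www-data ownership and 0775 mode, even before any data rsync), shown as a sketch: run-puppet-agent appears verbatim in the log, but the assumption that $wikistats_web_directory resolves to /srv/stats.wikimedia.org is not visible in the chat and would need to be checked in statistics::sites::stats.

    # On an-web1001, after the patches are merged:
    sudo run-puppet-agent

    # The wikistats v1 directories should now exist with the agreed permissions.
    ls -ld /srv/stats.wikimedia.org /srv/stats.wikimedia.org/htdocs
    # expected, per the discussion above: drwxrwxr-x ... root www-data ...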