[00:34:03] (PS1) Nray: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739678 (https://phabricator.wikimedia.org/T294777)
[00:41:30] (PS2) Nray: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739678 (https://phabricator.wikimedia.org/T294777)
[03:44:18] (DruidSegmentsUnavailable) firing: More than 30 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:44:18] (DruidSegmentsUnavailable) firing: More than 20 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:54:18] (DruidSegmentsUnavailable) resolved: More than 30 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:54:18] (DruidSegmentsUnavailable) resolved: More than 20 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[06:03:19] Analytics, Data-Engineering: Define priorities for HDFS data to be backed up - https://phabricator.wikimedia.org/T283261 (odimitrijevic)
[06:09:52] Analytics, Data-Engineering, Event-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (odimitrijevic)
[06:09:54] Analytics, Data-Engineering: Add user accounts to LDAP group `analytics-privatedata-users` - https://phabricator.wikimedia.org/T295352 (odimitrijevic)
[06:09:56] Analytics, Data-Engineering, Event-Platform, Observability-Alerting: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (odimitrijevic)
[06:09:58] Analytics, Analytics-Wikistats, Data-Engineering, Product-Analytics: Support including edits to deleted pages in editing metrics - https://phabricator.wikimedia.org/T295212 (odimitrijevic)
[06:10:01] Analytics, Data-Engineering, Infrastructure-Foundations: Netflow data pipeline - https://phabricator.wikimedia.org/T257554 (odimitrijevic)
[06:15:50] Analytics, Data-Engineering, Data-Engineering-Kanban: Add user accounts to LDAP group `analytics-privatedata-users` - https://phabricator.wikimedia.org/T295352 (odimitrijevic) p:Triage→High
[06:16:20] Analytics, Data-Engineering, Data-Engineering-Kanban: Add user accounts to LDAP group `analytics-privatedata-users` - https://phabricator.wikimedia.org/T295352 (odimitrijevic) a:mforns
[06:23:01] Analytics, Analytics-Dashiki: Testing the secondary application of tags - https://phabricator.wikimedia.org/T295954 (odimitrijevic)
[06:23:53] Analytics,
Analytics-Dashiki: Testing the secondary application of tags - https://phabricator.wikimedia.org/T295954 (odimitrijevic) Open→Invalid Established that secondary herald rules do not get created.
[06:25:30] Analytics, Analytics-Dashiki, Story: Story: dashiki filters outliers - https://phabricator.wikimedia.org/T75316 (odimitrijevic) Open→Declined
[06:27:19] Analytics, Analytics-Dashiki: Icon font 404ing on metrics-staging - https://phabricator.wikimedia.org/T76747 (odimitrijevic) Open→Declined
[06:29:14] Analytics, Analytics-Dashiki: make Dashiki JSON pages display nicely - https://phabricator.wikimedia.org/T87441 (odimitrijevic) Open→Declined
[06:30:05] Analytics, Analytics-Dashiki: Commons page views in webstatscollector drop precipitously in 2015 - https://phabricator.wikimedia.org/T87589 (odimitrijevic) Open→Declined
[06:30:35] Analytics, Analytics-Dashiki: Dashiki should support totals for reported metrics - https://phabricator.wikimedia.org/T88391 (odimitrijevic) Open→Declined
[06:31:55] Analytics, Analytics-Dashiki, Epic: Epic: VSUser breaks down metric by target site - https://phabricator.wikimedia.org/T74135 (odimitrijevic) Open→Declined
[06:33:39] Analytics, Analytics-Dashiki: Removing lines updates URL hash, but editing URL hash or using back/forward buttons has no effect - https://phabricator.wikimedia.org/T76746 (odimitrijevic) Open→Declined
[06:36:44] Analytics-Radar: Feature request: Keeping track of time spent in phases of edits for users - https://phabricator.wikimedia.org/T268385 (odimitrijevic)
[06:37:28] Analytics, Analytics-General-or-Unknown: Increase in zero traffic for Grameenphone Bangladesh (470-01) around 2013-12-18 - https://phabricator.wikimedia.org/T60889 (odimitrijevic) Open→Declined
[06:38:12] Analytics, Analytics-General-or-Unknown: Current puppet does not allow to bring up a cluster in labs - https://phabricator.wikimedia.org/T70161
(odimitrijevic) Open→Declined
[06:38:37] Analytics, Analytics-General-or-Unknown: Hive queries can bring load on cluster slaves > #CPUs - https://phabricator.wikimedia.org/T65222 (odimitrijevic) Open→Declined
[06:39:09] Analytics-Radar: Provide regular cross-wiki reports on flagged revisions status - https://phabricator.wikimedia.org/T44360 (odimitrijevic)
[06:40:10] Analytics, Analytics-General-or-Unknown: Increase in traffic for Mobilink Pakistan (410-01) around 2013-12-02 - https://phabricator.wikimedia.org/T60891 (odimitrijevic) Open→Declined
[06:41:02] Analytics-Radar, Product-Analytics, Wikimedia-Interwiki-links, Wikipedia-Android-App-Backlog, I18n: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351 (odimitrijevic)
[06:41:19] Analytics, Analytics-General-or-Unknown: Slight drop in zero requests around 2014-02-08 - https://phabricator.wikimedia.org/T63274 (odimitrijevic) Open→Declined
[07:32:34] !log restart prometheus-druid-exporter on Druid Public to see metrics difference
[07:32:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:01:53] hello folks! dbstore1007 seems again showing up high memory usage
[09:43:53] elukey: I'll look at dbstore1007 today. Would be great to understand why it keeps happening, but that might take a lot of investigation.
[09:44:28] btullis: definitely, it seems a resource leak over time, but no idea why it happens :(
[09:45:00] btullis: for druid, this is what I was referring to https://grafana.wikimedia.org/d/000000538/druid?viewPanel=19&orgId=1&refresh=1m
[10:03:52] !log restart prometheus-druid-exporter on Druid Analytics to clear unnecessary metrics
[10:03:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:05:06] Analytics, Data-Engineering: Define priorities for HDFS data to be backed up - https://phabricator.wikimedia.org/T283261 (jcrespo) We have a template spreadsheet of some SRE-maintained datasets so we can keep track of its current properties and state. Would that be a useful tool for you to classify your...
[10:14:47] PROBLEM - Check unit status of check_webrequest_partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:28:35] elukey: Yes, I see what you mean now. It looks to me like they might have addressed this issue in the exporter though: https://github.com/opstree/druid-exporter/issues/100
[10:29:21] So I wonder whether I should look at updating the exporter to v0.11 instead of restarting the exporter in the cookbook: https://github.com/opstree/druid-exporter/blob/master/CHANGELOG.md
[10:29:22] btullis: that is not the exporter we use :) We created one in python before the opstree came out
[10:29:35] Doh. I wondered whether that might be the case.
[10:29:48] Couldn't find our repo or any mention of it on wikitech.
[10:31:31] I haven't had the time to fix that bug yet, but I had a similar idea to what the opstree folks did, so it may be easy to fix
[10:31:41] I'll try to put some time on it during the next weeks
[10:34:43] Doesn't matter. I'm happy to +1 your change to the cookbook as is.
I've found the repo now and I'll link to it from Wikitech: https://gerrit.wikimedia.org/r/admin/repos/operations/software/druid_exporter
[10:36:06] I should have searched phab as well. Found this useful ticket: https://phabricator.wikimedia.org/T177459
[10:36:39] we can also think about moving to the opstree exporter if it works better
[10:36:50] (I think it is written in go, definitely faster)
[10:44:01] Yes, I'll have a quick look to see if there are any differences in the metrics that we would gain or lose as a result of switching.
[11:01:25] elukey: It looks to me like our version is every bit as comprehensive, so there'd be little point in changing.
[11:06:38] I'd like to do a rolling-restart of kafka-jumbo today as part of T295673 - Any reason not to go ahead? Anything I need to know about?
[11:07:05] I've read this: https://wikitech.wikimedia.org/wiki/Service_restarts#Kafka_brokers_(analytics) and this: https://phabricator.wikimedia.org/T136690
[11:08:27] ...so I'm on the lookout for disk full alarms and anything else alerting from kafka-jumbo nodes.
[11:53:31] btullis: the only thing that I check before starting are kafka metrics, to avoid any ongoing issues. The cookbook should work nicely
[11:53:54] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (BTullis) This has occurred again on dbstore1007. {F34753764,width=700} I will do some in...
[11:57:30] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) @BTullis mariadb 10.4.22 has fixed some memory leaks, which might or might be...
[11:59:40] elukey: Thanks. Yep the dashboard for kafka-jumbo looks OK to me. No under replicated partitions, all brokers appear healthy.
I'm a bit unsure about why the partitions count isn't more evenly balanced here: https://grafana.wikimedia.org/d/000000027/kafka?viewPanel=48&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All
[12:02:29] btullis: Razzi worked on it some months ago, we had to move some partitions to the new kafka brokers (100[789]) and we decided to move the ones doing the bulk of the traffic
[12:02:55] so the partitions/brokers ratio is still unbalanced, but the rest is good (traffic/broker, etc..)
[12:06:08] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (BTullis) @Marostegui - That sound like a great idea to me. I think that now would be a go...
[12:06:43] elukey: Thanks for the explanation. Makes sense. OK, I'll kick off the cookbook now.
[12:08:16] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (jcrespo) There was a version of mariadb that was recently packaged that mentioned somethi...
[12:11:35] marostegui is discussing the possibility of upgrading MariaDB on dbstore1007 to 10.4.22. I can't see any reason not to go ahead with this at the moment, can anyone else?
[12:11:35] I'll announce it in the analytics channel in slack and ask if anyone has any objections there too. But checking for SELECT operations on Grafana shows that they're hardly being used at the moment.
[12:13:29] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) >>!
In T290841#7513451, @BTullis wrote: > @Marostegui - That sound like a gre...
[12:19:58] btullis: it is totally fine to proceed, clients will not notice anything, +1
[12:20:50] elukey: Thanks. That's what I suspected, but wanted to check.
[12:21:50] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) Upgrade done, replication started.
[12:33:55] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (BTullis) Great, thanks. I'll try to keep an eye on [[https://grafana.wikimedia.org/d/0000...
[13:13:05] (PS4) AKhatun: Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834)
[13:18:15] (CR) jenkins-bot: [V: -1] Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834) (owner: AKhatun)
[15:09:26] (CR) Ottomata: "Hiya," [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739659 (https://phabricator.wikimedia.org/T294246) (owner: Clare Ming)
[15:14:43] RECOVERY - Check unit status of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:22:51] Hi! Does anybody know if I can retrieve revisions that were disallowed/blocked by an AbuseFilter?
(it seems that abuse_filter_log.afl_rev_id is always NULL when abuse_filter_log.afl_actions IN ('block', 'disallow'))
[15:26:20] elaragon: if AbuseFilter blocks the edit, there will by definition be no revision
[15:36:02] majavah: Thanks! I found that many spambots were locked with 0 edits and I was told that this is because many trigger an AbuseFilter, so I was wondering if the text was stored and therefore retrievable somehow (I understand it is not).
[15:45:20] (CR) Clare Ming: Update web_ui_scroll schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739659 (https://phabricator.wikimedia.org/T294246) (owner: Clare Ming)
[16:49:29] (PS1) DLynch: Update talk_page_edit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076)
[16:50:37] (CR) DLynch: "I'm assured that for a minor fix like this I shouldn't need to update the schema version." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[17:11:05] (PS2) Bartosz Dziewoński: Update talk_page_edit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[17:11:46] (CR) Bartosz Dziewoński: [C: +1] Update talk_page_edit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[17:23:39] (CR) Clare Ming: [C: +2] Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739678 (https://phabricator.wikimedia.org/T294777) (owner: Nray)
[17:25:14] (Merged) jenkins-bot: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739678 (https://phabricator.wikimedia.org/T294777) (owner: Nray)
[17:26:58] !log varnishkafka-webrequest on cp3050 is running with
/etc/ssl/localcerts/wmf_trusted_root_CAs.pem
[17:27:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:42:04] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 3 others: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (bwang) a:bwang
[18:15:34] (CR) Ottomata: Update web_ui_scroll schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739659 (https://phabricator.wikimedia.org/T294246) (owner: Clare Ming)
[18:20:55] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (odimitrijevic) a:Ottomata→None
[18:23:00] elaragon: I think AF stores internally somehow, not sure about the details
[18:26:20] Analytics, Data-Engineering, Data-Engineering-Kanban: Add user accounts to LDAP group `analytics-privatedata-users` - https://phabricator.wikimedia.org/T295352 (Ottomata) Hi, I believe this will require the usual [[ https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process | ac...
[18:49:07] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye
[18:58:26] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson)
[19:06:08] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye executed wit...
[19:17:23] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) updated site.pp, ran the script again and it made it to the debian installer but failed on raid cfg.
[19:21:37] (PS1) Milimetric: Create discussiontools_subscription table in Hive [analytics/refinery] - https://gerrit.wikimedia.org/r/739922 (https://phabricator.wikimedia.org/T290516)
[19:31:03] Analytics: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (nettrom_WMF) @razzi : This came up in a discussion in our team today. Looks like the upstream bug report is now closed. Should we reopen that, or is it something on our side that breaks this and we need to loo...
[19:33:53] Analytics: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (razzi) Good callout @nettrom_WMF, indeed that should be reopened. I'll add some steps to reproduce as well
[19:36:55] razzi: do you know the syntax for systemd timer intervals? Like once a month is apparently *-*-01 00:00:00 somehow...
[19:37:09] I'm not familiar, milimetric
[19:37:27] that doesn't look like cron: https://github.com/wikimedia/puppet/blob/b45b56d1f9042f2c6648e283567689ccad3f7bbc/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp#L86
[19:37:33] ottomata: any idea?
[19:38:02] when I look up "puppet systemd interval syntax" google just laughs at me
[19:38:27] I can understand what it's saying though: *-*-01 means every year, every month, on the first day ie monthly
[19:39:01] 2011-01-01 00:00:00 matches, 2011-01-02 00:00:00 does not
[19:39:12] milimetric: maybe help
[19:39:12] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers#Calculating_OnCalendar_interval
[19:39:13] ?
[19:39:38] perfect
[19:39:44] woah, Luca's so cool :)
[19:39:44] ooh that's a nice cmd
[19:40:21] milimetric: https://opensource.com/article/20/7/systemd-timers looks good too
[19:40:26] scroll to Calendar event specifications
[19:48:52] thanks very much, patches submitted. Been a while since I wrote any puppet :)
[19:53:00] (PS5) Joal: Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834) (owner: AKhatun)
[19:59:11] mforns: you still there? i think maven code is worked out, and hdfs stuff is tested, can i demo and we can discuss?
[21:17:11] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 3 others: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (bwang) a:bwang→None
[21:38:34] (PS1) Mayakpwiki: movement_metrics: Add error test notebook [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/739938 (https://phabricator.wikimedia.org/T295733)
[21:40:11] ottomata: I was having dinner, if you still there I can meet now!
[21:47:52] mforns: okay i think annie will be back very soon lets try!
[21:48:06] ok omw
[22:08:10] (CR) Bearloga: [V: +2 C: +2] "Verified and it errors out like it's supposed to" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/739938 (https://phabricator.wikimedia.org/T295733) (owner: Mayakpwiki)
[22:51:57] (CR) Bearloga: [C: +1] "Looks good to me! And yes to updating without bumping version, since this is how version 1.0.0 should have looked like originally and doin" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
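Note on the 19:36-19:39 exchange about systemd timer intervals: the reading given there (*-*-01 00:00:00 is "every year, every month, on the first day", so 2011-01-01 00:00:00 matches and 2011-01-02 00:00:00 does not) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how systemd parses calendar events: the function name matches_calendar is hypothetical, and it handles just the plain "Y-M-D h:m:s" form where each field is either * or an integer. Weekdays, lists, ranges and steps are not covered; systemd's own `systemd-analyze calendar` is the better check for real expressions.

```python
from datetime import datetime


def matches_calendar(spec: str, when: datetime) -> bool:
    """Check a timestamp against a simplified OnCalendar-style spec.

    Hypothetical helper for illustration only. It accepts the
    'Y-M-D h:m:s' form where every field is '*' (any value) or a
    plain integer, and nothing else.
    """
    date_part, time_part = spec.split()
    fields = date_part.split("-") + time_part.split(":")
    actual = [when.year, when.month, when.day,
              when.hour, when.minute, when.second]
    if len(fields) != len(actual):
        raise ValueError(f"unsupported spec: {spec!r}")
    # A field matches when it is a wildcard or equals the timestamp's value.
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))


MONTHLY = "*-*-01 00:00:00"  # the expression quoted in the chat
print(matches_calendar(MONTHLY, datetime(2011, 1, 1)))  # first of the month
print(matches_calendar(MONTHLY, datetime(2011, 1, 2)))  # second of the month
```

The two calls mirror the examples given at 19:39:01; the first timestamp satisfies every field and the second fails on the day field.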