[05:59:55] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Replication positions: {P16309}
[06:01:32] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) ` root@dbstore1007:/srv# sudo lvextend -L+1100G /dev/mapper/tank-data && sudo xfs_growfs /srv Size of logical volume tank/data changed from <7.5...
[06:01:47] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Transfer between dbstore1004 and dbstore1007 started
[06:06:26] hello folks, Manuel stopped dbstore1004 to copy data to dbstore1007 --^
[06:06:29] as FYI
[06:09:16] Analytics: Requesting Kerberos password - https://phabricator.wikimedia.org/T284022 (elukey) >>! In T284022#7136400, @Aklapper wrote: > Reading https://wikitech.wikimedia.org/wiki/Analytics/Data_access it's unclear to me if this is handled by #sre-access-requests or #Analytics - would be great if someone cou...
[06:09:49] Analytics: Requesting Kerberos password - https://phabricator.wikimedia.org/T284022 (elukey) Stalled→Open p:Triage→Medium a:razzi
[06:20:18] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi dbstore1007 still needs to get the proper FW rules (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/697704), as it cannot reach...
[06:34:20] Good morning
[07:14:22] bonjour
[07:14:54] I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194 for the saveNamespace issue
[07:15:13] I've seen it elukey :)
[07:15:16] thanks a lot!
[08:08:56] Analytics, Analytics-Kanban, Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (kai.nissen) Sorry, long wait for a short answer: No, the Fundraising team does not need the two fields, either.
[09:57:30] Analytics: Requesting Kerberos password - https://phabricator.wikimedia.org/T284022 (Aklapper) Thanks @elukey!
[11:55:58] elukey: any idea on the status of T283084 ? I see a bunch of patches, but I'm not sure if something else still needs to be done :/
[11:55:59] T283084: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084
[11:56:29] elukey: and I realize that you might not be the right person to ask that question anymore :/ maybe joal ?
[11:57:32] Hi gehel - putting my thoughts on this topic back in place, your answer should be ready in ~3minutes :)
[11:57:56] joal: thanks! If you can put a comment on the ticket, that would be appreciated!
[11:59:45] gehel: I'll let Andrew do that, he owns the work and will be more precise than I am - IIRC the behavior of the canary-event generator has been updated to not-fail globally after the first failure (hum), and in this process a bug introduced itself leading to more patches than expected
[12:00:13] damn bugs!
[12:00:29] ok, I'll try to remember to ping Andrew for updates later today
[12:00:30] gehel: if not yet fixed, it should be soon I think (the general case - the special one of that day needs to be fixed manually)
[12:00:50] gehel: I'm asking him on the task (easiest)
[12:01:39] done!
[12:02:10] Analytics-Clusters, Discovery-Search (Current work), Patch-For-Review: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (JAllemandou) Heya @Ottomata - Could you please provide a status summary on this (asked by @Gehel on IRC) - th...
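Editor's sketch: the behavior joal describes at 11:59:45 (keep going after the first failure instead of failing globally) might look roughly like the following. This is a hypothetical illustration in Python, not the canary-event generator's actual code; `produce_all`, `produce_canary_event` and the stream names are all made up for the example.

```python
from typing import Callable, Dict, Iterable


def produce_all(
    streams: Iterable[str],
    produce_canary_event: Callable[[str], None],  # hypothetical per-stream producer
) -> Dict[str, str]:
    """Try every stream; return a map of stream name -> error message for failures."""
    failures: Dict[str, str] = {}
    for stream in streams:
        try:
            produce_canary_event(stream)
        except Exception as exc:
            # Record the failure and keep going, so one bad stream
            # does not block the streams that come after it.
            failures[stream] = str(exc)
    return failures


if __name__ == "__main__":
    def fake_producer(stream: str) -> None:
        # Stand-in producer that fails for exactly one stream.
        if stream == "stream_b":
            raise RuntimeError("schema not found")

    failed = produce_all(["stream_a", "stream_b", "stream_c"], fake_producer)
    print(failed)  # {'stream_b': 'schema not found'}
```

A caller could still exit non-zero when the returned map is non-empty, so the job is reported as failed without skipping the remaining streams.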
[12:27:54] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) This host needs ipv6 dns to be deleted from netbox
[12:33:00] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) >>! In T283125#7138250, @Marostegui wrote: > This host needs ipv6 dns to be deleted from netbox Done
[12:57:18] Analytics, Analytics-Kanban, Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (mforns) Great! Thanks a lot @kai.nissen.
[13:05:16] elukey: o/ i just merged the west1 uid change....and i did not have to run the chowns????
[13:05:54] Jun 7 13:00:56 stat1007 puppet-agent[22486]: (/Stage[main]/Admin/Admin::Hashuser[west1]/Admin::User[west1]/User[west1]/uid) uid changed 1097 to 10972
[13:08:27] ottomata: good morning! so admin does a chown -R behind the scenes?
[13:08:47] otherwise I don't explain
[13:11:13] i know, i can't explain either...
[13:11:19] didn't see that in puppet output
[13:12:12] the admin::user class creates a user resource with managehome
[13:12:19] maybe that thing takes care of everything
[13:12:30] maybe the ohhh perhaps
[13:12:38] anyway, I ran a find and it looks super good :)
[13:13:40] managehome
[13:13:41] https://puppet.com/docs/puppet/6/types/user.html#managehome
[13:13:45] This parameter has no effect unless Puppet is also creating or removing the user in the resource at the same time.
[13:13:54] ¯\_(ツ)_/¯
[13:13:58] oook
[13:13:59] ty
[13:14:40] puppet magic
[13:14:50] :)
[13:27:14] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Host added to tendril and zarcillo.
Set to active on Netbox
[13:27:36] Analytics: Kerberos identity for phuedx - https://phabricator.wikimedia.org/T284096 (phuedx) Thanks!
[13:32:21] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi transfer has finished and I have configured replication. As soon as you push the new firewall rules, it will start catching up automatically.
[13:32:29] elukey: just checking this
[13:32:29] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/697704
[13:32:32] for arzhel
[13:32:42] that's just the sanitized cloud db replica right?
[13:32:46] we only need that one host?
[13:34:00] dbstore1007 should be the replacement for dbstore1004
[13:34:08] where we have partition space problems
[13:34:48] we need to allow ipv4, but the extra term for ipv6 is not needed since for db nodes we don't set AAAA records
[13:35:27] Manuel removed the AAAA record for dbstore1007 today IIUC
[13:35:36] so the change should be amended to remove the ipv6 term
[13:35:48] to be consistent with the rest
[13:36:41] ottomata: --^
[13:49:04] thanks luca!
[13:49:07] sent that to arzhel
[14:18:27] (PS1) Gerrit maintenance bot: Add dag.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/698524 (https://phabricator.wikimedia.org/T284450)
[14:28:02] Hi, good morning
[14:28:55] ottomata: I'll remove the ipv6 records from https://gerrit.wikimedia.org/r/c/operations/homer/public/+/697704
[14:38:52] ok!
[14:38:57] mornin
[14:39:57] razzi: o/, LGTM, feel free to deploy anytime (Arzhel said it was ok even if we are working on Capirca)
[14:40:10] cool
[14:40:24] Are there any special steps to deploy homer?
[14:41:23] should be a very simple change, targets are cr1-eqiad and cr2-eqiad (we did a couple together in the past right?
If no I can go into more details)
[14:41:32] *if not
[14:47:34] oh right, I remember now
[14:50:13] elukey: o/
[14:50:13] https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/druid/public/worker.yaml#L9-L12
[14:50:14] ?
[14:50:24] should the comment there say bigtop instead of cloudera now?
[14:52:07] yep
[14:54:56] ty
[14:56:20] razzi: don't forget to !log the action in #operations
[14:56:26] (when you do it0
[14:56:29] )
[14:56:37] sounds good
[15:00:43] Analytics, Patch-For-Review: Requesting Kerberos password - https://phabricator.wikimedia.org/T284022 (razzi) Open→Resolved Should be all set
[15:07:13] Analytics, Analytics-Kanban, Analytics-Wikistats: New Wikivoyages are only partially included in Stats - https://phabricator.wikimedia.org/T279564 (Ottomata) a:razzi
[15:21:17] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (razzi) @Marostegui new firewall rules are pushed, thanks for the update on your end.
[15:23:40] Fun for my SREs: https://twitter.com/masaruhoshi/status/1401341565365272577/photo/1
[15:30:25] Analytics, Analytics-Wikistats: Offer map and queries by subdivisions of countries - https://phabricator.wikimedia.org/T284294 (Ottomata) p:Triage→Medium
[15:31:01] Analytics: labstore1006 possible kerberos issue - https://phabricator.wikimedia.org/T284261 (Ottomata) Open→Declined Closing, feel free to reopen if there is something to be done.
[15:31:37] oh forgot - mforns would you have a minute to talk about alerts?
[15:31:54] joal: yes!
[15:31:55] I'm in the cave mforns
[15:31:58] k
[15:34:37] razzi, ottomata - ops sync?
[15:34:59] otherwise I can reuse this time to debug some horror in kubeflow :D
[15:41:51] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used.
- https://phabricator.wikimedia.org/T283125 (Marostegui) As mentioned on IRC, there might be something else needed as it cannot reach its master yet: ` root@dbstore1007:~# telnet db1122.eqiad.wmnet 3306...
[15:59:42] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi some more things to double check: ` [17:57:10] <@marostegui> Also, is dbstore1007 in the same vlan as dbstore1004? [17:57:56] an...
[16:02:20] Analytics, Data-release, Privacy Engineering, Research, Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (fkaelin) One point of confusion that I have on a conceptual level: currently we apply what we loosely call a k-anonymity threshold, e.g...
[16:09:15] mforns: milimetric razzi joal wanna pick a time to do some airflow play?
[16:09:17] not everybody has to come
[16:09:29] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (elukey) I am not sure if it is going to be a nightmare or not, but to avoid wiping the copy that Manuel did between 1004 and 1007 today with a reimage we coul...
[16:09:41] razzi: --^ this is what I have in mind (Riccardo is in a meeting)
[16:09:47] maybe in 3hrs?
[16:10:07] ottomata: sure
[16:10:16] i'll make a little cal invite
[16:11:08] ottomata: yes!
[16:11:25] btw ottomata I'm still finding missing revisions, looks like roughly as many as before, but no time to dig too much deeper right now. I was hoping there'd be none and we could just hooray
[16:19:08] innnnteresting!
[16:20:39] Analytics-Clusters, Analytics-Kanban, Discovery-Search (Current work), Patch-For-Review: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (Ottomata) Open→Resolved
[16:20:59] Analytics-Clusters, Analytics-Kanban, Discovery-Search (Current work), Patch-For-Review: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (Ottomata) a:Ottomata Oh, I think I forgot to update this because we never groomed i...
[16:37:29] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) We can reimage without wiping /srv if needed too
[16:40:52] Analytics-Radar, ChangeProp, Event-Platform, Platform Engineering, and 3 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (hashar)
[16:43:56] elukey: do you know, is it safe to just systemctl restart a mysqld replica?
[16:44:04] want to do that for analytics-meta replica on db1108
[16:45:09] ottomata: better to stop slave; restart; start slave in case
[16:45:43] ok, do I need to sync with DBAs or can i just log in -ops and just do?
[16:46:38] elukey: ^ ?
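Editor's sketch: the "stop slave; restart; start slave" order suggested at 16:45:09 can be written down as a small dry-runnable helper. The systemd unit name `mariadb` and the password-less `mysql -e` invocation are assumptions for illustration, not details taken from db1108, and `replica_restart_steps` / `restart_replica` are hypothetical names.

```python
from typing import Callable, List

# Assumptions: the systemd unit is called "mariadb" and the local `mysql`
# client can authenticate without extra flags. Adjust both for the real host.
SERVICE = "mariadb"


def replica_restart_steps(service: str = SERVICE) -> List[List[str]]:
    """Ordered commands: stop replication, restart the daemon, start replication."""
    return [
        ["mysql", "-e", "STOP SLAVE;"],
        ["systemctl", "restart", service],
        ["mysql", "-e", "START SLAVE;"],
    ]


def restart_replica(run: Callable[[List[str]], None], service: str = SERVICE) -> None:
    """Run each step in order; `run` should raise on failure so later steps are skipped."""
    for cmd in replica_restart_steps(service):
        run(cmd)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    restart_replica(lambda cmd: print(" ".join(cmd)))
```

To execute for real, one could pass a runner built on `subprocess.run(cmd, check=True)`, which raises on a non-zero exit code and therefore stops the sequence before replication is started on a daemon that failed to come back.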
[16:47:48] if it is 1108 you can do it anytime, maybe do a quick check via show processlist \G to see if any backup is being taken (don't recall exactly when it happens)
[16:48:40] k
[16:49:05] looks ok
[16:50:50] !log restarting mysqld analytics-meta replica on db1108 to apply config change - T272973
[16:50:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:50:53] T272973: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973
[16:51:24] elukey: huh it looks like after restart the slave just starts on its own
[16:51:39] anyway, done, thank you
[16:51:40] :)
[16:53:27] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (elukey) The reimage is surely good, but I think that we'd need to fix the ips manually anyway in netbox first. @Volans do you have suggestions about what's be...
[16:56:42] ottomata: ah perfect
[17:07:41] !log remove packages from an cluster nodes: sudo apt-get -y remove r-cran-rmysql python3-matplotlib python3-sklearn python3-enchant python3-nltk gfortran liblapack-dev libopenblas-dev - T275786
[17:07:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:07:44] T275786: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786
[17:21:48] Analytics-Clusters, Analytics-Kanban: Remove all debian python-* and other user requested packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (Ottomata)
[17:26:23] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Things to figure out update: [x] Airflow database Done. No HA at this time.
[x] DAG dir and distribution We...
[17:26:33] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) Ok, new plan as we discussed at ops sync is to try the upgrade again next week - I'm picking Tuesday June 15. We'll see if the new memory + threads s...
[17:34:17] Gone for tonight - see y'all tomorrow
[17:36:57] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Volans) @elukey what do you need to change, just the vlan hence the IP? Ping me tomorrow and we can do it together.
[17:43:21] elukey: i think docs are wrong here, yes?
[17:43:22] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/kafka/roll-restart-brokers.py
[17:43:28] this cookbook does not reboot the node?
[17:43:42] nor mirror maker?
[17:44:50] Analytics, Analytics-Kanban, Event-Platform, Growth-Team, and 3 others: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (Milimetric) Just a quick note to say that I ran the query for May 17th, and still found mismatches on both sides. I will find a wa...
[17:47:49] ottomata: yes it is a copy/paste issue
[17:48:03] k
[17:48:03] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/698586
[17:48:04] gr8
[17:49:01] FYI elukey razzi i'm going to do jumbo mirror maker and then broker restarts for T283067
[17:49:39] ottomata: I commented with a nit, since you are there fixing, it should mention "restart"
[17:50:57] ty!
[17:52:15] +1 for the restarts, it takes a long time now to restart jumbo with the pauses
[17:52:30] but the nice thing is that it will run without really bothering you
[17:52:46] :)
[17:53:01] elukey: ...
this is my first time running a cookbook
[17:53:34] ottomata: it changed my life, especially for hadoop :D
[17:53:50] !log rolling restart of kafka jumbo mirror makers - T283067
[17:53:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:14:18] !log rolling restart of kafka jumbo brokers - T283067
[18:14:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:59:06] mforns: razzi milimetric let's just use bc for airflow, not the meet in the cal event
[18:59:13] cool
[18:59:14] ok
[19:29:40] Analytics, Analytics-Kanban, Product-Analytics, SRE, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Volans) Adding @lucyblackwell for approval.
[19:44:33] Analytics-Radar, observability, Puppet, Services (watching), User-Elukey: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (Eevans)
[20:10:53] Analytics-Radar, Cassandra, Services (watching): Inconsistent Cassandra disk load shown in metrics and nodetool status - https://phabricator.wikimedia.org/T146130 (Eevans) Open→Resolved a:Eevans >>! In T146130#4396352, @Eevans wrote: > @elukey, is this still a thing? I'm going to assume...
[20:11:50] Analytics, Analytics-Kanban: Crunch and delete many old dumps logs - https://phabricator.wikimedia.org/T280678 (WDoranWMF) @Milimetric The dashboard @Addshore shared is good enough for our purposes.
[20:12:17] Analytics-Clusters, Analytics-Kanban, Cassandra, Patch-For-Review: Set up a testing environment for the AQS Cassandra 3 migration - https://phabricator.wikimedia.org/T257572 (Eevans) Is this issue still relevant?
[20:12:33] mforns: https://airflow.apache.org/docs/apache-airflow-providers-apache-hdfs/stable/_api/airflow/providers/apache/hdfs/sensors/hdfs/index.html
[20:12:37] i think ^ did not exist in airflow 1, right?
[20:36:14] ottomata: i don't remember
[20:46:16] ok i do remember
[20:46:18] it did, but snakebite
[20:46:22] but sorta snakebite is now python3
[20:46:30] https://github.com/apache/airflow/pull/5659#issuecomment-651770583
[20:46:33] and maybe even should work with kerberos
[20:48:13] but https://github.com/apache/airflow/pull/5659#issuecomment-651770583
[20:48:17] thanks to our friend luca
[20:48:33] i think we should look into reimplementing https://github.com/apache/airflow/blob/main/airflow/providers/apache/hdfs/hooks/hdfs.py with pyarrow
[20:48:37] pyarrow is better now
[20:52:58] aha
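Editor's sketch: a pyarrow-based replacement for the snakebite HDFS hook, as floated at 20:48:33, might start from something like the code below. It is a minimal illustration under stated assumptions and not Airflow's API: `connect_hdfs`, `flag_exists` and `wait_for_flag` are hypothetical names, the idea of waiting on a flag file such as `_SUCCESS` is an assumption about the use case, and connecting needs pyarrow plus a working libhdfs/Hadoop client setup on the host.

```python
import time
from typing import Any, Callable, Optional


def connect_hdfs(kerb_ticket: Optional[str] = None) -> Any:
    """Connect to the cluster's default filesystem as defined in the Hadoop config.

    `kerb_ticket` is the path to a Kerberos ticket cache, if one is needed.
    """
    from pyarrow import fs  # imported lazily so the helpers below work without pyarrow

    return fs.HadoopFileSystem("default", kerb_ticket=kerb_ticket)


def flag_exists(filesystem: Any, path: str) -> bool:
    """True if `path` is an existing regular file (for example a _SUCCESS flag)."""
    return bool(filesystem.get_file_info(path).is_file)


def wait_for_flag(
    filesystem: Any,
    path: str,
    timeout_s: float = 60.0,
    poke_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the flag file appears; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if flag_exists(filesystem, path):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poke_interval_s)
```

Usage would be along the lines of `wait_for_flag(connect_hdfs(), "/some/hypothetical/path/_SUCCESS")`. Keeping the filesystem as a parameter means the polling logic can be exercised against `pyarrow.fs.LocalFileSystem` or a stub, without a Hadoop cluster.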