Update $wgNukeMaxAge to 90 days in Nuke
Closed, Resolved · Public · 5 Estimated Story Points

Description

Per T379147, we will now gradually increase the time frame during which Nuke can retrieve pages to delete, and monitor how it performs.

We will track performance via:

As a first step, let's increase $wgNukeMaxAge to 90 days.

Event Timeline

But this does make me worry that some of our title-based queries will now start to time out as well.

Yup - did a few searches based on LIKE and got timeouts there too.

Change #1100142 had a related patch set uploaded (by Chlod Alejandro; author: Chlod Alejandro):

[operations/mediawiki-config@master] Revert "Increase Nuke max age to 90 days"

https://gerrit.wikimedia.org/r/1100142

Due to the unexpected timeouts affecting title-based queries, this needs a revert while we take a step back and investigate. I've filed the revert and scheduled it for deployment in the Tuesday, December 03 UTC late backport window (21:00–22:00 UTC).

Page filters not working as efficiently while user filters work fine makes me think we don't have a good index to anchor the query on (whereas we do if we're searching by user). Unfortunately, I'm unsure of how to further optimize the query; this might be something that someone with more experience with the database would know. Aside from that, I also don't have access to the databases (except Cloud replicas) nor to Nuke on the bigger wikis with thousands of revisions, so it's not something I can actively investigate.
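
(As a sketch of how one could check which index a title-based search of this shape anchors on, assuming replica access and the standard MediaWiki schema; the title pattern below is only an illustrative example:)

-- Sketch only: list the indexes available on the revision table.
SHOW INDEX FROM revision;

-- Sketch only: ask the optimizer for its plan on a title-based search of this shape.
EXPLAIN
SELECT DISTINCT page_title, page_namespace, actor_name
FROM revision
JOIN actor ON actor_id = rev_actor
JOIN page ON page_id = rev_page
WHERE rev_parent_id = 0
  AND rev_timestamp > '20241103170506'
  AND page_title LIKE '%(video`_game)' ESCAPE '`'
ORDER BY rev_timestamp DESC
LIMIT 500;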

Change #1100142 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "Increase Nuke max age to 90 days"

https://gerrit.wikimedia.org/r/1100142

Mentioned in SAL (#wikimedia-operations) [2024-12-03T16:38:23Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1100142|Revert "Increase Nuke max age to 90 days" (T380846)]]

Mentioned in SAL (#wikimedia-operations) [2024-12-03T16:50:53Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1100142|Revert "Increase Nuke max age to 90 days" (T380846)]] (duration: 12m 29s)

@Chlod I just went ahead and deployed the revert ahead of the window, as the increase appears to have user impact.

Do you have a SQL query that's failing to execute? That would probably be a good starting point. I'd be happy to help with identifying the next steps from there.

Do you know if the error was reproducible on beta (en.wikipedia.beta.wmflabs.org)? That might give you a place to start investigating (happy to promote you on beta if that'd be helpful).

includes/SpecialNuke.php#L606

I guess this is an example query:
https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2024.12.03?id=k298jZMBPAEUXp-LDInZ

SET STATEMENT max_statement_time=30 FOR SELECT DISTINCT page_title,page_namespace,actor_name  FROM `revision` JOIN `actor` ON ((actor_id=rev_actor)) JOIN `page` ON ((page_id=rev_page))   WHERE (rev_parent_id = 0) AND (rev_timestamp > '20241103170506') AND (page_title LIKE '%(video`_game)' ESCAPE '`')  ORDER BY rev_timestamp DESC LIMIT 500

Note that the example still uses the previous 30-day limit. However, I timed a run against enwiki from a stat host with two different dates to see the performance impact:

query with timestamp > 20240903170506
run time: 35.75752902030945

query with timestamp > 20241103170506
run time: 9.42602014541626

Both are over the ~5 second read query threshold, but clearly increasing the timestamp window has a big impact. Perhaps reworking this into a subquery that limits by timestamp ahead of the join could mitigate that. I'll try to put such a query together.
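
(For illustration, one possible shape for that rework; this is a sketch, not a tested query, and whether MariaDB actually materializes the derived table rather than merging it back in is a separate question:)

-- Sketch only: restrict revision by timestamp in a derived table before joining page and actor.
SELECT DISTINCT page_title, page_namespace, actor_name
FROM (
    SELECT rev_page, rev_actor, rev_timestamp
    FROM revision
    WHERE rev_parent_id = 0
      AND rev_timestamp > '20241103170506'
) AS recent_rev
JOIN actor ON actor_id = recent_rev.rev_actor
JOIN page ON page_id = recent_rev.rev_page
WHERE page_title LIKE '%(video`_game)' ESCAPE '`'
ORDER BY recent_rev.rev_timestamp DESC
LIMIT 500;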

Hmm, actually, I think the range check is happening first already. I'm having a Jupyter / wmfdata.mariadb output issue where the output of mariadb.run is truncated, so I can't see the full output, but it looks like the timestamp range is the first thing in the execution plan.

Given that, I suggest reworking the query to filter on the page name ahead of the timestamp; it would have more variable performance, but in this case should be faster.

Ooh, case in point: changing the timestamp to a number instead of a string changes the optimizer result and reduces execution time:

with quotes

ANALYZE
SELECT DISTINCT page_title, page_namespace, actor_name
FROM revision JOIN actor ON ((actor_id=rev_actor)) JOIN page ON ((page_id=rev_page))
WHERE (revision.rev_parent_id = 0) AND (revision.rev_timestamp > '20240903170506') AND (page_title LIKE '%(video`_game)' ESCAPE '`');


	id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	r_rows 	filtered 	r_filtered 	Extra
0 	1 	SIMPLE 	revision 	range 	rev_timestamp,rev_actor_timestamp,rev_page_act... 	rev_timestamp 	14 	None 	31090796 	16945076.00 	100.0 	3.94 	Using index condition; Using where; Using temp...
1 	1 	SIMPLE 	page 	eq_ref 	PRIMARY 	PRIMARY 	4 	enwiki.revision.rev_page 	1 	1.00 	100.0 	0.02 	Using where
2 	1 	SIMPLE 	actor 	eq_ref 	PRIMARY 	PRIMARY 	8 	enwiki.revision.rev_actor 	1 	1.00 	100.0 	100.00

run time: 27.61525821685791

without quotes

ANALYZE
SELECT DISTINCT page_title, page_namespace, actor_name
FROM `revision` JOIN `actor` ON ((actor_id=rev_actor)) JOIN `page` ON ((page_id=rev_page))
WHERE (revision.rev_parent_id = 0) AND (revision.rev_timestamp > 20240903170506) AND (page_title LIKE '%(video`_game)' ESCAPE '`');


	id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	r_rows 	filtered 	r_filtered 	Extra
0 	1 	SIMPLE 	page 	index 	PRIMARY 	page_name_title 	261 	None 	62447787 	61955689.00 	100.0 	0.02 	Using where; Using index; Using temporary
1 	1 	SIMPLE 	revision 	ref 	rev_timestamp,rev_actor_timestamp,rev_page_act... 	rev_page_timestamp 	4 	enwiki.page.page_id 	10 	0.73 	100.0 	1.26 	Using index condition; Using where
2 	1 	SIMPLE 	actor 	eq_ref 	PRIMARY 	PRIMARY 	8 	enwiki.revision.rev_actor 	1 	1.00 	100.0 	100.00

run time: 18.12243938446045

Since rev_timestamp is stored as a string in the database, that doesn't make a difference here. What you're seeing is a warmed-up InnoDB buffer cache, among other things.

I did some more testing and adding/removing the quotes seems to consistently change the execution plan, and filtering on the title first seems to consistently be faster than filtering on the timestamp first. Can you help me understand a case where this isn't so?

The problem here is quite obvious. It queries the page table first and, as a result, it scans all 70M page titles to find the ones matching that pattern and only then checks their creation date. It has no condition on actor at all. That needs to be fixed. If that doesn't solve the problem, we need to make sure the revision query uses the rev_actor_timestamp index and that it's queried first.

Adding a condition on actor (via rev_actor = ) makes it take 0.05s in most cases. If you need to support searching by page title alone, for any actor (I don't know if that's a valid use case; I've never used it), then doing a straight join reduces the pain, but I highly discourage allowing a wildcard page match without forcing an actor.
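
(To illustrate the two options being described, as a sketch only; 12345 below is a placeholder actor_id, not a real value:)

-- Sketch only: with an actor condition, the optimizer can anchor on an actor-based index.
SELECT DISTINCT page_title, page_namespace, actor_name
FROM revision
JOIN actor ON actor_id = rev_actor
JOIN page ON page_id = rev_page
WHERE rev_parent_id = 0
  AND rev_actor = 12345
  AND rev_timestamp > '20241103170506'
  AND page_title LIKE '%(video`_game)' ESCAPE '`'
ORDER BY rev_timestamp DESC
LIMIT 500;

-- Sketch only: if an actor-less wildcard search must be kept, STRAIGHT_JOIN pins the
-- join order so the optimizer cannot start from a full scan of page.
SELECT DISTINCT page_title, page_namespace, actor_name
FROM revision
STRAIGHT_JOIN page ON page_id = rev_page
STRAIGHT_JOIN actor ON actor_id = rev_actor
WHERE rev_parent_id = 0
  AND rev_timestamp > '20241103170506'
  AND page_title LIKE '%(video`_game)' ESCAPE '`'
ORDER BY rev_timestamp DESC
LIMIT 500;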

Of course; I let myself get stuck on this quotation difference.

thanks!

Also, I checked the original query that used the rc table to get a baseline on this:

ANALYZE
SELECT DISTINCT page_title,page_namespace,actor_name, rc_actor
FROM recentchanges JOIN actor ON ((actor_id=rc_actor))
JOIN page ON ((page_id=rc_cur_id))
WHERE ((rc_source='mw.new' OR (rc_log_type='upload' OR rc_log_action='upload'))
AND page_title LIKE '%(board_game)')
ORDER BY rc_timestamp DESC LIMIT 500

	id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	r_rows 	filtered 	r_filtered 	Extra
0 	1 	SIMPLE 	recentchanges 	index 	rc_cur_id,rc_actor 	rc_timestamp 	14 	None 	11098118 	11066830.00 	100.0 	1.75 	Using where; Using temporary
1 	1 	SIMPLE 	page 	eq_ref 	PRIMARY 	PRIMARY 	4 	enwiki.recentchanges.rc_cur_id 	1 	0.99 	100.0 	0.00 	Using where
2 	1 	SIMPLE 	actor 	eq_ref 	PRIMARY 	PRIMARY 	8 	enwiki.recentchanges.rc_actor 	1 	1.00 	100.0 	100.00

q1 run time: 18.42665696144104

Chlod changed the task status from In Progress to Stalled. Dec 10 2024, 6:30 AM

Per project check-in, currently waiting on some query performance benchmarks before proceeding with this.

Thanks for the help here! For now we're aiming to continue supporting wildcard page matching, though we're going to investigate some data (T375476) to see how widely used this is.

If you need to keep it, you could have the actor-less queries run against the recentchanges table, and have the ones with a condition on actor cover the larger period of time. That way we can be sure it doesn't end up on the pathological path.
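
(As a sketch of what that split would mean in query terms, reusing the recentchanges query shape from the baseline above; only the routing decision is new here:)

-- Sketch only: route pattern-only searches to recentchanges (naturally bounded by RC retention);
-- actor-targeted searches keep using the revision-based query with rev_actor set.
SELECT DISTINCT page_title, page_namespace, actor_name
FROM recentchanges
JOIN actor ON actor_id = rc_actor
JOIN page ON page_id = rc_cur_id
WHERE (rc_source = 'mw.new' OR rc_log_type = 'upload' OR rc_log_action = 'upload')
  AND page_title LIKE '%(board_game)'
ORDER BY rc_timestamp DESC
LIMIT 500;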

I think that would be confusing as-is, because we'd have different time periods depending on the filter you used. If we were going to do this, we might want to split the two search options more concretely to communicate that this is a different path 🤔

This can be further discussed in terms of design and technical implementation, but we agreed that it makes sense to follow Amir's suggestion and limit ourselves to 30 days when SQL LIKE is used, and have maxage apply otherwise.

We will want to add a descriptive element to the SQL LIKE field, explaining that using it will limit queries to the last 30 days for performance reasons.

Thanks! Just to be clear, the SQL LIKE part can still be used for mass deletion over large time spans as long as the creator is also set. MariaDB should be smart enough to first narrow down to pages created by user X and then apply the LIKE statement. The only pathological code path is when SQL LIKE is used but no creator is set.

@Chlod As discussed here is the mock with the added description "Due to performance reasons using this filter without a username or IP address will limit the results to the last 30 days ". Hoping to use a tooltip (T338282) for this in the future.

Design
Empty filters.png (1×1 px, 66 KB)
jsn.sherman changed the task status from Stalled to Open. Dec 19 2024, 2:14 PM

unstalling since we now have a path forward

I can't seem to see the file here; it might not have been attached to this task. While I wait for everyone to come back from the holiday break, I'll assume this is supposed to show below the "SQL LIKE" input field.

@Chlod
Sorry about this! You should be able to see the file now.

Thanks, Olga! It seems like OOUI places help messages below the input field rather than above, so for now it'll look like this instead.

Screenshot 2024-12-27 at 23-06-08 Mass delete - mwdev.png (392×743 px, 18 KB)

I've also discovered a possible edge case here: $wgNukeMaxAge being set to something smaller than $wgRCMaxAge while recentchanges queries are being performed. I've made it follow the smaller $wgNukeMaxAge in this case, and the message disappears accordingly. Realistically, we shouldn't hit this in WMF production, but I was looking ahead in case third-party wikis configure it like this.

Change #1107033 had a related patch set uploaded (by Chlod Alejandro; author: Chlod Alejandro):

[mediawiki/extensions/Nuke@master] Switch to recentchanges in pattern-only search

https://gerrit.wikimedia.org/r/1107033

Test wiki created on Patch demo by SCardenas (WMF) using patch(es) linked to this task:
http://patchdemo.wmcloud.org/wikis/0ea73da36f/w/

I have been testing this locally, and I am getting unexpected behavior:

  1. Set $wgRCMaxAge = 86400 * 10; (ten days) in LocalSettings.php, and set $wgNukeMaxAge = 86400 * 91; (higher than the recent changes limit).
  2. Check that Special:RecentChanges only shows changes made in the last 10 days.
  3. Navigate to Special:Nuke and search for a page that you know was created more than 10 days ago.
  4. The page shows up in the results.

Screenshot 2025-01-02 at 18.48.15.png (28×471 px, 10 KB)

Screenshot 2025-01-02 at 18.48.30.png (481×959 px, 61 KB)

The RC purge happens at random, based on edits or maintenance scripts. You should either run PurgeRecentChanges or make enough edits that RecentChangesUpdateJob::newPurgeJob() gets queued (see this section in the RecentChanges class):

		// Flush old entries from the `recentchanges` table
		if ( mt_rand( 0, 9 ) == 0 ) {
			$jobs[] = RecentChangesUpdateJob::newPurgeJob();
		}

This is called in RecentChanges::save().
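
(A quick way to confirm whether the purge has actually run, as a sketch assuming direct database access:)

-- Sketch only: if this is older than $wgRCMaxAge allows, the purge job simply hasn't run yet.
SELECT MIN(rc_timestamp) FROM recentchanges;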

Got it, so this happening on a production wiki is very unlikely.

Change #1107033 merged by jenkins-bot:

[mediawiki/extensions/Nuke@master] Switch to recentchanges in pattern-only search

https://gerrit.wikimedia.org/r/1107033

Test wiki on Patch demo by SCardenas (WMF) using patch(es) linked to this task was deleted:

http://patchdemo.wmcloud.org/wikis/0ea73da36f/w/

Change #1111350 had a related patch set uploaded (by Chlod Alejandro; author: Chlod Alejandro):

[operations/mediawiki-config@master] Increase Nuke max age to 90 days (attempt 2)

https://gerrit.wikimedia.org/r/1111350

Moving this back to "To Triage" on the User-notice board. We're planning to increase the max age again after the train for this week rolls out (in the UTC late backport window for Thursday).

The original post added by Wargo in Tech/News/2024/50 was as follows:

Administrators can delete multiple pages created at once using [[mw:Extension:Nuke|Extension:Nuke]]. It allowed to choose from pages created by selected user in the last 30 days. Now it finds pages to delete from last 90 days. [[phab:T380846]]

With some changes for grammar and clarity:

Administrators can mass-delete multiple pages created by a user or IP address using [[mw:Extension:Nuke|Extension:Nuke]]. It previously allowed deletion of pages created by a user or IP address only in the last 30 days. Now it can delete pages from the last 90 days. [[phab:T380846]]

Currently, our new implementation means the "last 90 days" is only applicable for queries that have a specific target (be it a user, IP address, or temporary account). 30 days will still be the limit for title-based queries with no specific target. Since the old post doesn't even mention the case that gets limited to 30 days, I assume no changes are needed for this post? Should we also mention plans to increase this further per T379147#10356748?

Perhaps we could add a brief note about the restriction (with an additional minor copyedit):

Administrators can mass-delete multiple pages created by a user or IP address using [[mw:Extension:Nuke|Extension:Nuke]]. It previously only allowed deletion of pages created in the last 30 days. It can now delete pages from the last 90 days, provided it is targeting a specific user or IP address. [[phab:T380846]]

For Tech News, that content seems good. Timing-wise, I assume that entry would be best placed in the following week's edition (or later) -- i.e., IIUC, the change won't be live until the latest patch is merged (and perhaps also until the deployment train rolls it out, if not backported), so the information won't be accurate for editors until at least a week from today. -- In a nutshell, you can: just tell me when it's ready for inclusion, add the text directly to the appropriate week's edition onwiki, or move the task in the workboard to "Announce in next". Thanks!

Change #1111350 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase Nuke max age to 90 days (attempt 2)

https://gerrit.wikimedia.org/r/1111350

Mentioned in SAL (#wikimedia-operations) [2025-01-16T21:04:24Z] <jsn@deploy2002> Started scap sync-world: Backport for [[gerrit:1111350|Increase Nuke max age to 90 days (attempt 2) (T380846)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-16T21:08:46Z] <jsn@deploy2002> jsn, chlod: Backport for [[gerrit:1111350|Increase Nuke max age to 90 days (attempt 2) (T380846)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-16T21:17:36Z] <jsn@deploy2002> Finished scap sync-world: Backport for [[gerrit:1111350|Increase Nuke max age to 90 days (attempt 2) (T380846)]] (duration: 13m 12s)

@Quiddity The underlying code change for this task was part of the train that went out today, and we just got the config change deployed, so it should be live on all wikis. This should be good for inclusion in the next Tech/News. :)

Leaving in QA for now. I was still able to trigger a DB timeout error when doing a title query with the 90-day limit; the initial query threw the error, but subsequent queries worked, presumably due to caching. We should retest with a cold cache at 60 and 30 days to verify whether the error is due to the longer timespan. We realized that we didn't actually have a known error-free result before expanding the time range.

It is possible to perform title queries that result in a database timeout; however, the behavior is the same at 60 and 30 days. This means the issue was already there and is not a regression from this task. Will file a bug report and mention this task before closing.