
Tag autocompletion should include aliased tags

Posted under Bugs & Features

Type-kun said:

Huh, it affects performance? What the hell, it shouldn't affect anything; it simply adds another filter on the joined table, like (tags.name NOT LIKE 's%' ESCAPE E'\\')... tag_aliases.post_count works, except in cases where a new tag is added along with an alias and a job hasn't synchronized the counts yet; it's better not to add it, I guess. Out of curiosity, could you post the explain for the query with tags.post_count > 0?


Query:

SELECT DISTINCT ON (name, post_count) * FROM
(
  (
    SELECT tags.name, tags.post_count, tags.category, null AS antecedent_name FROM "tags"
    WHERE (name LIKE 's%' ESCAPE E'\\')
    AND (post_count > 0)
    ORDER BY post_count desc LIMIT 10
  )
  UNION ALL
  (
    SELECT tags.name, tags.post_count, tags.category, tag_aliases.antecedent_name FROM "tag_aliases"
    INNER JOIN tags ON tags.name = tag_aliases.consequent_name
    WHERE (tag_aliases.antecedent_name LIKE 's%' ESCAPE E'\\')
    AND (status IN ('active','processing'))
    AND (tags.name NOT LIKE 's%' ESCAPE E'\\')
    AND (tags.post_count > 0)
    ORDER BY tag_aliases.post_count desc LIMIT 20
  )
) AS unioned_query 
ORDER BY post_count desc LIMIT 10

Explain:

Limit  (cost=582.32..582.39 rows=10 width=38)
  ->  Unique  (cost=582.32..582.54 rows=30 width=38)
        ->  Sort  (cost=582.32..582.39 rows=30 width=38)
              Sort Key: public.tags.post_count, public.tags.name
              ->  Result  (cost=0.00..581.58 rows=30 width=38)
                    ->  Append  (cost=0.00..581.58 rows=30 width=38)
                          ->  Limit  (cost=0.00..17.39 rows=10 width=19)
                                ->  Index Scan Backward using index_tags_on_post_count on tags  (cost=0.00..32012.63 rows=18406 width=19)
                                      Index Cond: (post_count > 0)
                                      Filter: (name ~~ 's%'::text)
                          ->  Subquery Scan on *SELECT* 2  (cost=0.00..564.09 rows=20 width=32)
                                ->  Limit  (cost=0.00..563.89 rows=20 width=36)
                                      ->  Nested Loop  (cost=0.00..8373.78 rows=297 width=36)
                                            ->  Index Scan Backward using index_tag_aliases_on_post_count on tag_aliases  (cost=0.00..1588.92 rows=881 width=31)
                                                  Filter: ((antecedent_name ~~ 's%'::text) AND ((status)::text = ANY ('{active,processing}'::text[])))
                                            ->  Index Scan using index_tags_on_name on tags  (cost=0.00..7.69 rows=1 width=19)
                                                  Index Cond: (name = (tag_aliases.consequent_name)::text)
                                                  Filter: ((name !~~ 's%'::text) AND (post_count > 0))

Type-kun said:

Also note the other changes: [...] selecting the actual tags.count instead of the synchronized one.

That's fine if we're keeping the DISTINCT in; it has no effect on performance in that case. But if we're taking the DISTINCT out, then selecting tags.count removes the performance benefit we gained by removing the DISTINCT:


Query:

SELECT * FROM
(
  (
    SELECT tags.name, tags.post_count, tags.category, null AS antecedent_name FROM "tags"
    WHERE (name LIKE 's%' ESCAPE E'\\')
    AND (post_count > 0)
    ORDER BY post_count desc LIMIT 10
  )
  UNION ALL
  (
    SELECT tags.name, tags.post_count, tags.category, tag_aliases.antecedent_name FROM "tag_aliases"
    INNER JOIN tags ON tags.name = tag_aliases.consequent_name
    WHERE (tag_aliases.antecedent_name LIKE 's%' ESCAPE E'\\')
    AND (status IN ('active','processing'))
    AND (tags.name NOT LIKE 's%' ESCAPE E'\\')
    AND (tag_aliases.post_count > 0)
    ORDER BY tag_aliases.post_count desc LIMIT 20
  )
) AS unioned_query 
ORDER BY post_count desc LIMIT 10

Explain:

Limit  (cost=215.67..221.64 rows=10 width=38)
  ->  Result  (cost=215.67..233.59 rows=30 width=38)
        ->  Merge Append  (cost=215.67..233.59 rows=30 width=38)
              Sort Key: public.tags.post_count
              ->  Limit  (cost=0.00..17.39 rows=10 width=19)
                    ->  Index Scan Backward using index_tags_on_post_count on tags  (cost=0.00..32012.63 rows=18406 width=19)
                          Index Cond: (post_count > 0)
                          Filter: (name ~~ 's%'::text)
              ->  Sort  (cost=215.66..215.71 rows=20 width=32)
                    Sort Key: *SELECT* 2.post_count
                    ->  Subquery Scan on *SELECT* 2  (cost=0.00..215.23 rows=20 width=32)
                          ->  Limit  (cost=0.00..215.03 rows=20 width=36)
                                ->  Nested Loop  (cost=0.00..8310.76 rows=773 width=36)
                                      ->  Index Scan Backward using index_tag_aliases_on_post_count on tag_aliases  (cost=0.00..1604.38 rows=870 width=31)
                                            Index Cond: (post_count > 0)
                                            Filter: ((antecedent_name ~~ 's%'::text) AND ((status)::text = ANY ('{active,processing}'::text[])))
                                      ->  Index Scan using index_tags_on_name on tags  (cost=0.00..7.70 rows=1 width=19)
                                            Index Cond: (name = (tag_aliases.consequent_name)::text)
                                            Filter: (name !~~ 's%'::text)

Guh, I give up. It's impossible to tune it further remotely; I'd have to populate the database and experiment myself. PostgreSQL uses different optimization algorithms than Oracle: taking out the DISTINCT would never reduce the cost there. Also, it looks like PostgreSQL makes very careful row estimations depending on the literals used; I suspect the cost will go up and down, sometimes drastically, if a string other than 's%' is used.
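
If someone wants to check that, rerunning the explain on the first subquery with a rarer prefix should show the row estimate and cost shifting (just a sketch, any literal would do; 'zyx%' is arbitrary):

EXPLAIN
SELECT tags.name, tags.post_count, tags.category FROM "tags"
WHERE (name LIKE 'zyx%' ESCAPE E'\\')
AND (post_count > 0)
ORDER BY post_count desc LIMIT 10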

That's also probably why the cost went up in the first case of your last post: look at Nested Loop (cost=0.00..8373.78 rows=297 width=36) and then at Nested Loop (cost=0.00..8310.76 rows=773 width=36) below. The optimizer somehow decided that (tags.post_count > 0) would reduce the row count down to 297, while (tag_aliases.post_count > 0) would not. That's clearly not the case if these columns contain the same information, but that row estimate affects the limit clause's cost (limit cost = loop cost * limit / loop rows). It should be "LIMIT 10", by the way; that will cut the estimated cost in half.
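
For what it's worth, plugging the explains' own numbers back into that formula is a rough sanity check (assuming the planner really costs the Limit node that way):

SELECT round(8373.78 * 20 / 297, 2);  -- ≈ 563.89, the Limit node cost in the first plan
SELECT round(8310.76 * 20 / 773, 2);  -- ≈ 215.03, the Limit node cost in the second plan
SELECT round(8310.76 * 10 / 773, 2);  -- ≈ 107.51, roughly half, hence the LIMIT 10 remark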

Thinking about it further, it might be better not to include any post_count > 0 filters in the query at all, and to filter those rows out later in Ruby instead. The results will be ordered by count anyway, so we won't lose anything, although the query will fetch some additional 0-post results. This, however, will allow the query to terminate earlier (because of the limit) and reduce the estimated cost. Maybe the execution time, too.
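
Roughly what I mean (untested; just the previous query with both post_count filters dropped, with the 0-post rows discarded in Ruby after the fetch):

SELECT * FROM
(
  (
    SELECT tags.name, tags.post_count, tags.category, null AS antecedent_name FROM "tags"
    WHERE (name LIKE 's%' ESCAPE E'\\')
    ORDER BY post_count desc LIMIT 10
  )
  UNION ALL
  (
    SELECT tags.name, tags.post_count, tags.category, tag_aliases.antecedent_name FROM "tag_aliases"
    INNER JOIN tags ON tags.name = tag_aliases.consequent_name
    WHERE (tag_aliases.antecedent_name LIKE 's%' ESCAPE E'\\')
    AND (status IN ('active','processing'))
    AND (tags.name NOT LIKE 's%' ESCAPE E'\\')
    ORDER BY tag_aliases.post_count desc LIMIT 20
  )
) AS unioned_query
ORDER BY post_count desc LIMIT 10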

Either way, I'd say the cost of 500-ish is acceptable. How much time does it take to execute on your system?

Type-kun said:

It should be "LIMIT 10", by the way; that will cut the estimated cost in half.

I intentionally made the second subquery use LIMIT 20 even though at most 10 rows get used. If we selected only 10 rows, then even a single duplicate being filtered out on some queries such as /s% would result in only 9 rows being displayed in autocomplete, which is incorrect. If we select 20, then there would need to be 11 or more duplicates to cause a problem - very unlikely.

We could change it to something like 15, though.

Type-kun said:

Either way, I'd say the cost of 500-ish is acceptable. How much time does it take to execute on your system?

There's some variance but generally 200-300ms (for the whole ajax request to the server). On the current production site it seems to take around 150ms.

Toks said:

I intentionally made the second subquery use LIMIT 20 even though at most 10 rows get used. If we selected only 10 rows, then even a single duplicate being filtered out on some queries such as /s% would result in only 9 rows being displayed in autocomplete, which is incorrect. If we select 20, then there would need to be 11 or more duplicates to cause a problem - very unlikely.

We could change it to something like 15, though.

Hm, can't we just move the distinct clause to the tag_aliases subquery?

Type-kun said:

Hm, can't we just move the distinct clause to the tag_aliases subquery?


Query:

SELECT * FROM
(
  (
    SELECT tags.name, tags.post_count, tags.category, null AS antecedent_name FROM "tags"
    WHERE (name LIKE 's%' ESCAPE E'\\')
    AND (post_count > 0)
    ORDER BY post_count desc LIMIT 10
  )
  UNION ALL
  (
    SELECT DISTINCT ON (tags.name, tag_aliases.post_count) tags.name, tags.post_count, tags.category, tag_aliases.antecedent_name FROM "tag_aliases"
    INNER JOIN tags ON tags.name = tag_aliases.consequent_name
    WHERE (tag_aliases.antecedent_name LIKE 's%' ESCAPE E'\\')
    AND (status IN ('active','processing'))
    AND (tags.name NOT LIKE 's%' ESCAPE E'\\')
    AND (tag_aliases.post_count > 0)
    ORDER BY tag_aliases.post_count desc LIMIT 10
  )
) AS unioned_query 
ORDER BY post_count desc LIMIT 10

Explain:

Limit  (cost=7062.32..7071.21 rows=10 width=42)
  ->  Result  (cost=7062.32..7080.09 rows=20 width=42)
        ->  Merge Append  (cost=7062.32..7080.09 rows=20 width=42)
              Sort Key: public.tags.post_count
              ->  Limit  (cost=0.00..17.39 rows=10 width=19)
                    ->  Index Scan Backward using index_tags_on_post_count on tags  (cost=0.00..32012.63 rows=18406 width=19)
                          Index Cond: (post_count > 0)
                          Filter: (name ~~ 's%'::text)
              ->  Sort  (cost=7062.31..7062.34 rows=10 width=32)
                    Sort Key: *SELECT* 2.post_count
                    ->  Subquery Scan on *SELECT* 2  (cost=7061.97..7062.15 rows=10 width=32)
                          ->  Limit  (cost=7061.97..7062.05 rows=10 width=36)
                                ->  Unique  (cost=7061.97..7067.77 rows=773 width=36)
                                      ->  Sort  (cost=7061.97..7063.91 rows=773 width=36)
                                            Sort Key: tag_aliases.post_count, public.tags.name
                                            ->  Nested Loop  (cost=41.20..7024.89 rows=773 width=36)
                                                  ->  Bitmap Heap Scan on tag_aliases  (cost=41.20..318.48 rows=870 width=31)
                                                        Filter: ((antecedent_name ~~ 's%'::text) AND ((status)::text = ANY ('{active,processing}'::text[])) AND (post_count > 0))
                                                        ->  Bitmap Index Scan on index_tag_aliases_on_antecedent_name_pattern  (cost=0.00..40.98 rows=873 width=0)
                                                              Index Cond: ((antecedent_name ~>=~ 's'::text) AND (antecedent_name ~<~ 't'::text))
                                                  ->  Index Scan using index_tags_on_name on tags  (cost=0.00..7.70 rows=1 width=19)
                                                        Index Cond: (name = (tag_aliases.consequent_name)::text)
                                                        Filter: (name !~~ 's%'::text)

Big queries like s% take a large performance hit.

Small queries don't seem to be significantly affected either way: they don't get slower, but they also don't get faster from having the limit reduced from 20 to 10.

FYI, this was implemented*. Send your virtual beer to Toks and Albert for being awesome :3

* You might need to clear your browser's local storage for it to work on tags you've searched for recently. Or just wait a week until all cached entries have expired.

<3 Toks, Albert and also you, Type-kun.

Searchwanted said:

I believe it would also be nice to have the option not to see aliases in tag autocompletion.

-1, this would defeat the mistagging-prevention use case -- which is what motivated this thread in the first place.

I'm not going to implement a user setting to disable it just for the hell of it. Maybe if there are some examples where it's significantly inconvenient, but I can't think of what those would be.

Toks said:

I'm not going to implement a user setting to disable it just for the hell of it. Maybe if there are some examples where it's significantly inconvenient, but I can't think of what those would be.

I'm finding that with some tags, the alias has a higher count than the tag I used previously.

eg "dark(autocomplete)" used to get me dark_skin, now it gets dark_hair->black_hair.

Minor annoyance & just will take some adjusting around.

My (unscientific) observations from maintaining the winking -animated and peace_symbol use cases mentioned in the OP show that the rate of mistaggings dropped to about half within the first week, and they have almost completely ceased over the last week or so. I'm only finding the occasional stray case now, something like a couple per week, as opposed to a dozen every day or two.

One user had to be contacted by PM because of continued misuse of winking, but it turned out that they were using DanbooruUp, which remains outdated... I don't think it's being maintained anymore?

So, as far as my original primary motivation for asking for this feature goes, it has been a resounding success. Thanks again.

r0d3n7z said:

DanbooruUp, which remains outdated... I don't think it's being maintained anymore?

Yeah, it was updated two years ago to make it work with Danbooru 2, but it seems there haven't been any updates since then. And I haven't seen zatchii around, either.
