SQL _ ランク付け:MySQL, mariaDB

PostreSQL や Oracle , mysql8.1 には、ランク関数なるものがあるようだが、MySQL５までにはないので、ランク付けのために必要なSQL文を概観してみた。[2015-05-09]
ここのサイトではmysql5なので、そのままにしていたが、ようやくPCも買い換えて、mariaDB 10.4を使ってwindow関数(rank関数など)を使ってみたので、こつを解説してみる。mysql 8.1 も使い方はほぼ同じようだ。

まずはテーブル定義

以降のSQLに使うテストデータとして、テーブルを定義します。順位をつけたいデータカラム（数値または、文字列、日付などsort可能なデータ型であればよい、以下ではpoint int型）と、ある行を特定するid または、表示用テキストカラム（name）、後半の、カテゴリー別順位につかうcategoryカラムを定義します。

create table  rank_data (
 id int auto_increment not null primary key
, name varchar(100)
, category varchar(50)
, point int
)engine=myisam charset=utf8  ;

insert into rank_data values (null,'test1','aa',12) , (null,'m1','bb',34),(null,'cc','cc',24)
, (null,'test2','aa',22) , (null,'m2','bb',33),(null,'cc2','cc',44)
,(null,'test3','aa',11) , (null,'m3','bb',54),(null,'m4','bb',14)
,(null,'test4','aa',42) , (null,'m5','bb',55),(null,'bb6','bb',64)
;

自己結合の直積（デカルト積）を使う

self cross join : 自己結合の直積（デカルト積）のなかで、上位の行数を数える
point の高い方を上位（降順）とすると、以下のようになる
この場合、r1.point <= r2.point　の等号をとってしまうと、最上位のr1.point が得られなくなってしまう。また、同順があった場合、それも、ランク数にカウントされるので、下位のランクで同順となる。
```
select r1.id as id, r1.point as p, count(r2.point) as rank
 from rank_data as r1 cross join rank_data as r2
  where  r1.point <= r2.point 
  group by r1.id , r1.point  
  order by rank  ;
  --  同順が1つに数えられてしまうので、id でのgroup by必要 
  --  また、mysqlでは、idにたいして、point が一意に決まるなら、group byにr1.point カラム不要
```
しかし、行数が多くなると、直積をとってから、条件抽出するのは、無駄も多いので、left join で 1:多結合を行う。最近のMySQLでは、cross joinでも、条件節やgroup by を鑑みてうまく最適化しているようなので、以下とさほど実行時間差はないようである。
さらに、不等号のみとしても、r1については全行抽出され、r2側カラムでcount()をとれば、同順は上位にランク付けされる。また、トップの行は、r2側がnull となるが、count(r2.point) では、０なので、+1 として１番からのランク付けとする
```
select r1.id as id, r1.point as p, count(r2.point)+1 as rank
 from rank_data as r1 left join rank_data as r2
  on  r1.point < r2.point 
  group by r1.id 
  order by rank  ;
  --  同順が1つに数えられてしまうので、id でのgroup by 必要 
```
※上記でわかることは、join 方式では、行を特定するカラムが必須である。
相関サブクエリを使う

相関サブクエリを使うときは、スカラークエリとして、select句に記述します。
```
select * 
, ( select count(*)+1 from rank_data r2 where  r1.point < r2.point ) as rank
  from rank_data r1
  order by  rank  ;
```
※pointカラムにもindexがあると、高速化も望めるかもしれないが、使えるindex がないとMySQLでは、相関サブクエリは非常に遅くなってしまう。

変数使用し、整列順に連番をつける

先に、set構文で変数に 0 を代入しておいてもよいのだが、一SQL文で行う方法を紹介する。
この方法は、QandA サイトの回答　を　参考にしています。
from句のテーブルサブクエリは、一番最初に一回だけ実行されることから、ここで変数を初期化する
さらに、テーブルサブクエリをテーブルとしたサブクエリで、変数のカウントアップをする。
連番なので、同順のものも、見つけた順に順位が下がっていくので、とりあえず、毎回同じとなるよう、id も使って順位をそろえる。
よって、同順を同じにして、順位値に間をあけたりはできない。

select * 
, ( select @a:=@a+1 from (select @a:=0) as dummy ) as rank
  from rank_data 
  order by  point desc , id  ;
  --  0 からの連番にするときは、 @a:=0 の代入値を -1 とする。

explain 結果
+----+----------------------+------------+--------+---------------+------+---------+------+------+----------------+
| id | select_type          | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+----------------------+------------+--------+---------------+------+---------+------+------+----------------+
|  1 | PRIMARY              | rank_data  | ALL    | NULL          | NULL | NULL    | NULL |   14 | Using filesort |
|  2 | UNCACHEABLE SUBQUERY | <derived3> | system | NULL          | NULL | NULL    | NULL |    1 | NULL           |
|  3 | DERIVED              | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used |
+----+----------------------+------------+--------+---------------+------+---------+------+------+----------------+

※　これを思いついた人に脱帽です。

カテゴリー別に順位をつけたい: 下準備 index 作成

順位付けカラムや、カテゴリー分けカラムにindex がないと、joinするにしても、相関サブクエリにしても、時間のかかる作業になるので、index を作ります。
カテゴリーに関係なく全体でランキングをとることが多ければ、ランキング用カラムのみのindex を作り、カテゴリー別にすることが多ければ、連動indexとします。どのindexが実際に使われるかを、explain で調査したうえで、決定するとよいでしょう。
```
alter table `rank_data` add index pt (point); 
alter table `rank_data` add index cate_pt (category,point); 
```

カテゴリー別に順位をつけたい: left join

select r1.id as id, r1.category as category, r1.point as p, count(r2.point)+1 as rank
 from rank_data as r1 left join rank_data as r2
   on  r1.point < r2.point and r1.category=r2.category
 group by r1.category , r1.id 
 order by r1.category, rank  ;

explain 結果
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
|  1 | SIMPLE      | r1    | ALL  | PRIMARY       | NULL | NULL    | NULL |   14 | Using temporary; Using filesort                |
|  1 | SIMPLE      | r2    | ALL  | pt,cate_pt    | NULL | NULL    | NULL |   14 | Range checked for each record (index map: 0x6) |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+

カテゴリー別に順位をつけたい: subquery

select * 
, ( select count(*)+1 from rank_data r2 where  r1.point < r2.point and r1.category=r2.category ) as rank
  from rank_data r1
  order by category, rank ;

explain 結果
+----+--------------------+-------+------+---------------+---------+---------+------------------+------+---------------------------------+
| id | select_type        | table | type | possible_keys | key     | key_len | ref              | rows | Extra                           |
+----+--------------------+-------+------+---------------+---------+---------+------------------+------+---------------------------------+
|  1 | PRIMARY            | r1    | ALL  | NULL          | NULL    | NULL    | NULL             |   14 | Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | r2    | ref  | pt,cate_pt    | cate_pt | 33      | test.r1.category |    2 | Using where; Using index        |
+----+--------------------+-------+------+---------------+---------+---------+------------------+------+---------------------------------+

※　上記explain 結果は、MySQL5.6での実行結果です。where 句に出現するカラムが入っているindex は、かなり有効利用されるようです。
joinの場合も相関クエリの場合も「Using temporary; Using filesort」とあるので、どちらが有利とも言い難いので、あとは、実際のデータ量や、limit 行数などにもよるでしょう。

カテゴリー別に上位３位までを抽出

自己結合を使う場合

同順は、上位側に数えて、上位３位まで（３位で同順が２行あれば、合計４行の抽出）として、上位のみをカテゴリー別に得るには、自己結合では、having で条件を追加する

select r1.id as id, r1.category as category, r1.point as p, count(r2.point)+1 as rank
 from rank_data as r1 left join rank_data as r2
   on  r1.point < r2.point and r1.category=r2.category
 group by r1.category , r1.id 
 having  rank <= 3
 order by r1.category, rank  ;

相関サブクエリを使う場合

相関サブクエリでランクを求めている場合には、さらにそのテーブルをテーブルサブクエリとして、条件 rank <=3 とする。

select *
 from ( select * 
      , ( select count(*)+1 from rank_data r2 where  r1.point < r2.point and r1.category=r2.category 
      ) as rank
      from rank_data r1
  ) as temp
 where rank<=3
 order by category,rank  ;


explain 結果
+----+--------------------+------------+------+---------------+---------+---------+------------------+------+-----------------------------+
| id | select_type        | table      | type | possible_keys | key     | key_len | ref              | rows | Extra                       |
+----+--------------------+------------+------+---------------+---------+---------+------------------+------+-----------------------------+
|  1 | PRIMARY            | <derived2> | ALL  | NULL          | NULL    | NULL    | NULL             |   14 | Using where; Using filesort |
|  2 | DERIVED            | r1         | ALL  | NULL          | NULL    | NULL    | NULL             |   14 | NULL                        |
|  3 | DEPENDENT SUBQUERY | r2         | ref  | pt,cate_pt    | cate_pt | 33      | test.r1.category |    2 | Using where; Using index    |
+----+--------------------+------------+------+---------------+---------+---------+------------------+------+-----------------------------+
3 rows in set (0.00 sec)

※　この多重サブクエリでは、テンポラリーに書き出しているのではないかとも思ったが、r1 のデータがほとんどそのまま引き継がれているので、わざわざ書き出してはいないらしい。filesortで書き出しが発生するのみ。

exists を使って、必要行のみ抽出

select * 
      , ( select count(*)+1 from rank_data r3 where  r1.point < r3.point and r1.category=r3.category 
      ) as rank
from rank_data r1
where  exists
   ( select * from rank_data  r2  
     where  r1.point < r2.point  and  r1.category=r2.category
     having  count(*) < 3
   )
order by category, point desc  ;

explain 結果
+----+--------------------+-------+------+---------------+---------+---------+------------------+------+-----------------------------+
| id | select_type        | table | type | possible_keys | key     | key_len | ref              | rows | Extra                       |
+----+--------------------+-------+------+---------------+---------+---------+------------------+------+-----------------------------+
|  1 | PRIMARY            | r1    | ALL  | NULL          | NULL    | NULL    | NULL             |   14 | Using where; Using filesort |
|  3 | DEPENDENT SUBQUERY | r2    | ref  | pt,cate_pt    | cate_pt | 33      | test.r1.category |    2 | Using where; Using index    |
|  2 | DEPENDENT SUBQUERY | r3    | ref  | pt,cate_pt    | cate_pt | 33      | test.r1.category |    2 | Using where; Using index    |
+----+--------------------+-------+------+---------------+---------+---------+------------------+------+-----------------------------+

※　ランク値も取り出そうと思うと、ほぼ似たようなサブクエリを２回実行することになり、逆に冗長かもしれない。「ランク値不要で、カテゴリー別に上位３位の行データのみ」という用件なら、上記の rankカラム用のスカラーサブクエリを削ると結構高速になりそうではある。

mariaDB 10における window frame , rank(), row_number()

以下はmariaDB 10.4での出力例。

nameカラムの整列順に連番

rank() や row_number() には over による整列カラムの指定は必須のよう。rank(name) のようには書けない。
そして全体に対するorder by句がなくても、over でorder byを指定するとその順で出力されます。というより、よけいなfilesortが発生しないように、全体に対するorder by句はない方がよいようです。

select * ,  row_number() over( order by name )  as number
 from rank_data ;

MariaDB [test]> explain
    -> select * , row_number() over( order by category,point ) as num  from rank_data  ;
+------+-------------+-----------+------+---------------+------+---------+------+------+-----------------+
| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows | Extra           |
+------+-------------+-----------+------+---------------+------+---------+------+------+-----------------+
|    1 | SIMPLE      | rank_data | ALL  | NULL          | NULL | NULL    | NULL | 15   | Using temporary |
+------+-------------+-----------+------+---------------+------+---------+------+------+-----------------+

MariaDB [test]> explain
    -> select * , row_number() over( order by category,point ) as num  from rank_data
    ->  order by category,point  ;
+------+-------------+-----------+------+---------------+------+---------+------+------+---------------------------------+
| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+------+-------------+-----------+------+---------------+------+---------+------+------+---------------------------------+
|    1 | SIMPLE      | rank_data | ALL  | NULL          | NULL | NULL    | NULL | 15   | Using temporary; Using filesort |
+------+-------------+-----------+------+---------------+------+---------+------+------+---------------------------------+

index設定のあるカラムでover(order by)してみたのだが、rank関数実行でindex無効となり、さらに全体のorder by でfilesort されてしまう！実際には、最後にorder byつけなくてもoverで指定した順に並んで出力される。
マニュアルみてて、window関数の例文には最後にorder byついてないなとは思っていたが、そういうことだったのか。
overの整列指定が異なるカラムを複数つくると、もっとも単純な指定の順序が出力に選ばれるようです（カラム記述順は関係なかった）。複数あるときは、状況によって、最後にorder byで出力順の指定を行った方がよいでしょう。

カテゴリー別ランクや、連番

層別化には partition byで指定し、大きい方から並べるには、order byに指定するカラムに desc を指定します。
複数のwindow関数で同一の順序付けを指定するときは、window frame にエイリアスを指定すると楽です。select句も冗長にならずに見やすくなります。
window句は group by , having より後で、order by句の手前に入れます。

select * , rank() over w as rank , row_number() over w as number 
 from rank_data 
 window w as (partition by category order by  point desc)  ;
--  window句での複数エイリアス指定。この場合の出力順は、記述の順序にはよらず層別化の単純なものが採用されるようだ。
select * , rank() over w as rank , row_number() over w as number , row_number() over w2 as n2
 from rank_data 
 window w as (partition by category order by  point desc) 
 , w2 as ( order by point ) ;

カテゴリー別に上位３位までを抽出: with
where句やhaving句でwindow関数やwindow関数利用カラムのエイリアスは使えないので、ランク付けしたサブクエリテーブルに対して条件指定が必要です。
withによるサブクエリテーブル宣言を使うとサブクエリを本体の外に出せるので、全体の見通しがよくなります。一時view作成と同じことを一回のsql文発行で行える利点があります。
rank関数で既にsort済みなので、あらためて本体selectにorder by句を指定する必要はありません。指定すると、そのままの並び順であってもさらなるfilesortが実行されます。
```
with t as ( 
  select * , rank() over (partition by category order by point desc) as rank 
  from rank_data 
)
select * from t where rank <=3 ;
```

カテゴリー別集計のpoint合計でランク付け: with とjoin

window frame を宣言する場合は、group by , having の後方に置くこととなっており、over([partition by ..] order by ..)には集計関数の count, sum, avg, group_concat, bit_or, bit_and, bit_xor を指定できます。
ただし、count(distinct ..)はつかえないようです。
これらは、select句でwindow関数に直接over(([partition by ..] order by ..)で指定することもできます。集計関数につけたエイリアスも使用できます。 window関数と集計関数の記載順とは関係なくエイリアス使用できるようです。

--  ex1
select category , sum(point) as sum_point , rank() over w as rank
   from rank_data  group by category
   window w over( order by  sum_point desc)  ;

-- ex2
select category, rank() over (order by  sum_point desc) as rank, sum(point) as sum_point
   from rank_data group by category  ;

-- ex3  集計ランク上位3位までのカテゴリーについて全行表示、joinすると全体へのorder by句がないと t 側の順序は保持されない。
with t as ( select category, rank() over (order by  sum(point) desc) as rank 
   from rank_data group by category
)
select r1.*  , t.rank
   from rank_data as r1 join t on r1.category = t.category
   where t.rank < 4  
   order by t.rank, r1.point ;

MariaDB [test]> explain
    -> with t as ( select category, rank() over(order by sum(point) desc) as rank from rank_data group by category )
    -> select *  from rank_data as r1 join t on r1.category=t.category and t.rank < 4
    -> order by t.rank, r1.point ;
+------+-------------+------------+-------+---------------+---------+---------+------------------+------+----------------------------------------------+
| id   | select_type | table      | type  | possible_keys | key     | key_len | ref              | rows | Extra                                        |
+------+-------------+------------+-------+---------------+---------+---------+------------------+------+----------------------------------------------+
|    1 | PRIMARY     | r1         | ALL   | cate_pt       | NULL    | NULL    | NULL             | 15   | Using where; Using temporary; Using filesort |
|    1 | PRIMARY     | <derived2> | ref   | key0          | key0    | 153     | test.r1.category | 2    | Using where                                  |
|    2 | DERIVED     | rank_data  | index | NULL          | cate_pt | 158     | NULL             | 15   | Using index; Using temporary                 |
+------+-------------+------------+-------+---------------+---------+---------+------------------+------+----------------------------------------------+

--  上記をjson記法で出力、情報多いようで、構造がわかりにくいですね、なれればいろいろ利用できるのでしょうけど。※は私のコメント
MariaDB [test]> explain format="json"
    -> with t as  .. 以降上記と同じ
+----------------------------------------------------------|
| EXPLAIN                                                  |
+----------------------------------------------------------|
| {
  "query_block": {　　　　　　　　　　　　　　　　※　メインのselect文
    "select_id": 1,
    "filesort": {
      "sort_key": "t.rank, r1.`point`",
      "temporary_table": {
        "table": {
          "table_name": "r1", 　　　　　　　　　　※ rank_data as r1 
          "access_type": "ALL",
          "possible_keys": ["cate_pt"],
          "rows": 15,
          "filtered": 100,
          "attached_condition": "r1.category is not null"
        },
        "table": {
          "table_name": "<derived2>", 　　　　　　※ join t テーブルについて
          "access_type": "ref",
          "possible_keys": ["key0"],
          "key": "key0",
          "key_length": "153",
          "used_key_parts": ["category"],
          "ref": ["test.r1.category"],
          "rows": 2,
          "filtered": 100,
          "attached_condition": "t.rank < 4",
          "materialized": {
            "query_block": {　　　　　　　　　　　※ with 宣言のselect文
              "select_id": 2,
              "window_functions_computation": {
                "sorts": {
                  "filesort": {
                    "sort_key": "sum(rank_data.`point`) desc"
                  }
                },
                "temporary_table": {　　　　　　　※ with 宣言のテーブル内容
                  "table": {
                    "table_name": "rank_data",
                    "access_type": "index",
                    "key": "cate_pt",
                    "key_length": "158",
                    "used_key_parts": ["category", "point"],
                    "rows": 15,
                    "filtered": 100,
                    "using_index": true
                  }
                }
              }
            }
          }
        }
      }
    }
  }
} |
+----------------------------------------------------------|

[設置日 2015-05-09] : 初回設置 [2020-07-05] : mysql8.1(以降) , mariaDB10.2(以降) でのrank関数 [最終更新日　2020-07-19] : rank関数修正追加