Count distinct based on another column

Question

Anonymous · 2023-01-11

I have the following table:

CREATE TABLE tbl (
  id          int NOT NULL
, date        date NOT NULL
, cid         int NOT NULL
, birth_place text NOT NULL
, location    text NOT NULL
);
INSERT INTO tbl VALUES
  (1 , '2022-01-01', 1, 'France' , 'Germany')
, (2 , '2022-01-30', 1, 'France' , 'France')
, (3 , '2022-01-25', 2, 'Spain'  , 'Spain')
, (4 , '2022-01-12', 3, 'France' , 'France')
, (5 , '2022-02-01', 4, 'England', 'Italy')
, (6 , '2022-02-12', 1, 'France' , 'France')
, (7 , '2022-03-05', 5, 'Spain'  , 'England')
, (8 , '2022-03-08', 2, 'Spain'  , 'Spain')
, (9 , '2022-03-15', 2, 'Spain'  , 'Spain')
, (10, '2022-03-30', 5, 'Spain'  , 'Italy')
, (11, '2022-03-22', 4, 'England', 'England')
, (12, '2022-03-22', 3, 'France' , 'England');

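I need to count distinct customers (cid) per month and location, with one special requirement:

If a customer returned to their birth place (location = birth_place) at any point in a given month, prefer that location. Otherwise, pick one location per month and customer.

My expected output: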

date         location   count
2022-01-01   France     2
2022-01-01   Spain      1
2022-02-01   Italy      1
2022-02-01   France     1
2022-03-01   Spain      1
2022-03-01   England    3

For 2022-01-01, cid 1 was at their birth place, and no other customer had Germany as a location in that period, so Germany does not appear in my expected output.

This is my current query:

with
  t as (
    select id, date_trunc('month', date)::date AS date, cid, birth_place, location
    from tbl),
  t1 as (
    select date, cid, location
    from t
    where birth_place = location),
  t2 as (
    select date, cid, location, row_number() over (partition by date, cid order by date) as row
    from t
    where birth_place <> location),
  t3 as (
    select t.*, 
        case
            when t1.location is not null then t1.location
            else t2.location
        end as new_loc
    from t
    left join t1
    on t.cid = t1.cid and t.date = t1.date
    left join t2
    on t.cid = t2.cid and t.date = t2.date and t2.row = 1)
select date, new_loc, count(distinct cid)
from t3
group by 1, 2
order by 1, 2

It works, but it seems inefficient for 100 million rows.

I am looking for a more efficient approach.

3 Answers

Anonymous · Answer 1 · 2023-06-26T00:05:18+00:00

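Cause of the problem:

The given SQL query uses a subquery (q1) that keeps only the rows meeting a specific condition. The main query then aggregates the subquery's result, computing the distinct cid count and the total row count for each year, month, and location. However, this approach only counts rows that meet the condition and ignores the rows that do not.

Solution:

To fix this, the query must also produce distinct cid counts for rows that do not meet the condition. One way is to LEFT JOIN the original table to the rows that meet the condition and use COALESCE to handle the unmatched rows. The modified query: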

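WITH q1 AS (
    -- Customers who were at their birth place, one row per (year, month, cid)
    SELECT DISTINCT
        EXTRACT( YEAR  FROM t."date" ) AS "Year",
        EXTRACT( MONTH FROM t."date" ) AS "Month",
        t.cid,
        t.location
    FROM
        tbl AS t
    WHERE
        t.location = t.birth_place
)
SELECT
    -- Year and month come from the base table, so unmatched rows keep them
    EXTRACT( YEAR  FROM t."date" ) AS "Year",
    EXTRACT( MONTH FROM t."date" ) AS "Month",
    COALESCE(q1.location, t.location) AS "location",
    COUNT( DISTINCT t.cid ) AS "COUNT( DISTINCT cId )",
    COUNT( * ) AS "CountAll"
FROM
    tbl AS t
LEFT JOIN
    -- Match on customer and month, so a home visit overrides other locations
    q1 ON  q1.cid     = t.cid
       AND q1."Year"  = EXTRACT( YEAR  FROM t."date" )
       AND q1."Month" = EXTRACT( MONTH FROM t."date" )
GROUP BY
    1, 2, 3
ORDER BY
    1, 2, 3;
-- Caveat: a customer with several non-home locations in one month
-- is still counted once under each of those locations.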

With this change, the query returns distinct cid counts for all rows, both those that meet the condition and those that do not.

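Anonymous · Answer 2 · 2023-03-12T02:11:35+00:00

The underlying problem is a distinct count that depends on another column. The solution is to use a CASE ... WHEN expression inside the count function. A WHERE filter can be added to the query as well, which leaves room for other aggregations in future queries.

The code for this approach: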

SELECT
    date_trunc('month', t.date)::date AS date, t.location
    -- CASE yields NULL for non-home rows, and count() ignores NULLs
    , count(DISTINCT CASE WHEN t.location = t.birth_place THEN t.cid END) AS "count"
FROM tbl AS t
WHERE t.location = t.birth_place  -- keeps only rows at the birth place
GROUP BY date_trunc('month', t.date)::date, t.location;

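The code above groups the data by date (truncated to the month) and location and uses the CASE ... WHEN expression inside count for the distinct count: cid is counted when the location equals the birth place and ignored otherwise. The query also adds a WHERE condition that keeps only rows where the location equals the birth place, so customers who were away from their birth place for a whole month are not counted.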
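To see what this filter does on the question's sample data, here is a minimal sketch that runs an SQLite translation of the query through Python's `sqlite3` module. The `strftime` call stands in for PostgreSQL's `date_trunc`, and the in-memory table setup is copied from the question; both are assumptions of this sketch rather than part of the answer.

```python
import sqlite3

# Sample rows from the question: (id, date, cid, birth_place, location)
ROWS = [
    (1, "2022-01-01", 1, "France", "Germany"),
    (2, "2022-01-30", 1, "France", "France"),
    (3, "2022-01-25", 2, "Spain", "Spain"),
    (4, "2022-01-12", 3, "France", "France"),
    (5, "2022-02-01", 4, "England", "Italy"),
    (6, "2022-02-12", 1, "France", "France"),
    (7, "2022-03-05", 5, "Spain", "England"),
    (8, "2022-03-08", 2, "Spain", "Spain"),
    (9, "2022-03-15", 2, "Spain", "Spain"),
    (10, "2022-03-30", 5, "Spain", "Italy"),
    (11, "2022-03-22", 4, "England", "England"),
    (12, "2022-03-22", 3, "France", "England"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tbl (id INTEGER, date TEXT, cid INTEGER,"
    " birth_place TEXT, location TEXT)"
)
con.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?, ?)", ROWS)

# SQLite translation: strftime('%Y-%m-01', ...) truncates to the month.
HOME_ONLY_QUERY = """
SELECT strftime('%Y-%m-01', date) AS month,
       location,
       count(DISTINCT CASE WHEN location = birth_place THEN cid END) AS count
FROM   tbl
WHERE  location = birth_place
GROUP  BY strftime('%Y-%m-01', date), location
ORDER  BY 1, 2
"""

home_only = con.execute(HOME_ONLY_QUERY).fetchall()
for row in home_only:
    print(row)
```

On the sample data this returns five groups. The Italy row for February and two of the three England customers for March from the expected output are missing, because the WHERE clause removes their rows before grouping.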

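Anonymous · Answer 3 · 2023-08-24T23:10:20+00:00

Problem: count distinct based on another column

Assuming the objective is:

Truncate the date to the month.

Pick one location per (month, cid), giving priority to the home location.

Then count rows per (month, location).

Solution: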

SELECT date, location, count(*)
FROM  (
   SELECT DISTINCT ON (1, 2)  -- pick one location per (month, cid)
          date_trunc('month', date)::date AS date, cid, location
   FROM   tbl
   ORDER  BY 1, 2, birth_place = location DESC  -- home location first, else an arbitrary one
   ) sub
GROUP  BY 1, 2
ORDER  BY 1, 2;  -- optional

Note that picking an arbitrary location in the "not at home" case can make the result unstable! You may want to define a deterministic pick.
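One way to make the pick deterministic is to add an explicit tie-breaker after the home-location test. The sketch below assumes "earliest date, then lowest id" as that tie-breaker (the question does not specify one) and runs against an in-memory SQLite database through Python's `sqlite3` module, with `row_number()` and `strftime` standing in for PostgreSQL's `DISTINCT ON` and `date_trunc`.

```python
import sqlite3

# Sample rows from the question: (id, date, cid, birth_place, location)
ROWS = [
    (1, "2022-01-01", 1, "France", "Germany"),
    (2, "2022-01-30", 1, "France", "France"),
    (3, "2022-01-25", 2, "Spain", "Spain"),
    (4, "2022-01-12", 3, "France", "France"),
    (5, "2022-02-01", 4, "England", "Italy"),
    (6, "2022-02-12", 1, "France", "France"),
    (7, "2022-03-05", 5, "Spain", "England"),
    (8, "2022-03-08", 2, "Spain", "Spain"),
    (9, "2022-03-15", 2, "Spain", "Spain"),
    (10, "2022-03-30", 5, "Spain", "Italy"),
    (11, "2022-03-22", 4, "England", "England"),
    (12, "2022-03-22", 3, "France", "England"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tbl (id INTEGER, date TEXT, cid INTEGER,"
    " birth_place TEXT, location TEXT)"
)
con.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?, ?)", ROWS)

# Window functions require SQLite 3.25 or newer.
STABLE_QUERY = """
SELECT month, location, count(*) AS count
FROM  (
   SELECT strftime('%Y-%m-01', date) AS month, cid, location,
          row_number() OVER (
             PARTITION BY strftime('%Y-%m-01', date), cid
             -- home location first, then the assumed tie-breaker
             ORDER BY birth_place = location DESC, date, id
          ) AS rn
   FROM   tbl
   ) AS sub
WHERE  rn = 1          -- exactly one location per (month, cid)
GROUP  BY month, location
ORDER  BY month, location
"""

stable = con.execute(STABLE_QUERY).fetchall()
for row in stable:
    print(row)
```

In PostgreSQL the same tie-breaker can presumably be added to the DISTINCT ON subquery by appending `tbl.date, id` to its ORDER BY. The table qualifier matters there, because a bare `date` in ORDER BY resolves to the truncated output column rather than the original one.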

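Depending on undisclosed details, there may be faster query variants.

About DISTINCT ON and performance:

- [Select first row in each GROUP BY group?](https://stackoverflow.com/questions/3800551/7630564#7630564)

About the sort order:

- [SQL select query order by day and month](https://stackoverflow.com/questions/14650705/14651597#14651597)

If NULL values are involved:

- [Sort by column ASC, but NULL values first?](https://stackoverflow.com/questions/9510509/9511492#9511492)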