본문 바로가기

Oracle/SQL Query

[펌] 중복값 삭제에 Analytic Function을 사용하는 예제


The Power of Analytic Functions

* 원문에서 필요한 부분만 발췌한 자료이므로 필요할 경우 원문을 확인하세요.

 

In this article we show the power of analytic functions on an example
: Removing Duplicate Rows.

 

-- 중복값이 많아서 많은 행은 삭제해야 하는 테이블 확인
select count(*), count(distinct OBJECT_NAME)

from t;
 

  COUNT(*) COUNT(DISTINCT OBJECT_NAME)
---------- ---------------------------
   1697024                        5017

 

Here you can see that we have more than 1.6 million rows but only 5'017 unique ones. We need to delete 1'697'024 - 5'017 = 1'692'007 of my rows (lots). Let's see how speedy this can be done using analytic functions.

 

The normal (non Analytic Function) Approach:

 

-- 서브쿼리를 이용한 삭제 : 시간과 자원이 매우 많이 소요된다.

delete from t
where rowid not in (select min(rowid)
                             from t group by OBJECT_NAME);

 

This statements runs hours and hours, using nearly 100% CPU Power and even after increasing the rollback segment tablespace to 7GB, we are not getting the desired results. This is not the way to go!

 

For the sake of clarity we first demonstrate the deletion on a small table - we duplicate the EMP table: insert into emp (select * from emp), then we want to delete duplicate ENAME's.

 

The Analytic Function Approach:

 

The approach to removing duplicates is a little different. We would generate the set of ROWID's to delete by using analytics and then delete them, like this:

 

select count(*), count(distinct ENAME) from emp;

 

  COUNT(*) COUNT(DISTINCT ENAME)
---------- ---------------------
       28                     14

 

Let's look closer to the solution - the inner query shows the groups:

 

SELECT ename Ename, deptno Deptno, rowid Rid,
  ROW_NUMBER()
  OVER (
     PARTITION BY ename
     ORDER BY rowid
  ) RNo
FROM emp
/

 

Ename          Deptno Rid                       RNo
---------- ---------- ------------------ ----------
ADAMS              20 AAABrzAAEAAAAASAAK          1
ADAMS              20 AAABrzAAEAAAAASAAY          2
ALLEN              30 AAABrzAAEAAAAASAAB          1
ALLEN              30 AAABrzAAEAAAAASAAP          2
BLAKE              30 AAABrzAAEAAAAASAAF          1
BLAKE              30 AAABrzAAEAAAAASAAT          2
CLARK              10 AAABrzAAEAAAAASAAG          1
CLARK              10 AAABrzAAEAAAAASAAU          2
FORD               20 AAABrzAAEAAAAASAAM          1
FORD               20 AAABrzAAEAAAAASAAa          2
JAMES              30 AAABrzAAEAAAAASAAL          1
JAMES              30 AAABrzAAEAAAAASAAZ          2
JONES              20 AAABrzAAEAAAAASAAD          1
JONES              20 AAABrzAAEAAAAASAAR          2
KING               10 AAABrzAAEAAAAASAAI          1
KING               10 AAABrzAAEAAAAASAAW          2
MARTIN             30 AAABrzAAEAAAAASAAE          1
MARTIN             30 AAABrzAAEAAAAASAAS          2
MILLER             10 AAABrzAAEAAAAASAAN          1
MILLER             10 AAABrzAAEAAAAASAAb          2
SCOTT              20 AAABrzAAEAAAAASAAH          1
SCOTT              20 AAABrzAAEAAAAASAAV          2
SMITH              20 AAABrzAAEAAAAASAAA          1
SMITH              20 AAABrzAAEAAAAASAAO          2
TURNER             30 AAABrzAAEAAAAASAAJ          1
TURNER             30 AAABrzAAEAAAAASAAX          2
WARD               30 AAABrzAAEAAAAASAAC          1
WARD               30 AAABrzAAEAAAAASAAQ          2

 

Now show only the Rows which must be deleted, these are all Rows with RNo <> 1.

 

SELECT Ename, Deptno, Rid, RNo
  FROM
    (SELECT ename Ename, deptno Deptno, rowid Rid,
      ROW_NUMBER()
      OVER (
         PARTITION BY ename
         ORDER BY rowid
      ) RNo
    FROM emp
    )
    WHERE RNo <> 1
/

 

ENAME          DEPTNO RID                       RNO
---------- ---------- ------------------ ----------
ADAMS              20 AAABrzAAEAAAAASAAY          2
ALLEN              30 AAABrzAAEAAAAASAAP          2
BLAKE              30 AAABrzAAEAAAAASAAT          2
CLARK              10 AAABrzAAEAAAAASAAU          2
FORD               20 AAABrzAAEAAAAASAAa          2
JAMES              30 AAABrzAAEAAAAASAAZ          2
JONES              20 AAABrzAAEAAAAASAAR          2
KING               10 AAABrzAAEAAAAASAAW          2
MARTIN             30 AAABrzAAEAAAAASAAS          2
MILLER             10 AAABrzAAEAAAAASAAb          2
SCOTT              20 AAABrzAAEAAAAASAAV          2
SMITH              20 AAABrzAAEAAAAASAAO          2
TURNER             30 AAABrzAAEAAAAASAAX          2
WARD               30 AAABrzAAEAAAAASAAQ          2


Now these Rows can easily be deleted:


DELETE FROM emp
 WHERE rowid IN
    (SELECT Rid
      FROM
        (SELECT rowid Rid,
          ROW_NUMBER()
          OVER (
             PARTITION BY ename
             ORDER BY rowid
          ) RNo
        FROM emp
        )
        where RNo <> 1
    );

 

14 rows deleted.

 

Conclusion


Now, we go back to the huge Table: t

 

DELETE FROM t NOLOGGING
 WHERE rowid IN
    (SELECT Rid
      FROM
        (SELECT rowid Rid,
          ROW_NUMBER()
          OVER (
             PARTITION BY object_name
             ORDER BY rowid
          ) RNo
        FROM t
        )
        where RNo <> 1
    );

 

1692007 rows deleted.

Elapsed: 00:08:57.15

 

Execution Plan
----------------------------------------------------------
   0      DELETE STATEMENT Optimizer=CHOOSE
   1    0   DELETE OF 'T'
   2    1     NESTED LOOPS
   3    2       VIEW OF 'VW_NSO_1'
   4    3         SORT (UNIQUE)
   5    4           VIEW
   6    5             WINDOW (SORT)
   7    6               TABLE ACCESS (FULL) OF 'T'
   8    2       TABLE ACCESS (BY USER ROWID) OF 'T'

 

The huge Deletion was complete after 9 Minutes instead of hours - Great!

 

select count(*), count(distinct OBJECT_NAME) from t;

 

  COUNT(*) COUNT(DISTINCT OBJECT_NAME)
---------- --------------------------
      5017                       5017

 

No more duplicate Rows on OBJECT_NAME.

 

More Information about Analytic Functions can be found here.

 

출처 : http://www.akadia.com/services/ora_analytic_functions2.html