본문 바로가기

Oracle/sql * loader

[Oracle] SQL * Loader 성능 향상

 III    Load data in Parallel in Direct Mode with Split the data file

대용량의 데이터 파일을 잘게 쪼개서 로더를 사용하여 올린다 .

/usr/bin/split -l2000000 /u01/oratest/sales_data.dat sales  
2백만건씩 쪼개서 파일 만들기

 

When “the direct load test”  was running , I observed the CPU, memory and disks utilization;  Potentially they have not been 
used much.  With 8 CPUs and 16 GB RAM,  my aim was how I should beat 114 seconds, got from the “DIRECT LOAD” 
method.  
The only way I could beat the 114 seconds run time is to divide and conquer approach. That is, using parallel option. However, 
I have only one data file.  Hence, I split the data file (see below the shell script) and  used 8 parallel threads to complete the 
load process.

Script :
#!/bin/ksh
date
/usr/bin/split -l2000000 /u01/oratest/sales_data.dat sales
mv salesaa salesaa.dat
mv salesab salesab.dat
mv salesac salesac.dat
mv salesad salesad.dat
mv salesae salesae.dat
mv salesaf salesaf.dat
mv salesag salesag.dat
mv salesah salesah.dat
echo "Split is done"
date
sqlldr / control=load_sales.ctl bad=bad_1.dat streamsize=1048576 readsize=1048576 \
data=salesaa.dat log=load_sales_pll_1.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data01.dbf \
multithreading=true direct=true parallel=true &
sqlldr / control=load_sales.ctl bad=bad_2.dat streamsize=1048576 readsize=1048576 \
data=salesab.dat log=load_sales_pll_2.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data02.dbf \
multithreading=true direct=true parallel=true  &
sqlldr / control=load_sales.ctl bad=bad_3.dat streamsize=1048576 readsize=1048576 \
data=salesac.dat log=load_sales_pll_3.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data03.dbf \
multithreading=true direct=true parallel=true   &
sqlldr / control=load_sales.ctl bad=bad_4.dat streamsize=1048576 readsize=1048576 \
data=salesad.dat log=load_sales_pll_4.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data04.dbf \
multithreading=true direct=true parallel=true &
sqlldr / control=load_sales.ctl bad=bad_5.dat streamsize=1048576 readsize=1048576 \
data=salesae.dat log=load_sales_pll_5.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data05.dbf \
multithreading=true direct=true parallel=true &
sqlldr / control=load_sales.ctl bad=bad_6.dat streamsize=1048576 readsize=1048576 \
data=salesaf.dat log=load_sales_pll_6.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data06.dbf \
multithreading=true direct=true parallel=true  &
sqlldr / control=load_sales.ctl bad=bad_7.dat streamsize=1048576 readsize=1048576 \
data=salesag.dat log=load_sales_pll_7.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data07.dbf \
multithreading=true direct=true parallel=true &
sqlldr / control=load_sales.ctl bad=bad_8.dat streamsize=1048576 readsize=1048576 \
data=salesah.dat log=load_sales_pll_8.log  columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data08.dbf \
multithreading=true direct=true parallel=true &
wait
date
exit 0

Each parallel thread reads a small file of size 400MB and loads it directtly into the data blocks.
By adding FILE parameter , I force each parallel thread to write the data into  separate data file with in the tablespace.

Log file: nohup_split.out
Sat Nov 12 17:16:36 GMT 2005
Split is done
Sat Nov 12 17:19:11 GMT 2005
SQL*Loader: Release 9.2.0.5.0 - Production on Sat Nov 12 17:19:11 2005
Load completed - logical record count 336066.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Sat Nov 12 17:19:35 GMT 2005                             Total Run Time 180 Seconds.

In this method, the UNIX split consumed 156 seconds and the SQL*Loader ran for 24 seconds.
The result is NOT attractive compared with “simple direct load” run time of 114 seconds. 




IV  Load data in Parallel in Direct Mode without Spliting the data file 


요거는 더 대박!!! skip을 사용해서 데이터 건너 뛰고 작업 수행하기
*** 테스트 필요

I have to come out with a different idea, in order to beat  “direct load run time of 114” seconds. Finally two parameters SKIP and 
LOAD gave me a clue. By combining three parameters PARALLEL, SKIP  and LOAD, I wrote the following script to load the 
data file.

Script : load_sales_pll.ksh
#!/bin/ksh
date
## 512KB = <524288>
## 1 MB =<1048576>
## 2 MB = <2097152>
## 4 MB = <4194304>
## 8 MB = <8388608>
sqlldr / control=load_sales.ctl bad=bad_1.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_1.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data01.dbf \
multithreading=true direct=true parallel=true load=2000000 &
sqlldr / control=load_sales.ctl bad=bad_2.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_2.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data02.dbf \
multithreading=true direct=true parallel=true skip=2000000 load=2000000 &
sqlldr / control=load_sales.ctl bad=bad_3.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_3.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data03.dbf \
multithreading=true direct=true parallel=true skip=4000000 load=2000000  &
sqlldr / control=load_sales.ctl bad=bad_4.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_4.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data04.dbf \
multithreading=true direct=true parallel=true skip=6000000 load=2000000 &
sqlldr / control=load_sales.ctl bad=bad_5.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_5.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data05.dbf \
multithreading=true direct=true parallel=true skip=8000000 load=2000000 &
sqlldr / control=load_sales.ctl bad=bad_6.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_6.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data06.dbf \
multithreading=true direct=true parallel=true skip=10000000 load=2000000 &
sqlldr / control=load_sales.ctl bad=bad_7.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_7.log columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data07.dbf \
multithreading=true direct=true parallel=true skip=12000000 load=2000000 &
sqlldr / control=load_sales.ctl bad=bad_8.dat streamsize=1048576 readsize=1048576 \
data=sales_data.dat log=load_sales_pll_8.log  columnarrayrows=8000 \
file=/u01/oratest/sbltest_large_data08.dbf \
multithreading=true direct=true parallel=true skip=14000000 &
wait
date
exit 0

I did not split the data file. By using SKIP parameter I told the SQL*Loader process to skip certain number of records before it 
writes the records into data blocks.
For example, the first loader process would insert only 2 Million Rows.
The 2nd loader process would skip the first 2 Million Rows and insert next 2 Million Rows.
The 3rd loader process would skip the first 4 Million Rows and insert only next 2 Million Rows, and so on…

Log: nohup_pll_1.out
Sat Nov 12 16:17:04 GMT 2005
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 336066.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Load completed - logical record count 2000000.
Sat Nov 12 16:17:42 GMT 2005       ----------- Total Run Time : 38 Seconds

!!!! Bingo. The total run time was 38 Seconds, out beat the “direct load method” run time 114 seconds.

Tips:

1.        As a DBA/developer, you need to find out correct values for ROWS, READSIZE, STREAMSIZE, and 
COLUMNARRAYROWS in your production environment. You need to do several tests to get the best performance.
2.        If the input data file arrives as a single file, then you can use “Parallel Direct Mode with Skip option”.
3.        If you get multiple data files, then use “Parallel Direct Mode with Split data file” approach. No need to run Unix Split 
command.

Reference: http://oracleact.com/papers/sqlldrperf.html