最近正在進(jìn)行ETL后臺(tái)系統(tǒng)數(shù)據(jù)的日志分析,查看運(yùn)行耗時(shí)長(zhǎng)的TASK,并找出耗時(shí)長(zhǎng)的JOB,進(jìn)行邏輯層面和數(shù)據(jù)庫(kù)層面的優(yōu)化.本文僅從數(shù)據(jù)庫(kù)層面上的優(yōu)化著手(包括SQL語句的調(diào)整以及greenplum table dk的調(diào)整).查看一個(gè)耗時(shí)30分鐘左右的JOB,找到相應(yīng)的源表,進(jìn)行如下分析:
dw=#select
gp_segment_id,
count
(
*
)
from
tb_name
group
by
gp_segment_id
order
by
count
(
*
)
desc
gp_segment_id count
----------------------
65 16655
說明:gp_segment_id是greenplum table里面的一個(gè)隱藏列,用來標(biāo)記該行屬于哪個(gè)節(jié)點(diǎn).由此可見,該表只分布在一個(gè)節(jié)點(diǎn)65上(節(jié)點(diǎn)信息請(qǐng)查看gp_segment_configuration),而我的gp總共有96個(gè)節(jié)點(diǎn),這顯然沒有利用到gp多節(jié)點(diǎn)運(yùn)算能力,該表的DK值設(shè)置的有問題.因此,使用alter table tb_name set distributed by (col1,...)對(duì)表的DK值進(jìn)行重新設(shè)置.然后重新運(yùn)行上面的語句,一方面觀察節(jié)點(diǎn)數(shù)(是否每個(gè)節(jié)點(diǎn)都分布了),另一方面觀察節(jié)點(diǎn)的條數(shù)(分布是否平衡)。在上述二項(xiàng)觀察指標(biāo)大致滿足要求后,請(qǐng)vacuum full、vacuum analyze一樣,徹底回收空間+收集統(tǒng)計(jì)信息。把耗時(shí)長(zhǎng)JOB的源表抓出來,逐個(gè)分析,整個(gè)TASK的執(zhí)行時(shí)長(zhǎng)從3小時(shí)縮短到2小時(shí)左右(主要是之前表設(shè)計(jì)的太差,才導(dǎo)致有這么大的優(yōu)化空間),后期就是對(duì)邏輯以及SQL的優(yōu)化,以及提高并發(fā)度,這才是王道。
為了統(tǒng)計(jì)分析方便,設(shè)計(jì)了如下二張表和若干function,用來收集表的分布情況,并發(fā)現(xiàn)哪些表需要進(jìn)行重新調(diào)整DK值。
--
二張表
CREATE
TABLE
"
public
"."table_segment_statistics" (
"table_name"
varchar
(
200
)
DEFAULT
NULL
,
"segment_count" int4
DEFAULT
NULL
,
"table_rows" int8
DEFAULT
NULL
);
CREATE
TABLE
"
public
"."table_segment_statistics_balance" (
"table_name"
varchar
(
200
)
DEFAULT
NULL
,
"segment_id" int4
DEFAULT
NULL
,
"segment_count" int8
DEFAULT
NULL
);
--
function
CREATE
OR
REPLACE
FUNCTION
"
public
"."analyze_table_dk_balance"(v_schemaname
varchar
)
RETURNS
"pg_catalog"."int4"
AS
$BODY$
DECLARE
v_tb
varchar
(
200
);
v_cur_tb
cursor
for
select
schemaname
||
'
.
'
||
tablename
from
pg_tables
where
schemaname
<>
'
information_schema
'
and
schemaname
<>
'
pg_catalog
'
and
schemaname
<>
'
gp_toolkit
'
and
tablename
not
like
'
%prt%
'
and
schemaname
=
v_schemaname;
BEGIN
truncate
table
public
.table_segment_statistics;
truncate
table
public
.table_segment_statistics_balance;
open
v_cur_tb;
loop
fetch
v_cur_tb
into
v_tb;
if
not
found
THEN
exit
;
end
if
;
execute
'
insert into public.table_segment_statistics select
'''
||
v_tb
||
'''
as table_name,count(*) as segment_id,sum(num) as table_rows from (select gp_segment_id,count(*) num from
'
||
v_tb
||
'
group by gp_segment_id) t
'
;
execute
'
insert into public.table_segment_statistics_balance select
'''
||
v_tb
||
'''
as table_name,gp_segment_id,count(*) as cnt from
'
||
v_tb
||
'
group by gp_segment_id order by gp_segment_id
'
;
end
loop;
RETURN
0
;
end
;
$BODY$
LANGUAGE
'
plpgsql
'
VOLATILE;
分析的語句如下:
--
96指的是greenplum的節(jié)點(diǎn)(我的機(jī)器是96個(gè))
select
*
from
public
.table_segment_statistics
where
table_rows
is
not
null
and
segment_count
<
96
and
table_rows
>
10000
order
by
table_rows
desc
;
--
找出比平均值超出10%的節(jié)點(diǎn),這個(gè)閥值可以自行調(diào)整,另:只統(tǒng)計(jì)超過1萬行的表,小表沒有太大的分析意義
select
a."table_name",b.segment_id,a.table_rows
/
a.segment_count
as
reldk,b.segment_count
from
"
public
".table_segment_statistics a
inner
join
"
public
".table_segment_statistics_balance b
on
a."table_name"
=
b."table_name"
where
a."table_name"
is
not
null
and
a.table_rows
>
10000
and
abs
(a.table_rows
/
a.segment_count
-
b.segment_count)
/
(a.table_rows
/
a.segment_count)
>
0.1
?
更多文章、技術(shù)交流、商務(wù)合作、聯(lián)系博主
微信掃碼或搜索:z360901061
微信掃一掃加我為好友
QQ號(hào)聯(lián)系: 360901061
您的支持是博主寫作最大的動(dòng)力,如果您喜歡我的文章,感覺我的文章對(duì)您有幫助,請(qǐng)用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點(diǎn)擊下面給點(diǎn)支持吧,站長(zhǎng)非常感激您!手機(jī)微信長(zhǎng)按不能支付解決辦法:請(qǐng)將微信支付二維碼保存到相冊(cè),切換到微信,然后點(diǎn)擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對(duì)您有幫助就好】元

