您的位置: ITPUB個人空間 ? cc59的個人空間 ? 日志 發布新日志
我的日志我的足跡我的收藏
unix/linuxHA隨筆backup&restoreperformance tuningTroubleshootConcepts&Basic
RAC Diagnostics Script
2007-02-15 00:00:00
?
from metalink:
This script is broken up into different SQL statements that can be used individually.? Each SQL statement adds information to help in debugging an RAC hang/severe performance scenerio.? Script------- - - - - - - - - - - - - - - - - Script begins here - - - - - - - - - - - - - - - --- NAME:? RACDIAG.SQL --??? SYS OR INTERNAL USER, CATPARR.SQL ALREADY RUN, PARALLEL QUERY OPTION ON-- ------------------------------------------------------------------------ -- AUTHOR:? --??? Michael Polaski - Oracle Support Services - DataServer Group--??? Copyright 2002, Oracle Corporation????? -- ------------------------------------------------------------------------ -- PURPOSE: -- This script is intended to provide a user friendly guide to troubleshoot RAC? -- hung sessions or slow performance scenerios.? The script includes information -- to gather a variety of important debug information to determine the cause of an -- RAC hang.? The script will create a file called racdiag_<timestamp>.out -- in your local directory while dumping hang analyze dumps in the user_dump_dest(s)-- and background_dump_dest(s) on all nodes.---- ------------------------------------------------------------------------ -- DISCLAIMER: --??? This script is provided for educational purposes only. It is NOT? --??? supported by Oracle World Wide Technical Support. --??? The script has been tested and appears to work as intended. --??? You should always run new scripts on a test instance initially. -- ------------------------------------------------------------------------ -- Script output is as follows:set echo offset feedback offcolumn timecol new_value timestampcolumn spool_extension new_value suffixselect to_char(sysdate,'Mondd_hhmi') timecol,'.out' spool_extension from sys.dual;column output new_value dbnameselect value || '_' outputfrom v$parameter where name = 'db_name';spool racdiag_&&dbname&×tamp&&suffixset lines 200set pagesize 35set trim onset trims onalter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';alter session set timed_statistics = true;set feedback onselect to_char(sysdate) time from dual;set numwidth 5column host_name format a20 truselect inst_id, instance_name, host_name, version, status, startup_timefrom gv$instanceorder by inst_id;set echo on-- Taking Hang Analyze dumps-- This may take a little while...oradebug setmypidoradebug unlimitoradebug -g all hanganalyze 3-- This part may take the longest, you can monitor bdump or udump to see if the-- file is being generated.oradebug -g all dump systemstate 266-- WAITING SESSIONS:-- The entries that are shown at the top are the sessions that have -- waited the longest amount of time that are waiting for non-idle wait -- events (event column).? You can research and find out what the wait-- event indicates (along with its parameters) by checking the Oracle -- Server Reference Manual or look for any known issues or documentation -- by searching Metalink for the event name in the search bar.? Example -- (include single quotes): [ 'buffer busy due to global cache' ].-- Metalink and/or the Server Reference Manual should return some useful -- information on each type of wait event.? The inst_id column shows the-- instance where the session resides and the SID is the unique identifier-- for the session (gv$session).? The p1, p2, and p3 columns will show -- event specific information that may be important to debug the problem.-- To find out what the p1, p2, and p3 indicates see the next section. -- Items with wait_time of anything other than 0 indicate we do not know-- how long these sessions have been waiting.--set numwidth 10column state format a7 trucolumn event format a25 trucolumn last_sql format a40 truselect sw.inst_id, sw.sid, sw.state, sw.event, sw.seconds_in_wait seconds, sw.p1, sw.p2, sw.p3, sa.sql_text last_sqlfrom gv$session_wait sw, gv$session s, gv$sqlarea sawhere sw.event not in ('rdbms ipc message','smon timer','pmon timer','SQL*Net message from client','lock manager wait for remote message','ges remote message', 'gcs remote message', 'gcs for action', 'client message', 'pipe get', 'null event', 'PX Idle Wait', 'single-task message', 'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue', 'listen endpoint status','slave wait','wakeup time manager')and sw.seconds_in_wait > 0 and (sw.inst_id = s.inst_id and sw.sid = s.sid)and (s.inst_id = sa.inst_id and s.sql_address = sa.address)order by seconds desc;-- EVENT PARAMETER LOOKUP:-- This section will give a description of the parameter names of the-- events seen in the last section.? p1test is the parameter value for-- p1 in the WAITING SESSIONS section while p2text is the parameter-- value for p3 and p3 text is the parameter value for p3.? The-- parameter values in the first section can be helpful for debugging-- the wait event.--column event format a30 trucolumn p1text format a25 trucolumn p2text format a25 trucolumn p3text format a25 truselect distinct event, p1text, p2text, p3textfrom gv$session_wait swwhere sw.event not in ('rdbms ipc message','smon timer','pmon timer','SQL*Net message from client','lock manager wait for remote message','ges remote message', 'gcs remote message', 'gcs for action', 'client message', 'pipe get', 'null event', 'PX Idle Wait', 'single-task message', 'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue', 'listen endpoint status','slave wait','wakeup time manager')and seconds_in_wait > 0order by event;-- GES LOCK BLOCKERS:-- This section will show us any sessions that are holding locks that-- are blocking other users.? The inst_id will show us the instance that-- the session resides on while the sid will be a unique identifier for-- the session.? The grant_level will show us how the GES lock is granted to -- the user.? The request_level will show us what status we are trying to obtain.-- The lockstate column will show us what status the lock is in.? The last column -- shows how long this session has been waiting.--set numwidth 5column state format a16 tru;column event format a30 tru;select dl.inst_id, s.sid, p.spid, dl.resource_name1, decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)','KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)','KJUSEREX','Exclusive',request_level) as grant_level,decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)','KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)','KJUSEREX','Exclusive',request_level) as request_level, decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening','KJUSERCA','Canceling','KJUSERCV','Converting') as state,s.sid, sw.event, sw.seconds_in_wait secfrom gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait swwhere blocker = 1and (dl.inst_id = p.inst_id and dl.pid = p.spid)and (p.inst_id = s.inst_id and p.addr = s.paddr)and (s.inst_id = sw.inst_id and s.sid = sw.sid)order by sw.seconds_in_wait desc;-- GES LOCK WAITERS:-- This section will show us any sessions that are waiting for locks that-- are blocked by other users.? The inst_id will show us the instance that-- the session resides on while the sid will be a unique identifier for-- the session.? The grant_level will show us how the GES lock is granted to -- the user.? The request_level will show us what status we are trying to obtain.-- The lockstate column will show us what status the lock is in.? The last column -- shows how long this session has been waiting.--set numwidth 5column state format a16 tru;column event format a30 tru;select dl.inst_id, s.sid, p.spid, dl.resource_name1, decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)','KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)','KJUSEREX','Exclusive',request_level) as grant_level,decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)','KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)','KJUSEREX','Exclusive',request_level) as request_level, decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening','KJUSERCA','Cancelling','KJUSERCV','Converting') as state,s.sid, sw.event, sw.seconds_in_wait secfrom gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait swwhere blocked = 1and (dl.inst_id = p.inst_id and dl.pid = p.spid)and (p.inst_id = s.inst_id and p.addr = s.paddr)and (s.inst_id = sw.inst_id and s.sid = sw.sid)order by sw.seconds_in_wait desc;-- LOCAL ENQUEUES:-- This section will show us if there are any local enqueues.? The inst_id will -- show us the instance that the session resides on while the sid will be a -- unique identifier for.? The addr column will show the lock? address. The type -- will show the lock type.? The id1 and id2 columns will show specific parameters -- for the lock type.? --set numwidth 12column event format a12 truselect l.inst_id, l.sid, l.addr, l.type, l.id1, l.id2, decode(l.block,0,'blocked',1,'blocking',2,'global') block, sw.event, sw.seconds_in_wait secfrom gv$lock l, gv$session_wait swwhere (l.sid = sw.sid and l.inst_id = sw.inst_id) and l.block in (0,1)order by l.type, l.inst_id, l.sid;-- LATCH HOLDERS:-- If there is latch contention or 'latch free' wait events in the WAITING-- SESSIONS section we will need to find out which proceseses are holding -- latches.? The inst_id will show us the instance that the session resides -- on while the sid will be a unique identifier for.? The username column -- will show the session's username.? The os_user column will show the os -- user that the user logged in as.? The name column will show us the type-- of latch being waited on.? You can search Metalink for the latch name in -- the search bar.? Example (include single quotes): -- [ 'library cache' latch ]. Metalink should return some useful information -- on the type of latch.? --set numwidth 5select distinct lh.inst_id, s.sid, s.username, p.username os_user, lh.namefrom gv$latchholder lh, gv$session s, gv$process pwhere (lh.sid = s.sid? and lh.inst_id = s.inst_id)and (s.inst_id = p.inst_id and s.paddr = p.addr)order by lh.inst_id, s.sid;-- LATCH STATS:-- This view will show us latches with less than optimal hit ratios-- The inst_id will show us the instance for the particular latch.? The -- latch_name column will show us the type of latch.? You can search Metalink -- for the latch name in the search bar.? Example (include single quotes): -- [ 'library cache' latch ]. Metalink should return some useful information -- on the type of latch.? The hit_ratio shows the percentage of time we -- successfully acquired the latch.--column latch_name format a30 truselect inst_id, name latch_name,round((gets-misses)/decode(gets,0,1,gets),3) hit_ratio, round(sleeps/decode(misses,0,1,misses),3) "SLEEPS/MISS"from gv$latchwhere round((gets-misses)/decode(gets,0,1,gets),3) < .99and gets != 0order by round((gets-misses)/decode(gets,0,1,gets),3);-- No Wait Latches:--select inst_id, name latch_name,round((immediate_gets/(immediate_gets+immediate_misses)), 3) hit_ratio, round(sleeps/decode(immediate_misses,0,1,immediate_misses),3) "SLEEPS/MISS"from gv$latchwhere round((immediate_gets/(immediate_gets+immediate_misses)), 3) < .99and immediate_gets + immediate_misses > 0order by round((immediate_gets/(immediate_gets+immediate_misses)), 3);-- GLOBAL CACHE CR PERFORMANCE-- This shows the average latency of a consistent block request.? -- AVG CR BLOCK RECEIVE TIME should typically be about 15 milliseconds depending -- on your system configuration and volume, is the average latency of a -- consistent-read request round-trip from the requesting instance to the holding -- instance and back to the requesting instance. If your CPU has limited idle time -- and your system typically processes long-running queries, then the latency may -- be higher. However, it is possible to have an average latency of less than one -- millisecond with User-mode IPC. Latency can be influenced by a high value for -- the DB_MULTI_BLOCK_READ_COUNT parameter. This is because a requesting process -- can issue more than one request for a block depending on the setting of this -- parameter. Correspondingly, the requesting process may wait longer.? Also check-- interconnect badwidth, OS tcp settings, and OS udp settings if -- AVG CR BLOCK RECEIVE TIME is high.--set numwidth 20column "AVG CR BLOCK RECEIVE TIME (ms)" format 9999999.9select b1.inst_id, b2.value "GCS CR BLOCKS RECEIVED", b1.value "GCS CR BLOCK RECEIVE TIME",((b1.value / b2.value) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"from gv$sysstat b1, gv$sysstat b2where b1.name = 'global cache cr block receive time' andb2.name = 'global cache cr blocks received' and b1.inst_id = b2.inst_id ;-- GLOBAL CACHE LOCK PERFORMANCE-- This shows the average global enqueue get time. -- Typically AVG GLOBAL LOCK GET TIME should be 20-30 milliseconds.? the elapsed -- time for a get includes the allocation and initialization of a new global -- enqueue. If the average global enqueue get (global cache get time) or average -- global enqueue conversion times are excessive, then your system may be -- experiencing timeouts.? See the 'WAITING SESSIONS', 'GES LOCK BLOCKERS', -- 'GES LOCK WAITERS', and 'TOP 10 WAIT EVENTS ON SYSTEM' sections if the -- AVG GLOBAL LOCK GET TIME is high.--set numwidth 20column "AVG GLOBAL LOCK GET TIME (ms)" format 9999999.9select b1.inst_id, (b1.value + b2.value) "GLOBAL LOCK GETS", b3.value "GLOBAL LOCK GET TIME",(b3.value / (b1.value + b2.value) * 10) "AVG GLOBAL LOCK GET TIME (ms)"from gv$sysstat b1, gv$sysstat b2, gv$sysstat b3where b1.name = 'global lock sync gets' andb2.name = 'global lock async gets' and b3.name = 'global lock get time'and b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id;-- RESOURCE USAGE-- This section will show how much of our resources we have used. --set numwidth 8select inst_id, resource_name, current_utilization, max_utilization,initial_allocationfrom gv$resource_limitwhere max_utilization > 0order by inst_id, resource_name;-- DLM TRAFFIC INFORMATION-- This section shows how many tickets are available in the DLM.? If the -- TCKT_WAIT columns says "YES" then we have run out of DLM tickets which could-- cause a DLM hang.? Make sure that you also have enough TCKT_AVAIL.? --set numwidth 5select * from gv$dlm_traffic_controllerorder by TCKT_AVAIL;-- DLM MISC--set numwidth 10select * from gv$dlm_misc;-- LOCK CONVERSION DETAIL:-- This view shows the types of lock conversion being done on each instance.--select * from gv$lock_activity;-- TOP 10 WRITE PINGING/FUSION OBJECTS-- This view shows the top 10 objects for write pings accross instances.? -- The inst_id column shows the node that the block was pinged on.? The name -- column shows the object name of the offending object.? The file# shows the -- offending file number (gc_files_to_locks).? The STATUS column will show the -- current status of the pinged block.? The READ_PINGS will show us read converts -- and the WRITE_PINGS will show us objects with write converts.? Any rows that -- show up are objects that are concurrently accessed across more than 1 instance.--set numwidth 8column name format a20 trucolumn kind format a10 truselect inst_id, name, kind, file#, status, BLOCKS, READ_PINGS, WRITE_PINGSfrom (select p.inst_id, p.name, p.kind, p.file#, p.status, count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS, sum(p.forced_writes) WRITE_PINGSfrom gv$ping p, gv$datafile dfwhere p.file# = df.file# (+)group by p.inst_id, p.name, p.kind, p.file#, p.statusorder by sum(p.forced_writes) desc)where rownum < 11order by WRITE_PINGS desc;-- TOP 10 READ PINGING/FUSION OBJECTS-- This view shows the top 10 objects for read pings.? The inst_id column shows -- the node that the block was pinged on.? The name column shows the object name -- of the offending object.? The file# shows the offending file number -- (gc_files_to_locks).? The STATUS column will show the current status of the-- pinged block.? The READ_PINGS will show us read converts and the WRITE_PINGS -- will show us objects with write converts.? Any rows that show up are objects -- that are concurrently accessed across more than 1 instance.--set numwidth 8column name format a20 trucolumn kind format a10 truselect inst_id, name, kind, file#, status, BLOCKS, READ_PINGS, WRITE_PINGSfrom (select p.inst_id, p.name, p.kind, p.file#, p.status, count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS, sum(p.forced_writes) WRITE_PINGSfrom gv$ping p, gv$datafile dfwhere p.file# = df.file# (+)group by p.inst_id, p.name, p.kind, p.file#, p.statusorder by sum(p.forced_reads) desc)where rownum < 11order by READ_PINGS desc;-- TOP 10 FALSE PINGING OBJECTS-- This view shows the top 10 objects for false pings.? This can be avoided by-- better gc_files_to_locks configuration.? The inst_id column shows the node-- that the block was pinged on.? The name column shows the object name of the -- offending object.? The file# shows the offending file number -- (gc_files_to_locks).? The STATUS column will show the current status of the-- pinged block.? The READ_PINGS will show us read converts and the WRITE_PINGS -- will show us objects with write converts.? Any rows that show up are objects -- that are concurrently accessed across more than 1 instance.--set numwidth 8column name format a20 trucolumn kind format a10 truselect inst_id, name, kind, file#, status, BLOCKS, READ_PINGS, WRITE_PINGSfrom (select p.inst_id, p.name, p.kind, p.file#, p.status, count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS, sum(p.forced_writes) WRITE_PINGSfrom gv$false_ping p, gv$datafile dfwhere p.file# = df.file# (+)group by p.inst_id, p.name, p.kind, p.file#, p.statusorder by sum(p.forced_writes) desc)where rownum < 11order by WRITE_PINGS desc;-- INITIALIZATION PARAMETERS:-- Non-default init parameters for each node.--set numwidth 5column name format a30 trucolumn value format a50 wracolumn description format a60 truselect inst_id, name, value, descriptionfrom gv$parameterwhere isdefault = 'FALSE'order by inst_id, name;-- TOP 10 WAIT EVENTS ON SYSTEM-- This view will provide a summary of the top wait events in the db.--set numwidth 10column event format a25 truselect inst_id, event, time_waited, total_waits, total_timeoutsfrom (select inst_id, event, time_waited, total_waits, total_timeoutsfrom gv$system_event where event not in ('rdbms ipc message','smon timer','pmon timer', 'SQL*Net message from client','lock manager wait for remote message','ges remote message', 'gcs remote message', 'gcs for action', 'client message', 'pipe get', 'null event', 'PX Idle Wait', 'single-task message', 'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue', 'listen endpoint status','slave wait','wakeup time manager')order by time_waited desc)where rownum < 11order by time_waited desc;-- SESSION/PROCESS REFERENCE:-- This section is very important for most of the above sections to find out -- which user/os_user/process is identified to which session/process.--set numwidth 7column event format a30 trucolumn program format a25 trucolumn username format a15 truselect p.inst_id, s.sid, s.serial#, p.pid, p.spid, p.program, s.username, p.username os_user, sw.event, sw.seconds_in_wait sec? from gv$process p, gv$session s, gv$session_wait swwhere (p.inst_id = s.inst_id and p.addr = s.paddr)and (s.inst_id = sw.inst_id and s.sid = sw.sid)order by p.inst_id, s.sid;-- SYSTEM STATISTICS:-- All System Stats with values of > 0.? These can be referenced in the-- Server Reference Manual--set numwidth 5column name format a60 trucolumn value format 9999999999999999999999999select inst_id, name, valuefrom gv$sysstatwhere value > 0 order by inst_id, name;-- CURRENT SQL FOR WAITING SESSIONS:-- Current SQL for any session in the WAITING SESSIONS list--set numwidth 5column sql format a80 wraselect sw.inst_id, sw.sid, sw.seconds_in_wait sec, sa.sql_text sqlfrom gv$session_wait sw, gv$session s, gv$sqlarea sawhere sw.sid = s.sid (+) and sw.inst_id = s.inst_id (+)and s.sql_address = sa.address and sw.event not in ('rdbms ipc message','smon timer','pmon timer','SQL*Net message from client','lock manager wait for remote message','ges remote message', 'gcs remote message', 'gcs for action', 'client message', 'pipe get', 'null event', 'PX Idle Wait', 'single-task message', 'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue', 'listen endpoint status','slave wait','wakeup time manager')and seconds_in_wait > 0order by sw.seconds_in_wait desc; -- Taking Hang Analyze dumps-- This may take a little while...oradebug setmypidoradebug unlimitoradebug -g all hanganalyze 3-- This part may take the longest, you can monitor bdump or udump to see if the-- file is being generated.oradebug -g all dump systemstate 266set echo offselect to_char(sysdate) time from dual;spool off-- ---------------------------------------------------------------------------Prompt;Prompt racdiag output files have been written to:;Prompt;host pwdPrompt alert log and trace files are located in:;column host_name format a12 trucolumn name format a20 trucolumn value format a60 truselect distinct i.host_name, p.name, p.valuefrom gv$instance i, gv$parameter pwhere p.inst_id = i.inst_id (+)and p.name like '%_dump_dest' and p.name != 'core_dump_dest'; - - - - - - - - - - - - - - - -? Script ends here? - - - - - - - - - - - - - - - -查看(59) 評論(0) 收藏 分享 圈子 管理
RMAN-06726
2007-02-13 00:00:00
以
在RAC環境中一個RMAN備份出錯的案例.
環境是四節點的RAC.在備份歸檔日志時出現以下錯誤:
allocated channel: c1
channel ch00: sid=132 devtype=DISK
allocated channel: c2
channel ch01: sid=32 devtype=DISK
allocated channel: c3
channel ch02: sid=156 devtype=DISK
allocated channel: c4
channel ch02: sid=123 devtype=DISK
Starting backup at 03-FEB-07
released channel: c1
released channel: c2
released channel: c3
released channel: c4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 02/03/2007 01:17:34
RMAN-06726: could not locate archivelog XXXXXXXXXXXXXXXXX
查找原因.檢查腳本.無任何問題.檢查歸檔路徑,無任何問題.
歸檔日志全部都存在.檢查參數.發現其中一個實例中cluster_database_instances 參數值為3. 因為是新添加了一個節點.因此值并未作修改.將cluster_database_instances 修改為4后備份歸檔日志恢復正常.
查看(201) 評論(0) 收藏 分享 圈子 管理
rac ora-12545
2007-02-09 00:00:00
?
hpunix
oracle9205 rac
由于node1已很長時間沒能起來,直到同步vg信息后才完成RAC的重組。
startup 兩node.
啟動應用,發現無法連接。ORA-12545: Connect failed because target host or object does not exist ,但重試又可連上.檢查原因后,發現兩實例都設了local_listener和remote_listener.將其清除后問題解決!
查看(75) 評論(0) 收藏 分享 圈子 管理
Oracle RAC Wait Events
2007-01-19 00:00:00
?
RAC Differences
The main difference to keep in mind when monitoring a RAC database versus a singleinstance
database, is the buffer cache and its operation. In a RAC environment the
buffer cache is global across all instances in the cluster and hence the processing
differs. When a process in a RAC database needs to modify or read data, Oracle will
first check to see if it already exists in the local buffer cache. If the data is not in the
local buffer cache the global buffer cache will be reviewed to see if another instance
already has it in their buffer cache. In this case the remote instance will send the data
to the local instance via the high-speed interconnect, thus avoiding a disk read.
Monitoring a RAC database often means monitoring this situation and the amount of
requests going back and forth over the RAC interconnect. The most common wait
events related to this are gc cr request and gc buffer busy.
gc cr request
This wait event, also known as global cache cr request prior to Oracle 10g, specifies the
time it takes to retrieve the data from the remote cache. High wait times for this wait
event often are because of:
1. RAC Traffic Using Slow Connection - typically RAC traffic should use a high-speed
interconnect to transfer data between instances, however, sometimes Oracle may not
pick the correct connection and instead route traffic over the slower public network.
This will significantly increase the amount of wait time for the gc rc request event. The
oradebug command can be used to verify which network is being used for RAC traffic:
SQL> oradebug setmypid
SQL> oradebug ipc
This will dump a trace file to the location specified by the user_dump_dest Oracle
parameter containing information about the network and protocols being used for the
RAC interconnect.
2. Inefficient Queries poorly tuned queries will increase the amount of data blocks
requested by an Oracle session. The more blocks requested typically means the more
often a block will need to be read from a remote instance via the interconnect.
gc buffer busy
This wait event, also known as global cache buffer busy prior to Oracle 10g, specifies
the time the remote instance locally spends accessing the requested data block. Thiswait
event is very similar to the buffer busy waits wait event in a single-instance
database and are often the result of:
1. Hot Blocks - multiple sessions may be requesting a block that is either not in buffer
cache or is in an incompatible mode. Deleting some of the hot rows and re-inserting
them back into the table may alleviate the problem. Most of the time the rows will be
placed into a different block and reduce contention on the block. The DBA may also
need to adjust the pctfree and/or pctused parameters for the table to ensure the rows
are placed into a different block.
2. Inefficient Queries as with the gc cr request wait event, the more blocks requested
from the buffer cache the more likelihood of a session having to wait for other sessions.
Tuning queries to access fewer blocks will often result in less contention for the same
block.
Conclusion
Oracle RAC is somewhat of a unique case of an Oracle environment, but everything
learned about wait events in the single instance database also applies to clustered
databases. However, the special use of a global buffer cache in RAC makes it
imperative to monitor inter-instance communication via the cluster-specific wait events
such as the ones discussed above. Understanding these wait events will help in the
diagnosis of problems and pinpointing solutions in a RAC database.
查看(218) 評論(0) 收藏 分享 圈子 管理
oracle pk sybase
2007-01-19 00:00:00
?
一個用戶說要為數據倉庫選型.
制定了一套基準測試方案.sybase由原廠的工程師做.ORACLE這這我負責.
發現加載數據ORACLE快得多.
5000萬的數量量.sqlldr 5分鐘搞定.不過sybase可是用了將近20分鐘啊.在檢索方面ORACLE當然也雖強于SYBASE啦.
查看(27) 評論(0) 收藏 分享 圈子 管理
處理了一個ora-12500
2007-01-18 00:00:00
?
接到電話說客戶端無法正常連接到數據庫,急忙趕到現場.
在停掉部分應用后,發現可以正常連接,再啟動應用.發現又無法連接.當連接數達到283后就無法登錄了.報ora-12500.
由于客戶方的系統為windows2000 2GB內存.
查看sga發現居然有1.4G.加上pga的大小1.5 G
這種情況下,如果連接數高了,資源吃緊,oracle無可用的內存.從而導致連接受限.將sga調整到900M后,問題解決.
我覺得的奇怪的是,為什么XX的數據庫(雖然是這個庫是小庫).也不應該用這么低的配置吧. ......
查看(105) 評論(0) 收藏 分享 圈子 管理
給用VCS的一個提醒
2007-01-08 00:00:00
?
給oracle rac升級,一定別忘了給oracle lib作同步了:
$ cp /opt/VRTSvcs/ops/lib/libskgxp92_64.so $ORACLE_HOME/lib/libskgxp9.so
$ cp /opt/ORCLcluster/lib/9iR2/libskgxn2_64.so $ORACLE_HOME/lib/libskgxn9.so
否則有可能運行了大半年甚至一年之后問題爆發出來你找不到北了。
查看(36) 評論(0) 收藏 分享 圈子 管理
RAC下trace暴漲診斷
2007-01-07 00:00:00
?
?
DB版本oracle 9207 rac
OS版本solaris 9
集群件Veritas cluster server 4.1
故障:
平均三秒鐘產生一個trace文件。Trace文件不斷增加,導致磁盤空間迅速減小
而alter中無任何錯誤信息,只有一行:
:
Thu Jan 4 11:34:53 2007
Errors in file /oracle_bin/rac9i/admin/XXrac/udump/XXrac2_ora_1942.trc
trace文件:
/oracle_bin/rac9i/admin/XXrac/udump/XXrac2_ora_1942.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /oracle_bin/rac9i/product
System name: SunOS
Node name: XXXXX
Release: 5.9
Version: Generic_118558-13
Machine: sun4u
Instance name: yjrac2
Redo thread mounted by this instance: 2
Oracle process number: 13
Unix process pid: 1942, image:
oracle@XXXXX
(TNS V1-V3)
*** SESSION ID:(85.7779) 2006-04-23 01:13:12.231
=================================
Begin 4031 Diagnostic Information
=================================
The following information assists Oracle in diagnosing
causes of ORA-4031 errors. This trace may be disabled
by setting the init.ora parameter _4031_dump_bitvec = 0
======================================
Allocation Request Summary Information
======================================
Current information setting: 00654fff
Dump Interval=300 seconds SGA Heap Dump Interval=3600 seconds
Last Dump Time=04/14/2030 14:07:45
Allocation request for: kglsim object batch
Heap: 380032950, size: 4032
******************************************************
HEAP DUMP heap name="sga heap(2,0)" desc=380032950
extent sz=0xfe0 alt=200 het=32767 rec=9 flg=-126 opc=0
parent=0 owner=0 nex=0 xsz=0x1
====================
Process State Object
====================
----------------------------------------
SO: 41f4ee448, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=13, calls cur/top: 41f6d3320/41f6d3320, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=20
holding 428c53f60 Child library cache level=5 child#=5
Location from where latch is held: kglobpn: child:: latch
Context saved from call: 6
state=busy
Process Group: DEFAULT, pseudo proc: 41f5cbab0
O/S info: user: orarac, term: UNKNOWN, ospid: 1942
OSD pid info: Unix process pid: 1942, image:
oracle@XXXXX(TNS
V1-V3)
=========================
User Session State Object
=========================
----------------------------------------
SO: 4204f6d48, type: 4, owner: 41f4ee448, flag: INIT/-/-/0x00
(session) trans: 0, creator: 41f4ee448, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 438fa8348, psql: 0, user: 0/SYS
O/S info: user: , term: , ospid: , machine:
program:
temporary object counter: 0
...No current library cache object being loaded
...No instantiation object
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksm_4031_dump()+186 CALL ksedst() 00000000B ? 000000000 ?
8 000000000 ? 103327258 ?
00000003E ?
FFFFFFFF7FFF7938 ?
ksmasg()+352 CALL ksm_4031_dump() 000103756 ? 380000030 ?
380032950 ? 000654FFF ?
103756000 ? 103751768 ?
kghnospc()+364 PTR_CALL 0000000000000000 1037519C8 ? 380000030 ?
000000FC0 ? 000000FC0 ?
380000078 ?
FFFFFFFF7FFFB138 ?
kghalo()+4156 CALL kghnospc() 1037519C8 ? 380032950 ?
000000000 ? 004000000 ?
102D430F8 ? 103519280 ?
kglsim_chk_objlist( CALL kghalo() 000000000 ?
)+340 FFFFFFFF7FFFB2A0 ?
1037519C8 ? 000001000 ?
428D58638 ? 000000000 ?
很明顯,由于4031造成的錯誤,但是為什么這么頻繁的4031錯誤產生呢?
通過v$resource_limit,我們發現以下情況:
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
--------------- ------------------- --------------- -------------------- --------------------
processes 479 490 1000 1000
sessions 486 506 1105 1105
enqueue_locks 165 468 13282 13282
enqueue_resources 179 343 5080 UNLIMITED
ges_procs 478 488 1001 1001
ges_ress 32169 59712 20754 UNLIMITED
ges_locks 29613 55281 32150 UNLIMITED
ges_cache_ress 1929 29346 0 UNLIMITED
ges_reg_msgs 1087 2301 2230 UNLIMITED
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
--------------- ------------------- --------------- -------------------- --------------------
ges_big_msgs 106 1183 2230 UNLIMITED
ges_rsv_msgs 0 0 1000 1000
gcs_resources 161607 189852 211052 211052
gcs_shadows 88914 99594 211052 211052
dml_locks 26 370 4860 UNLIMITED
temporary_table 0 2 UNLIMITED UNLIMITED
_locks
transactions 8 21 1215 UNLIMITED
branches 0 1 1215 UNLIMITED
cmtcallbk 0 1 1215 UNLIMITED
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
--------------- ------------------- --------------- -------------------- --------------------
sort_segment_lo 27 41 UNLIMITED UNLIMITED
cks
max_rollback_se 14 15 244 244
gments
max_shared_serv 0 0 20 20
ers
parallel_max_se 0 5 6 6
rvers
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
--------------- ------------------- --------------- -------------------- --------------------
22 rows selected.
我們可以發現ges_ress和ges_locks的當前分配數量已經超出了初始分配數據,最大分配數甚至超出了幾倍,
我們知道,當ges_ress和ges_locks超出初始分配的數量時,就會從shared_pool_size里面強行申請內存片。
超出的越多,當然就占用更多的內存區域。而這時候,數據庫的壓力又非常的大,從而不斷的產生4031錯誤。
這時候我們能做的就是將這兩個指標的數量控制在一定范圍之內。或者將他的初始分配值擴大,以及限制它的最高值。
可以使用兩個隱含參數來控制:_lm_locks和_lm_ress。如_lm_locks=(200000,200000) 。這么設定的意思是:初始值、最大值。
也就是說初始分配200000,最大也只能使用200000, 但是設置這個需要注意一點的是,會增加oracle使用內存的數量。
假如您使用了10G的sga.那么設定1000000的話大概就是多出1G.加上SGA的話。就是11G的內存了。因此在遇到這種情
況時一定要注意主機的內存情況,因為修改完該參數后,重啟實例時就會預分配內存。這個可以通過ipcs看出來。
查看(107) 評論(0) 收藏 分享 圈子 管理
RAC平臺無法創建任何對象解決一例
2006-12-30 00:00:00
?
Solaris5.9
集群件:veritas cluster server
文件系統:VXFS
oracle9207 on rac
接到電話,說在任何一個實例上無法創建任何數據庫對象。之前曾經發生過此類故障,客戶為重啟實例后解決。結果這幾天該故障重現,
查看到v$lock存在鎖定情況,通過查詢dba_waiters后發現sid為12的會話持有obj$表的鎖。發現在v$sql中的sql語句為:
select o.owner#,o.obj#,decode(o.linkname,null, decode(u.name,null,'SYS',u.name),o.remoteowner),
o.name,o.linkname,o.namespace,o.subname
from user$ u, obj$ o where u.user#(+)=o.owner# and o.type#=1
and not exists (select p_obj# from dependency$ where p_obj# = o.obj#)
order by o.obj# for update;
再通過與v$process關聯取得spid后,發現此會話為smon進程。而該進程直接持有了obj$的鎖。也就是說,兩個實例SMON互鎖,導致無法創建任何數據庫對象。但是為什么會造成互鎖呢?我們知道smon的作用是在實例啟時負責進行恢復工作。此外,還負責清除系統中不再使用的臨時段, 以及為數據字典管理的表空間合并相鄰的可用數據擴展,而在RAC中,一個實例的 SMON 進程能夠為出錯的 CPU 或 實例進行實例恢復。這我們的情況中,RAC并沒有出現實例crash. 當然不存在恢復。那么還有可能的情況就是清除臨時段了,通過查詢,我們發現運營商的DBA發生了一個嚴重的錯誤,那就是將temp表空間使用datafile來創建,而且并且是dict管理模式。這樣導致了兩個實例的smon相互對此臨時段進行清除以及合并,從而導致了互鎖,才導致后面出現的無法創建任何數據庫對象。解決方法,重建temp表空間,使用tempfile,以及創建為local方式的temp tablespace.然后重啟節點。另外一種方法就是使用
Event來禁止某個實例smon
Event=’ 10052 trace name context forever’;
目前我所采用的是第一種策略了.從半個月時間看來,數據庫運行正常.
查看(56) 評論(0) 收藏 分享 圈子 管理
診斷數據庫hang住一例
2006-12-30 00:00:00
?
機器:sun4u Sun Fire E4900
內存: 16GB
CPU: 4*1350
ORACLE:10.2.0.2.0 on RAC
群集軟件:sun cluster
操作系統:Solaris 5.9
癥狀描述:
接到移動報告說數據庫昨天在群發短信時hang住。下午出現同樣情況,并且這個過程中執行任何的sql語句都會hang住。
非常奇怪的是,在另一個節點中可以輕松的執行任何操作。但是在此節點中sqlplus都無法登錄。
趕到現場,收集hang住時間點的awrrpt.以下為top 5 event:
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
latch: library cache 1,031 48,359 46905 89.6 Concurrenc
CPU time 5,235 9.7
gc cr multi block request 4,889,209 824 0 1.5 Cluster
enq: TM - contention 1,181 568 481 1.1 Applicatio
gc current block 2-way 58,770 41 1 0.1 Cluster
可見到latch等待非常之高,對于library cache詳見本博客其他文章。
發現top session發現有一個sql語句在半個月前就開始執行,直到現在還未完成。找到該sql。其實只是一條無關緊要的select語句。
是查鎖的。將其會話kill之。然后手工運行這個sql.發現很快就hang. 難道是這個sql引起的問題?不可能啊。僅僅是查v$lock和v$session這兩個視圖而已。其實就不管是什么視圖,查詢半個月總歸是個大問題,并且總是在發短信的高峰期時hang住。
還好,抓到了一次現場。使用帶truss命令的sqlplus登錄。很快hang住。打開truss內容。并沒有發現什么異常。也就是說問題并沒有出在os級別或者軟件級別了。繼續看awrrpt報告。有一個地方引起懷疑,我發現有一個sql語句的version count達到37000之多:
Version
Count Executions SQL Id
-------- ------------ -------------
37,164 N/A g2k0nc8fbn337
這也就意味著oracle需要進行這么多的尋址。搜索到metalink,居然是個bug:5442957,原來是cursor_sharing=similar引起的bug.通過event 38056可解決。
晚上停掉數據庫,添加Event = '38056 trace name context forever, level 1',問題OK。
這三天以來一直未出現個hang的情況。
注意:在節點01出現問題時,節點02是沒有任何問題的。
查看(172) 評論(0) 收藏 分享 圈子 管理
ORACLE中的等待事件
2006-12-17 00:00:00
?
from
http://www.itpub.net/showthread.php?threadid=398220&pagenumber
=
表:非空閑等待事件的級別含義
Buffer busy wait 表示在等待對數據告訴緩存區的訪問,這種等待出現在會話讀取數據到buffer中或者修改buffer中的數據時,例如DBWR正在寫一些數據塊到數據文件的同時,其他進程需要去讀取相應的數據塊。同時也可能表示在表上設置的freelist太小,不能支持大量并發的INSERT操作。在v$session_wait視圖的p1子段值表示相關數據塊所在的文件號,p2表示文件上的塊編號。通過這些信息與dba_data_files和dba_extents的聯合查詢就可以很快定位到發生競爭的對象,從而近一步確定問題的根源。
Db file parrle write 于dbwr進程相關的等待,一般都代表了io能力出現問題。通常與配置的多個dbwr進程或者dbwr的io slaves個數有關,當然也可能意味這在設備上出現io競爭!
Db file scattered read 表示發生了于全表掃描的等待。通常意味者全表掃描過多,或者io能力不足,或者io競爭
Db file sequential read 表示發生了于索引掃描有關的等待。同樣意味者io出現了問題,表示io出現了競爭和io需求太多
Db file single write 表示在檢查點發生時與文件頭寫操作相關的等待。通常于檢查點同步數據文件時文件號的紊亂有關
Direct path read 表示于直接io讀相關的等待。當直接讀數據到pga內存時,direct path read出現。這種類型的讀請求典型的作為:排序io并行slave查詢或者預先讀請求等。通常這種等待于io能力或者io競爭有關
Direct path write 同上
Enqueue 表示于內部隊列機制有關的等待,例如保護內部資源或者組件的鎖的請求等,一種并發的保護機制
Free buffer inspected 表示在將數據讀入數據告訴緩沖區的時候等待進程找到足夠大的內存空間。通常這種等待表示數據緩沖區偏小。
Free buffer waits 表述數據告訴緩存區缺少內存空間。通常于數據高速緩沖區內存太小或者臟數據寫出太慢導致。在這種情況下,可以考慮增大高速緩存區或者通過設置更多的dbwr來解決
Latch free 表示某個鎖存器發生了競爭。首先應該確保已經提供了足夠多的latch數,如果仍然發生這種等待事件,應該進一步確定是那種鎖存器上發生了競爭(在v$session_wait上的p2子段表示了鎖存器的標號),然后判斷是什么引起了這種鎖存器競爭。大多數鎖存器競爭不是簡單的鎖存器引起的,而是于鎖存器相關的組件引起的,需要找到具體導致競爭的根本。例如,如果發生了library cache latch競爭,那么通常表示庫緩存配置不合理,或者sql語句書寫不合理,帶來了大量的硬分析。
Library cache load lock 表示在將對象裝入到庫高速緩沖區的時候出現了等待。這種事件通常代表者發生了負荷很重的語句重載或者裝載,可能由于sql語句沒有共享池區域偏小導致的。
Library cache lock 表示與訪問庫高速緩存的多個進程相關的等待。通常表示不合理的共享池大小。
Library cache pin 這個等待事件也與庫高速緩存的并發性有關,當庫高速緩存中的對象被修改或者被檢測的時候發生
Log buffer space 表示日志緩沖區出現了空間等待事件。這種等待事件意味者寫日志緩沖區的時候得不到相應的內存空間,通常發生在日志緩沖區太小或者LGWR進程寫太慢的時候。
Log file parallel write 表示等待LGWR向操作系統請求io開始直到完成io。在觸發LGWR寫的情況下入3秒,1/3,1MB、DBWR寫之前可能發生。這種事件發生通常表示日志文件發生了io競爭或者文件所在的驅動器較慢。
Log file single write 表示寫文件頭塊的時候出現了等待。一般都是發生在檢查點發生時。
Log file switch
(archiveing needed) 由于歸檔過慢造成日志無法進行切換而發生的等待。這種等待事件的原因可能比較多,最主要的原因是歸檔速度趕不上日志切換的速度。可能的原因包括了重作日志太了,重作日志組太少,歸檔能力低,歸檔文件發生了io競爭,歸檔日志掛起,或者歸檔日志放在了慢的設備上。
Log file switch
(checkpoint incomplete) 表示在日志切換的時候文件上的檢查點還沒有完成。一般意味者日志文件太小造成日志切換切換太快或者其他原因。
Log file sync 表示當服務進程發出commit或者rollbabk命令后,直到LGWR完成相關日志寫操作這段時間的等待。如果有多個服務進程同時發出這種命令,LGWR不能及時完成日志的寫操作,就有可能造成這種等待。
Transaction 表示發生一個阻賽回滾操作的等待
Undo segment extension 表示在等待回滾段的動態擴展。這表示可能事務量過大,同時也意味者可能回滾段的初始大小不是最優,minextents設置偏小。考慮減少事務,或者使用最小區數更多的回滾段。
查看(28) 評論(0) 收藏 分享 圈子 管理
10G中的ORA-3136
2006-12-09 00:00:00
?
經常看到alert日志中報出錯誤為:
WARNING: inbound connection timed out (ORA-3136)
后來找到metalink的方法,帖出來如下:
1.set INBOUND_CONNECT_TIMEOUT_ =0 in listener.ora
2. set SQLNET.INBOUND_CONNECT_TIMEOUT = 0 in sqlnet.ora of server.
3. stop and start both listener and database.
4. Now try to connect to DB and observe the behaviour
查看(33) 評論(0) 收藏 分享 圈子 管理
格式化輸出plan_table
2006-12-09 00:00:00
?
經常需要看一些sql的執行計劃,大的sql用autotrace就不太合適了,需要時間.而通常用plan_table直接輸出又不方便閱讀.這里給出一個腳本,方便自己,忘了自己在哪本書上看的了.
SQL> EXPLAIN PLAN
2 SET STATEMENT_ID='SQL1' FOR select * from dual;
已解釋。
?
SQL>SQL> select lpad('',2*(level-1))||level||'.'||nvl(position,0)||''||
2 operation||''||options||''||object_name||''||object_type
3 ||''||decode(id,0,statement_id||'cost='||position)||cost
4 ||''||object_node "query plan"
5 from plan_table
6 start with id=0 and statement_id='SQL1'
7 connect by prior id=parent_id
8 and statement_id='SQL1';
query plan-------------------------------------------------------------------------------
1.0SELECT STATEMENTSQL1cost=2.1TABLE ACCESSFULLDUAL
?
SQL>
在RAC環境中,我們經常發現一些類似于以下的參數:
rac1.instance_number=2
rac2.instance_number=1
*.db_block_size=8192
*.db_cache_size=25165824
這里的*號實際表示就是RAC中所有實例都使用!
?
更多文章、技術交流、商務合作、聯系博主
微信掃碼或搜索:z360901061

微信掃一掃加我為好友
QQ號聯系: 360901061
您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對您有幫助就好】元
