预备知识:
1)row cache lock事件
--内存share pool分为library cache,dictionary cache;而row cache lock对象分布在dictionary cache中,是由于对于字典缓冲的访问造成的;
--这类属于latch类的资源竞争,相当耗CPU,如果并发量大的话,容易down机
如果发现这个等待十分高,一般来说可能由于2种原因,一是共享池太小了,需要增加共享池,另外一种情况是SQL分析过于频繁,对于共享池的并发访问量过大。对于任何一种情况,绝大多数情况下加大共享池会有助于降低该等待,不过加大共享池的时候也要注意,并不一定所有的情况下增加共享池都会有明显的效果。特别是对于第二种情况,精确的分析十分重要。另外进一步分析,弄清楚哪些ROW CACHE的等待最为严重,有助于解决问题。
row cache lock事件的调优基于每一个队列锁类型的行为,常见的队列锁类型有:
-------DC_SEQUENCES:在使用序列的时候将发生该行缓冲队列锁。调优方式是检查序列是否指定了缓冲选项并确定这个缓冲值可以承受预期的并发insert操作。
Check for appropriate caching of sequences for the application requirements.
-------DC_USED_EXTENTS和DC_FREE_EXTENTS:该行缓冲队列锁可能在空间管理碰到表空间分裂或者没有足够区大小时发生。调优方法是检查表空间是否分裂了、区大小是否太小或者表空间是人工管理。
-------DC_TABLESPACES:该行缓冲队列锁会在分配新区是发生。如果区大小设置得过小,程序将经常申请新区,这将导致冲突。调优方法是快速地增加区的数量。
Probably the most likely cause is the allocation of new extents. If extent sizes are set low then the application may constantly be requesting new extents and causing contention. Do you have objects with small extent sizes that are rapidly growing? (You may be able to spot these by looking for objects with large numbers of extents). Check the trace for insert/update activity, check the objects inserted into for number of extents.
-------DC_OBJECTS:该行缓冲队列锁会在重编译对象的时候发生。当对象编译时将申请一个排他锁阻塞其他行为。通过检查非法对象和依赖关系来调优。
-------DC_SEGMENTS:该行缓冲队列锁会在段分配的时候发生,观察持有这个队列锁的会话在做什么。
This is likely to be down to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose.
-------DC_USERS:Deadlock and resulting “WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!” can occur if a session issues a GRANT to a user, and that user is in the process of logging on to the database.
-------DB_ROLLBACK_SEGMENTS:This is due to rollback segment allocation. Just like dc_segments,identify what is holding the enqueue and also generate errorstacks. Remember that on a multi-node system (RAC) the holder may be on another node and so multiple systemstates from each node will be required.
-------DC_AWR_CONTROL:This enqueue is related to control of the Automatic Workload Repository. As such any operation manipulating the repository may hold this so look for processes blocking these.
2)相关视图字段说明:
ROW CACHE LOCK基础说明
P1 – Cache Id
P2 – Mode Held
P3 – Mode Requested
mode 和REQUEST的取值:
KQRMNULL 0 null mode – not locked
KQRMS 3 share mode
KQRMX 5 exclusive mode
KQRMFAIL 10 fail to acquire instance lock
3)SQL查询
-----查询row cache lock等待
select * from v$session_wait where wait_class = 'row cache lock';
查出p1的值
-----查询rowcache 名称
根据p1的值来查询
select * from v$rowcache where cache# = &p1;
其他:
select event,p1 from v$session a where a.username is not null and a.status='ACTIVE';
4)dba_hist_active_sess_history视图
dba_hist_active_sess_history视图记录了内存中活动会话的历史信息,而动态性能视图V$ACTIVE_SESSION_HISTORY则记录了当前活动会话信息。 通过dba_hist_active_sess_history视图关联v$sqlarea和DBA_HIST_SNAPSHOT可以跟踪到某个时间段内的sql。当然能跟踪到的sql多少取决于v$sqlarea,毕竟只有还保留在v$sqlarea中的sql才能跟踪到。
AWR报告分析
数据库出现状况是8月30号下午3点到4点,于是拉取了两份AWR报告,一份是8月30号下午3点到4点,另一份是31号下午3点到4点,将两份报告比对着看,不同的地方应该就是有问题的地方。
1)发现top 5的前台事件中,排第一的不是DB CPU。默认正常情况下,DB CPU会排第一。
2)继续看Wait Events Statistics ,发现排第一也不是DB CPU,而是connection management call elapsed time,看名字感觉和数据库连接有关。
3)继续比对,发现dc_users的Pct Miss比较大,超过了20%
4)查看可疑的sql语句:
标红的sql执行速度很慢,有些异常。
SQL> select to_char(sample_time, 'YY-MM-DD HH24:MI:SS') sample_time,
2 instance_number,
3 sql_id,
4 P1,
5 event,
6 wait_class
7 from dba_hist_active_sess_history
8 where sample_time between
9 to_date('16-08-29 13:00:00', 'YY-MM-DD HH24:MI:SS') and
10 to_date('16-08-29 13:30:00', 'YY-MM-DD HH24:MI:SS')
11 and sql_id in ('bvtu633rnwrwv',
12 '4a6uhr508t0p6',
13 'fpm2zazkfqhy6',
14 'bhrcaykh5tzsw',
15 '9fxv1px768bd5')
16 order by 1
17 ;
SAMPLE_TIME INSTANCE_NUMBER SQL_ID P1 EVENT WAIT_CLASS
----------------- --------------- ------------- ---------- --------------------- ---------------------
16-08-29 13:00:59 1 4a6uhr508t0p6 10 row cache lock Concurrency
16-08-29 13:01:09 1 4a6uhr508t0p6 10 row cache lock Concurrency
16-08-29 13:01:19 1 4a6uhr508t0p6 10 row cache lock Concurrency
16-08-29 13:01:29 1 4a6uhr508t0p6 7 row cache lock Concurrency
16-08-29 13:01:39 1 bvtu633rnwrwv 10 row cache lock Concurrency
16-08-29 13:01:49 1 bvtu633rnwrwv 10 row cache lock Concurrency
16-08-29 13:01:59 1 bvtu633rnwrwv 10 row cache lock Concurrency
16-08-29 13:02:10 1 bvtu633rnwrwv 7 row cache lock Concurrency
16-08-29 13:02:20 1 fpm2zazkfqhy6 10 row cache lock Concurrency
16-08-29 13:02:30 1 fpm2zazkfqhy6 10 row cache lock Concurrency
16-08-29 13:02:40 1 fpm2zazkfqhy6 10 row cache lock Concurrency
16-08-29 13:02:41 2 bvtu633rnwrwv 7 row cache lock Concurrency
16-08-29 13:02:50 1 fpm2zazkfqhy6 7 row cache lock Concurrency
16-08-29 13:04:31 1 bhrcaykh5tzsw 10 row cache lock Concurrency
16-08-29 13:04:41 1 bhrcaykh5tzsw 10 row cache lock Concurrency
16-08-29 13:04:51 1 bhrcaykh5tzsw 7 row cache lock Concurrency
16-08-29 13:06:32 1 9fxv1px768bd5 10 row cache lock Concurrency
16-08-29 13:06:42 1 9fxv1px768bd5 10 row cache lock Concurrency
18 rows selected
根据上面的sql,发现sql导致的等待事件是row cache lock。再根据P1=7或者10,查出等待事件是发生在具体哪个类别上
SQL> select cache#,parameter from v$rowcache where cache# in ('7','10');
CACHE# PARAMETER
---------- --------------------------------
10 dc_users
7 dc_users
7 dc_users
7 dc_users
根据上面的sql,发现参数类别是dc_users,网上说dc_users是和用户用错误密码登陆有关:
In 11g there is an intentional delay between allowing failed logon attempts to retry. For some specific application types this can cause a problem as the row cache entry is locked for the duration of the delay . This can lead to excessive row cache lock waits for DC_USERS for specific users / schemas.
…
After 3 successive failures a sleep delay is introduced starting at 3 seconds and extending to 10 seconds max. During each delay the user X row cache lock is held in exclusive mode preventing any concurrent logon attempt as user X (and preventing any other operation which would need the row cache lock for user X).
验证一下,看是否在发生row cache lock期间,有用户登陆失败。
SQL> select username,
2 userhost,
3 to_char(timestamp, 'YY-MM-DD HH24:MI:SS') timestamp,
4 action_name
5 from dba_audit_trail
6 where action_name = 'LOGON'
7 and priv_used is null
8 and timestamp between
9 to_date('16-08-29 13:00:00', 'YY-MM-DD HH24:MI:SS') and
10 to_date('16-08-29 13:30:00', 'YY-MM-DD HH24:MI:SS');
USERNAME USERHOST TIMESTAMP ACTION_NAME
------------------------------ --------------------- ----------------- ----------------------------
MAPP_PLATFORM IDC-APP-02 16-08-29 13:03:51 LOGON
MAPP_PLATFORM IDC-APP-01 16-08-29 13:03:41 LOGON
MAPP_PLATFORM IDC-APP-02 16-08-29 13:03:31 LOGON
MAPP_PLATFORM IDC-APP-01 16-08-29 13:03:21 LOGON
MAPP_PLATFORM IDC-APP-02 16-08-29 13:03:11 LOGON
MAPP_PLATFORM IDC-APP-01 16-08-29 13:03:01 LOGON
MAPP_PLATFORM IDC-APP-02 16-08-29 13:02:51 LOGON
MAPP_PLATFORM IDC-APP-01 16-08-29 13:02:41 LOGON
MAPP_PLATFORM IDC-APP-02 16-08-29 13:02:31 LOGON
。。。。
发现用户MAPP_PLATFORM从8.14到8.31,一直以错误密码尝试登陆数据库,验证了以错误密码登陆会导致row cache lock同时sql执行慢的现象发生。
进一步查询登录审计信息:
select * from
(
select os_username,userhost,terminal,username,count(*) failures
from dba_audit_trail
where returncode = 1017 and timestamp between to_date ('2016-8-29 11:30:00','yyyy-mm-dd hh24:mi:ss') and to_date ('2016-8-29 13:30:00','yyyy-mm-dd hh24:mi:ss')
group by os_username,userhost,username,terminal
order by 5 desc );
returncode = 1017 这里的1017就是oracle内部定义的错误返回代码值。
实际上不管用户提供的密码是否正确,Oracle都会为新的connection分配一个shadow process,该服务进程为了进一步验证用户信息都不得不获取少量资源;如果以此为契机,即便在无法正常登录的情况下依然能在短期内造成实例僵死。