Wednesday, October 15, 2008

findstr a Windows alternative to grep

If you've been using grep all week only to come home to an PC running Windows, you're probably missing a part of your life when you leave your Linux workstation. If you're on a text file containing huge amounts of text, such as a log file you can get findstr to search for all lines with a particular word. Findstr does regular expressions too. It's got many of the options that grep does - printing filenames for matches, printing lines that do not match, and multiple file search. For more reference, you can take a look to the official documentation or here

Wednesday, September 17, 2008

JDBC url for oracle RAC

If you have to connect to oracle RAC (Real Application Cluster) using JDBC with thin driver, the classic url:
jdbc:oracle:thin:@<HOST>:1521:<SID>
doesn’t work and you get the error ORA – 12505.
Instead, you must use this url:
jdbc:oracle:thin:@(DESCRIPTION=(LOAD_BALANCE=on)
(ADDRESS=(PROTOCOL=TCP)(HOST=host1) (PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=host2) (PORT=1521))
(CONNECT_DATA=(SERVICE_NAME=service)))
If you have an oracle client, like toad, check the tsnames.ora for the correct values of SERVICE_NAME and host.

Sunday, June 22, 2008

Firewall dropping Oracle connections in WebSphere Connection Pool?

We recently switched one of our client’s WebSphere Application Server and WebSphere Portal based infrastructure to a new physical environment. During our testing we found that we were getting stale connection exceptions in SystemOut.log and in the application logs.
On digging a little deeper, our team tracked it down to the fact that the Cisco firewall was dropping our Oracle database connections after they had been inactive for a certain amount of time.
When our team discussed the issue with the network team, they were essentially told that all was good with the network infrastructure.
The application team tweaked some of the connection pool settings but that only helped alleviate the issue slightly and we started encountering performance issues. Eventually, we figured out a solution that works!

The solution that we ended up with was to work around this issue on the Oracle end. We modified the sqlnet.expire_time parameter in sqlnet.ora on the Oracle 10g server down from the default “30″ (30 minutes) to “1″ (1 minute).

Why this worked?
The parameter SQLNET.EXPIRE_TIME is used to specify the time interval, in minutes, to send a probe to verify that client/server connections are active. If the probe finds a terminated connection, or a connection that is no longer in use, it returns an error, causing the server process to exit. This parameter is primarily intended for the database server so that it can free up server-side resources that are not in use.
A side-effect of the probe is that there is TCP-IP activity between the client and the server each time the probe is executed, and the firewall designates the link as an active one. By lowering the probe interval down to a minute, we managed to fool the firewall into leaving the database connections in the connection pool alone, and not terminate them even though the client may not execute a query using the connection for an extended period of time.
I am sure you can tweak the value of this parameter to determine the optimal value for your firewall and situation; dropping down the sqlnet.expire_time to a minute is probably too radical and redundant in reality. Even though the probe generated by this setting is small, it still does generate network traffic that may be meaningful in some situations.

Sunday, May 11, 2008

The mysterious ORA-03111 error

Recently one of the applications that I developed started throwing exceptions, that had the following message:
SQL state [72000]; error code [1013]; ORA-03111: break received on communication channel
When I googled around, I couldn't come across anything useful. Sadly enough most of the sites just showed the documentation for that error, without any explanation from anyone experiencing that issues. So here you go, with the best possible explanation that I could come up with.

My application sets two things on the connection that is throwing this exception:
  • It sets the fetchSize to be 2500 rows
  • It sets the query timeout to be 10 seconds
The database server and the application are separated over a long latency network (actually there is a NetEm box that emulates the long latency between these two boxes) which has a latency characteristic of 50+/-5 milliseconds. This is the whole setup.

It is important to understand how the timeout is handled by the Oracle client (in my case JDBC client). Once the query is successfully submitted, the client starts a clock for the timeout. Once the timeout is reached, the client sends an URG message to the Oracle server. The moment Oracle server receives this message, it knows that the client wants to cancel the operation that it was carrying on, no matter what stage the operation is in.

So take a couple of cases. Assume the operation is a SELECT query that will result in 10000 rows. If the Oracle server hasn't even started fetching the results, most likely the client's request would be responded immediately with an error code ORA-01013, which has a description like:
SQL state [72000]; error code [1013]; ORA-01013: user requested cancel of current operation
But imagine the server has fetched the rows and in the process of pumping the resultset back to the client. If the client requests the Oracle server to cancel the operation while still there is pending data in the socket to be delivered, it just adds the ORA-03111 packet at the end of the pending packets and lets the client knows that the operation has been cancelled while there is pending data to be delivered.

Look at the tcpdump output below:
23:13:08.613007 IP jdbc_client.48681 > orcle_server.1521: P 2543:3174(631) ack 2342 win 11908
....
23:13:18.635068 IP jdbc_client.48681 > orcle_server.1521: P 3174:3175(1) ack 265693 win 65535 urg 1
....
23:13:20.472561 IP orcle_server.1521 > jdbc_client.48681: P 398520:398615(95) ack 3186 win 65535
0x0000: 0015 c5ec 12a8 0021 1c1d c0c3 0800 4500 .......!......E.
0x0010: 0087 5023 0000 3406 bad6 c0a8 fd9a c0a8 ..P#..4.........
0x0020: fc8b 05f1 be29 a792 459f a091 06cc 5018 .....)..E.....P.
0x0030: ffff 67c2 0000 005f 0000 0600 0000 0000 ..g...._........
0x0040: 0402 04e3 0203 f500 0001 0300 0300 0000 ................
0x0050: 0000 0000 0000 0000 0000 0001 0100 0000 ................
0x0060: 0033 4f52 412d 3033 3131 313a 2062 7265 .3ORA-03111:.bre
0x0070: 616b 2072 6563 6569 7665 6420 6f6e 2063 ak.received.on.c
0x0080: 6f6d 6d75 6e69 6361 7469 6f6e 2063 6861 ommunication.cha
0x0090: 6e6e 656c 0a nnel.

Pay special attention to the times when the SELECT query was sent (21:13:08) and when the cancel request as an URG packet was sent (21:13:18), and when the Oracle sends the last TNS packet that has the error code ORA-03111 (21:13:20).

The cancel request as an URG packet was sent after 10 seconds because as I mentioned earlier my query timeout is 10 seconds.

So now the million dollar question: What should I do if I am facing this issue in my application?

Follow these simple steps:
  • First make sure that your query can be completed within the timeout that you have specified. If you consistently face this exception, try increasing your timeout.
  • That might help to get rid of the exception, but not the root cause. The root cause usually is a database that is not optimized for the query that you are executing or a bad network.
  • To find out if its the database that is the issue, try executing the same query in a host closer to the network. Or try executing the same query hitting the database from a different network. If you are convinced the database is the issue, try to tune it.
  • To find if it is the network that is having the issue, try to do a tcpdump and analyze if there are any out of order deliver of packets. Or dropped packets. If yes, then try to fix the network.
In my case, it turned out to be the bad configuration in the NetEm that was causing too many packets to be delivered out of order and too many duplicated packets. Remember I was introducing a variance of 10 ms (i.e. my packets could be delayed anywhere from 45 ms to 55 ms, as per my configuration). In real cases, at least in a well maintained production network, the variance will not be more than 1 ms.
If by any chance: "algun Oracle Expert se apiada de mi" your collaboration will be greately appreciate it.