Sunday, May 11, 2008

The mysterious ORA-03111 error

Recently one of the applications that I developed started throwing exceptions, that had the following message:
SQL state [72000]; error code [1013]; ORA-03111: break received on communication channel
When I googled around, I couldn't come across anything useful. Sadly enough most of the sites just showed the documentation for that error, without any explanation from anyone experiencing that issues. So here you go, with the best possible explanation that I could come up with.

My application sets two things on the connection that is throwing this exception:
  • It sets the fetchSize to be 2500 rows
  • It sets the query timeout to be 10 seconds
The database server and the application are separated over a long latency network (actually there is a NetEm box that emulates the long latency between these two boxes) which has a latency characteristic of 50+/-5 milliseconds. This is the whole setup.

It is important to understand how the timeout is handled by the Oracle client (in my case JDBC client). Once the query is successfully submitted, the client starts a clock for the timeout. Once the timeout is reached, the client sends an URG message to the Oracle server. The moment Oracle server receives this message, it knows that the client wants to cancel the operation that it was carrying on, no matter what stage the operation is in.

So take a couple of cases. Assume the operation is a SELECT query that will result in 10000 rows. If the Oracle server hasn't even started fetching the results, most likely the client's request would be responded immediately with an error code ORA-01013, which has a description like:
SQL state [72000]; error code [1013]; ORA-01013: user requested cancel of current operation
But imagine the server has fetched the rows and in the process of pumping the resultset back to the client. If the client requests the Oracle server to cancel the operation while still there is pending data in the socket to be delivered, it just adds the ORA-03111 packet at the end of the pending packets and lets the client knows that the operation has been cancelled while there is pending data to be delivered.

Look at the tcpdump output below:
23:13:08.613007 IP jdbc_client.48681 > orcle_server.1521: P 2543:3174(631) ack 2342 win 11908
....
23:13:18.635068 IP jdbc_client.48681 > orcle_server.1521: P 3174:3175(1) ack 265693 win 65535 urg 1
....
23:13:20.472561 IP orcle_server.1521 > jdbc_client.48681: P 398520:398615(95) ack 3186 win 65535
0x0000: 0015 c5ec 12a8 0021 1c1d c0c3 0800 4500 .......!......E.
0x0010: 0087 5023 0000 3406 bad6 c0a8 fd9a c0a8 ..P#..4.........
0x0020: fc8b 05f1 be29 a792 459f a091 06cc 5018 .....)..E.....P.
0x0030: ffff 67c2 0000 005f 0000 0600 0000 0000 ..g...._........
0x0040: 0402 04e3 0203 f500 0001 0300 0300 0000 ................
0x0050: 0000 0000 0000 0000 0000 0001 0100 0000 ................
0x0060: 0033 4f52 412d 3033 3131 313a 2062 7265 .3ORA-03111:.bre
0x0070: 616b 2072 6563 6569 7665 6420 6f6e 2063 ak.received.on.c
0x0080: 6f6d 6d75 6e69 6361 7469 6f6e 2063 6861 ommunication.cha
0x0090: 6e6e 656c 0a nnel.

Pay special attention to the times when the SELECT query was sent (21:13:08) and when the cancel request as an URG packet was sent (21:13:18), and when the Oracle sends the last TNS packet that has the error code ORA-03111 (21:13:20).

The cancel request as an URG packet was sent after 10 seconds because as I mentioned earlier my query timeout is 10 seconds.

So now the million dollar question: What should I do if I am facing this issue in my application?

Follow these simple steps:
  • First make sure that your query can be completed within the timeout that you have specified. If you consistently face this exception, try increasing your timeout.
  • That might help to get rid of the exception, but not the root cause. The root cause usually is a database that is not optimized for the query that you are executing or a bad network.
  • To find out if its the database that is the issue, try executing the same query in a host closer to the network. Or try executing the same query hitting the database from a different network. If you are convinced the database is the issue, try to tune it.
  • To find if it is the network that is having the issue, try to do a tcpdump and analyze if there are any out of order deliver of packets. Or dropped packets. If yes, then try to fix the network.
In my case, it turned out to be the bad configuration in the NetEm that was causing too many packets to be delivered out of order and too many duplicated packets. Remember I was introducing a variance of 10 ms (i.e. my packets could be delayed anywhere from 45 ms to 55 ms, as per my configuration). In real cases, at least in a well maintained production network, the variance will not be more than 1 ms.
If by any chance: "algun Oracle Expert se apiada de mi" your collaboration will be greately appreciate it.