Opened 12 years ago

Closed 11 years ago

Last modified 9 years ago

#325 closed defect (fixed)

Rasdaman not cleaning up after segmentation fault

Reported by: Heinrich Stamerjohanns
Owned by: Dimitar Misev
Priority: major
Milestone: Future
Component: rasmgr
Version: 8.4
Keywords:
Cc: Peter Baumann
Complexity: Medium

Description

When an insert leads to a segmentation fault (a separate issue that will be handled in its own ticket), it is not possible to continue.

The following error message appears: rasdaman error 806: RasManager Error: Write transaction in progress, please retry again later.

rasql should be able to handle the error and at least abort the transaction.
One might want to consider libsigsegv to handle such problems.
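
As an illustration of the request above, here is a minimal sketch of a client that cleans up when it receives SIGSEGV. It uses plain POSIX `sigaction` rather than libsigsegv, so it is a swapped-in technique and not the approach taken in rasdaman. The names `g_transaction_open`, `abort_transaction_best_effort`, `install_segv_cleanup`, `run_faulty_client` and the exit code 42 are all hypothetical, and the "abort" is only a placeholder for whatever call would release the write transaction in rasmgr.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Exit code used by the handler; the value is arbitrary (assumption).
constexpr int kSegvExitCode = 42;

// Hypothetical flag: set while the client holds a write transaction.
static volatile sig_atomic_t g_transaction_open = 0;

// Hypothetical stand-in for asking rasmgr to release the write transaction.
// A real client would be limited to async-signal-safe calls here.
static void abort_transaction_best_effort() {
    static const char msg[] = "segfault caught: aborting open transaction\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    g_transaction_open = 0;
}

// Signal handler: clean up, then terminate without returning
// into the code that faulted.
static void on_segv(int /*signum*/) {
    if (g_transaction_open) {
        abort_transaction_best_effort();
    }
    _exit(kSegvExitCode);
}

void install_segv_cleanup() {
    struct sigaction sa = {};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
}

// Demo: run a "client" in a child process that opens a transaction and
// then crashes. Returns the child's exit code, or -1 on failure.
int run_faulty_client() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        install_segv_cleanup();
        g_transaction_open = 1;
        raise(SIGSEGV);  // simulate the segmentation fault
        _exit(0);        // not reached: the handler exits first
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the handler works, the child exits with the handler's exit code instead of being killed by the signal, which gives the parent a chance to tell a handled crash from an unhandled one.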

Change History (16)

comment:1 by Dimitar Misev, 11 years ago

Cc: Peter Baumann added
Milestone: Future
Status: new → accepted

That's a very good idea, I didn't know about libsigsegv.

comment:2 by Dimitar Misev, 11 years ago

I submitted a patch that uses libsigsegv to catch and handle segfaults in rasql, rasimport and raserase.

libsigsegv-dev is not a mandatory requirement for compiling rasdaman; I made it optional via appropriate #ifdefs, but I will include it in our short installation guide and on the wiki.
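
For readers unfamiliar with this pattern, an optional dependency guarded by #ifdefs might look roughly like the sketch below. The macro name `HAVE_LIBSIGSEGV` and the functions `segv_handler` and `install_segfault_handler` are assumptions made for illustration, not names taken from the patch. The only library call used is `sigsegv_install_handler` from `<sigsegv.h>`, which returns 0 on success.

```cpp
#include <unistd.h>

#ifdef HAVE_LIBSIGSEGV  // assumed macro name, e.g. set by the build system
#include <sigsegv.h>

// Handler in the shape libsigsegv expects: report the fault and exit.
static int segv_handler(void* /*fault_address*/, int /*serious*/) {
    static const char msg[] = "segmentation fault caught, cleaning up\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    _exit(1);
}
#endif

// Installs the handler when libsigsegv support was compiled in.
// Returns true if a handler is active, false otherwise.
bool install_segfault_handler() {
#ifdef HAVE_LIBSIGSEGV
    return sigsegv_install_handler(&segv_handler) == 0;
#else
    return false;  // built without libsigsegv: segfaults are not caught
#endif
}
```

Built without the macro, the program still compiles and runs; it simply loses the cleanup on a crash.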

comment:3 by Peter Baumann, 11 years ago

This sounds like an extremely useful feature. Is there any reason not to make it a regular requirement?

comment:4 by Dimitar Misev, 11 years ago

No, I don't think so; from what I could see, it's a pretty standard library on all Linux distributions. I can make it a regular requirement in a follow-up patch.

comment:5 by Dimitar Misev, 11 years ago

I had another idea as well: catch segfaults in rasserver and notify rasmgr that the server has failed, so that it can retry the query evaluation once more with another rasserver. I'm not sure whether this is easily possible, but sometimes retrying the query after a server segfault is successful.

I already use this retry mechanism in the import scripts, but it would be nicer if it were integrated into the server.

comment:6 by Peter Baumann, 11 years ago

Nice idea indeed! But IMHO we should retry only once: retrying assumes that the abort is caused not by the query as such but by server state (such as memory leaks), in which case trying another server makes sense. Since servers have individual states anyway, after the second segfault we know it's meant to be that way.
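
The retry-once policy described here can be sketched as follows. This is only an illustration under stated assumptions: `Outcome`, `QueryResult`, `Server` and `evaluate_with_one_retry` are hypothetical names, and a server is modelled as a plain callable rather than a real rasserver process managed by rasmgr.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of sending a query to one server.
enum class Outcome { Ok, ServerCrashed };

struct QueryResult {
    Outcome outcome;
    std::string data;
};

// A server is modelled as a callable that evaluates a query (assumption).
using Server = std::function<QueryResult(const std::string&)>;

// Evaluate the query on the first server; if that server crashes, retry
// exactly once on the next server. A second crash is taken to mean the
// query itself is at fault, so no further servers are tried.
bool evaluate_with_one_retry(const std::vector<Server>& servers,
                             const std::string& query,
                             std::string& result) {
    const std::size_t kMaxAttempts = 2;  // first try plus one retry
    for (std::size_t i = 0; i < servers.size() && i < kMaxAttempts; ++i) {
        QueryResult r = servers[i](query);
        if (r.outcome == Outcome::Ok) {
            result = r.data;
            return true;
        }
    }
    return false;
}
```

Capping the attempts at two keeps a query that reliably crashes from taking down every available server in turn.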

comment:7 by Dimitar Misev, 11 years ago

Yes, that is exactly what I was thinking as well.

comment:8 by Dimitar Misev, 11 years ago

Owner: changed from Dimitar Misev to Nikolche Kolev
Status: accepted → assigned

Ok, reassigning to Nikolche to look at this when he has the time. To summarize, it would be ideal to catch segfaults in rasserver (source:server/rasserver_main.cc; for examples, see how I have done it in this patch) and

  • print stacktrace of where the segfault happened (as done in gdb for example)
  • notify rasmgr of the segfault, so that it can retry the query once more (but not more than one retry per query)
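
The first point in the list above could be approached as in the following sketch, which prints a stack trace from a SIGSEGV handler using glibc's `backtrace` and `backtrace_symbols_fd` from `<execinfo.h>`. It assumes a glibc-based system and is not the code from the patch; `install_backtrace_handler`, `crash_child_and_count_trace_bytes` and the exit code 3 are hypothetical names and values chosen for the demo. Notifying rasmgr is not shown.

```cpp
#include <execinfo.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

constexpr int kCrashExitCode = 3;          // arbitrary value (assumption)
static int g_trace_fd = STDERR_FILENO;     // where the trace is written

// Signal handler: write the current call stack to g_trace_fd and exit.
// backtrace_symbols_fd is used because it does not allocate memory.
static void segv_backtrace_handler(int /*signum*/) {
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, g_trace_fd);
    _exit(kCrashExitCode);
}

void install_backtrace_handler(int fd) {
    g_trace_fd = fd;
    // Call backtrace once up front so that any lazy library loading
    // happens here and not inside the signal handler.
    void* warmup[1];
    backtrace(warmup, 1);

    struct sigaction sa = {};
    sa.sa_handler = segv_backtrace_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
}

// Demo: crash a child process and count the bytes of stack trace it
// wrote into a pipe. Returns -1 if anything went wrong.
long crash_child_and_count_trace_bytes() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        close(fds[0]);
        install_backtrace_handler(fds[1]);
        raise(SIGSEGV);  // simulate the segmentation fault
        _exit(0);        // not reached: the handler exits first
    }
    close(fds[1]);
    long total = 0;
    char buf[256];
    ssize_t got;
    while ((got = read(fds[0], buf, sizeof(buf))) > 0) total += got;
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    bool handled = WIFEXITED(status) && WEXITSTATUS(status) == kCrashExitCode;
    return handled ? total : -1;
}
```

The trace contains raw addresses and, where available, symbol names; tools such as gdb or addr2line would still be needed to map them to source lines.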

comment:9 by Dimitar Misev, 11 years ago

comment:10 by Dimitar Misev, 11 years ago

Owner: changed from Nikolche Kolev to Dimitar Misev

Patch submitted

comment:11 by Dimitar Misev, 11 years ago

Resolution: fixed
Status: assigned → closed

comment:12 by Bang Pham Huu, 9 years ago

I still have this problem with rasdaman version 9.1.

It happens when I rerun the test cases in 'test_wcps' or 'test_wcs' of the rasdaman systemtest.

It only started happening yesterday; I had never seen this problem before.

test.sh: starting test at Thu Sep 10 08:36:19 CEST 2015
test.sh: 
test.sh: Testing service: wcs
rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.
rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.
test.sh: deleting coverage rgb from petascope... no such coverage found.
test.sh: done.
test.sh: importing rgb... rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.

test.sh: failed, repeating 1... rasdaman error 806: RasManager Error: Write transaction in progress, please retry again later.

test.sh: failed, repeating 2... rasdaman error 806: RasManager Error: Write transaction in progress, please retry again later.


comment:13 by Dimitar Misev, 9 years ago

Hi Bang, your problem is not related to this ticket. Please look in the rasdaman logs for more information on "rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS".

comment:14 by Bang Pham Huu, 9 years ago

Hi Dimitar,

I agree that it is not related to the segmentation fault, because it happened again even after I restarted the computer. So the error actually comes from RASBASE. I had to remove RASBASE and create it again, and did the same with Petascopedb (I did this yesterday, before your reply). After that, data could be imported normally.

Thanks,

comment:15 by Dimitar Misev, 9 years ago

But what was the problem in RASBASE? Next time, try to check the logs for more details, so that if it's a bug we can fix it.

comment:16 by Bang Pham Huu, 9 years ago

Hi Dimitar,

I found the log file (in $RMANHOME/log/); please see it below and try to analyse the cause of the problem. It looks like it could not query the sqlite_master table in RASBASE because the database was locked (maybe some other process was modifying data).

10/09/2015 09:30:31.565 INFO ok
Initializing control connections...informing rasmgr: server available...ok
Initializing job control...setting timeout to 300 secs...connecting to base DBMS...10/09/2015 09:30:31.566 INFO Connecting to /home/rasdaman/install/data/RASBASE
10/09/2015 09:30:31.566 FATAL SQL query failed: SELECT name FROM sqlite_master WHERE type='table' AND name='RAS_COUNTERS'
10/09/2015 09:30:31.566 FATAL Database error, code: 5, message: database is locked
10/09/2015 09:30:31.566 ERROR Error: encountered 206: Error in base DBMS, error number: 5
database is locked
10/09/2015 09:30:31.566 INFO rasserver terminated.