Opened 5 years ago

Closed 4 years ago

Last modified 2 years ago

#325 closed defect (fixed)

Rasdaman not cleaning up after segmentation fault

Reported by: hstamerjohanns Owned by: dmisev
Priority: major Milestone: Future
Component: rasmgr Version: 8.4
Keywords: Cc: pbaumann
Complexity: Medium

Description

When an insert leads to a segmentation fault (a separate issue, which will be handled in its own ticket), it is not possible to continue.

The following error message appears: rasdaman error 806: RasManager Error: Write transaction in progress, please retry again later.

rasql should be able to handle the error and at least abort the transaction.
One might want to consider libsigsegv to handle such problems.
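
The requested behaviour can be pictured with a small sketch: a supervising process notices that a worker died from SIGSEGV and still releases the write lock, so later requests are not stuck on error 806. This is not the libsigsegv approach and not rasdaman code. It uses plain Python on a POSIX system, where a child killed by a signal reports a negative return code. `WriteLock`, `run_worker` and `CRASH` are hypothetical names chosen for illustration.

```python
import signal
import subprocess
import sys


class WriteLock:
    """Hypothetical stand-in for the write-transaction flag a manager keeps."""

    def __init__(self):
        self.held = False

    def acquire(self):
        if self.held:
            raise RuntimeError(
                "Write transaction in progress, please retry again later."
            )
        self.held = True

    def release(self):
        self.held = False


def run_worker(lock, child_code):
    """Run child_code in a separate Python process while holding the lock.

    Returns "ok", "segfault" or "failed". The lock is released in every
    case, including when the child is killed by SIGSEGV.
    """
    lock.acquire()
    try:
        proc = subprocess.run([sys.executable, "-c", child_code])
    finally:
        lock.release()
    if proc.returncode == 0:
        return "ok"
    # On POSIX, a child terminated by signal N has return code -N.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return "failed"


# Child program that kills itself with SIGSEGV to simulate a crash.
CRASH = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
```

Calling `run_worker(WriteLock(), CRASH)` reports the crash while leaving the lock free for the next write transaction.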

Change History (16)

comment:1 Changed 4 years ago by dmisev

  • Cc pbaumann added
  • Milestone set to Future
  • Status changed from new to accepted

That's a very good idea, I didn't know about libsigsegv.

comment:2 Changed 4 years ago by dmisev

I submitted a patch that uses libsigsegv to catch and handle segfaults in rasql, rasimport and raserase.

libsigsegv-dev is not a mandatory requirement for compiling rasdaman; I made it optional via appropriate #ifdefs, but I will include it in our short installation guide and on the wiki.

comment:3 Changed 4 years ago by pbaumann

this sounds like an extremely useful feature - anything speaking against making it a regular requirement?

comment:4 Changed 4 years ago by dmisev

No, I don't think so; from what I could see it's a pretty standard library on all Linux distributions. I can make it a regular requirement in a follow-up patch.

comment:5 Changed 4 years ago by dmisev

I had another idea as well: to catch segfaults in rasserver and notify rasmgr that the server has failed, so that it can retry the query evaluation once more with another rasserver. Not sure if it's easily possible, but sometimes retrying the query after a server segfault is successful.

I do this retry mechanism in the import scripts, but it would be nicer if it's integrated into the server.

comment:6 Changed 4 years ago by pbaumann

nice idea indeed! But we should try it only once IMHO - retrying assumes that the abort is not associated with the query as such, but with server state (such as mem leaks), in which case trying another server makes sense. As servers will have individual states anyway, after the 2nd segfault we know it's meant to be that way.

comment:7 Changed 4 years ago by dmisev

Yes, that is exactly what I was thinking as well.

comment:8 Changed 4 years ago by dmisev

  • Owner changed from dmisev to nkolev
  • Status changed from accepted to assigned

Ok reassigning to Nikolce to look at this when he has the time. To summarize, it would be ideal to catch segfaults in rasserver (source:server/rasserver_main.cc, for examples see how I have done it in this patch) and

  • print stacktrace of where the segfault happened (as done in gdb for example)
  • notify rasmgr of the segfault, so that it can retry the query once more (but not more than one retry per query)
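
The retry rule discussed above (one retry on a different server, then give up) could be sketched as follows. This is only an illustration of the policy, not the submitted patch: `ServerCrashed` and `evaluate_with_one_retry` are hypothetical names, and servers are modelled as plain callables.

```python
class ServerCrashed(Exception):
    """Raised by a (hypothetical) server handle when the server process dies."""


def evaluate_with_one_retry(query, servers):
    """Try query on the first server; after a crash, retry once on the next.

    servers is a list of callables that take the query and return a result.
    A second crash is propagated, on the assumption that the query itself
    (and not the state of one server) is the cause.
    """
    if not servers:
        raise ValueError("no servers available")
    try:
        return servers[0](query)
    except ServerCrashed:
        if len(servers) < 2:
            raise
        # Exactly one retry, on a different server with its own state.
        return servers[1](query)
```

Any further servers in the list are deliberately ignored, so a query that crashes two servers never reaches a third.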

comment:9 Changed 4 years ago by dmisev

comment:10 Changed 4 years ago by dmisev

  • Owner changed from nkolev to dmisev

Patch submitted

comment:11 Changed 4 years ago by dmisev

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:12 Changed 2 years ago by bphamhuu

I still have this problem with Rasdaman version 9.1

It happens when I rerun the test cases in 'test_wcps' or 'test_wcs' of Rasdaman / systemtest.

This only started happening yesterday; I had never seen this problem before. The error is also really hard to understand, because it persists even after I restart the computer and try again.

test.sh: starting test at Thu Sep 10 08:36:19 CEST 2015
test.sh: 
test.sh: Testing service: wcs
rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.
rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.
test.sh: deleting coverage rgb from petascope... no such coverage found.
test.sh: done.
test.sh: importing rgb... rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.

test.sh: failed, repeating 1... rasdaman error 806: RasManager Error: Write transaction in progress, please retry again later.

test.sh: failed, repeating 2... rasdaman error 806: RasManager Error: Write transaction in progress, please retry again later.

Last edited 2 years ago by bphamhuu

comment:13 Changed 2 years ago by dmisev

Hi Bang, your problem is not related to this ticket, please look in the rasdaman logs for more information on the rasdaman error 206: Serialisable exception r_Ebase_dbms: error in base DBMS.

comment:14 Changed 2 years ago by bphamhuu

Hi Dimitar,

I agree that it is not related to the segmentation fault, because it happened again even after I restarted the computer. So it is actually an error from RASBASE. I had to remove RASBASE and create it again, and likewise Petascopedb (I did this yesterday, before your reply). After that, data could be imported normally.

Thanks,

comment:15 Changed 2 years ago by dmisev

But what was the problem in RASBASE? Try to check the logs next time for more details, so that, in case it's a bug, we can solve it.

comment:16 Changed 2 years ago by bphamhuu

Hi Dimitar,

I looked at the log file (in $RMANHOME/log/); see below and try to analyse what the cause of the problem is. It looks like it could not query the sqlite_master table in RASBASE while the database was locked (maybe some other process was modifying data).

10/09/2015 09:30:31.565 INFO ok
Initializing control connections...informing rasmgr: server available...ok
Initializing job control...setting timeout to 300 secs...connecting to base DBMS...10/09/2015 09:30:31.566 INFO Connecting to /home/rasdaman/install/data/RASBASE
10/09/2015 09:30:31.566 FATAL SQL query failed: SELECT name FROM sqlite_master WHERE type='table' AND name='RAS_COUNTERS'
10/09/2015 09:30:31.566 FATAL Database error, code: 5, message: database is locked
10/09/2015 09:30:31.566 ERROR Error: encountered 206: Error in base DBMS, error number: 5
database is locked
10/09/2015 09:30:31.566 INFO rasserver terminated.
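
The "database is locked" condition in this log (SQLite result code 5, SQLITE_BUSY) can be reproduced outside rasdaman with a minimal sketch. It assumes a throwaway SQLite file in the default rollback-journal mode, not a real RASBASE; `is_locked_for_readers` and `demo` are hypothetical helper names, and only the query text is taken from the log.

```python
import os
import sqlite3
import tempfile


def is_locked_for_readers(db_path):
    """Return True if a simple catalog query fails with 'database is locked'."""
    # timeout=0 makes the query fail at once instead of waiting for the lock.
    reader = sqlite3.connect(db_path, timeout=0)
    try:
        reader.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name='RAS_COUNTERS'"
        ).fetchall()
        return False
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc):
            return True
        raise
    finally:
        reader.close()


def demo():
    """Hold an exclusive lock in one connection and probe from another."""
    db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None gives explicit control over transactions.
    writer = sqlite3.connect(db_path, isolation_level=None)
    writer.execute("CREATE TABLE RAS_COUNTERS (name TEXT)")
    writer.execute("BEGIN EXCLUSIVE")  # simulates another process mid-write
    locked_during = is_locked_for_readers(db_path)
    writer.execute("ROLLBACK")
    locked_after = is_locked_for_readers(db_path)
    writer.close()
    return locked_during, locked_after
```

`demo()` probes the database once while the exclusive transaction is open and once after it has been rolled back, which mirrors the situation where another process holds RASBASE while rasserver starts up.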