Opened 8 months ago

Closed 2 months ago

#1441 closed defect (fixed)

random systemtest failure

Reported by: dmisev Owned by:
Priority: critical Milestone: 9.4
Component: rasnet Version: development
Keywords: Cc: pbaumann, vmerticariu, bphamhuu
Complexity: Medium

Description

It seems like a completely random query every now and then fails in the systemtest. It's happened with a both update and select queries. The output from rasql is:

rasdaman error 0: The client failed to contact the server.

and there is no trace of the query in the server logs, so it looks like a random rasnet issue.

Attachments (1)

scaling_2.rasql.out (59 bytes) - added by dmisev 8 months ago.

Download all attachments as: .zip

Change History (7)

comment:1 Changed 8 months ago by dmisev

I'm not yet sure on how to go about reproducing this reliably and fixing it, but as a first step we can fix the systemtest at least:

@Bang: if a query fails, can you add a check on whether the output file is such an error, and retry the query if it is? The error in the output/query.out file is exactly (I attached the file)

rasdaman error 0: The client failed to contact the server.

Changed 8 months ago by dmisev

comment:2 Changed 8 months ago by dmisev

I think this error happens when the query is sent right at the time when a rasserver is being restarted (or that is one of the conditions):

rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.5981: Request: 'select scale( test_grey, [ 4, 3 ] ) from test_grey'...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.6004: parsing...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.6085: checking semantics...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.6168: evaluating...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.8343: ok, result type 'set <marray <char, [0:1023,0:632]>>', 1 element(s), total size 648192 bytes.
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.55990: Request: 'select c from RAS_COLLECTIONNAMES as c'...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56054: parsing...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56215: checking semantics...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56578: evaluating...
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.56609: ok, result type 'set <marray <char, [0:9]>>', 6 element(s), total size 60 bytes.
rasserver.6d8e53d1-f8f3-4019-9aa3-a144d93ed2d2.005857.log: [INFO] - 16/11/2016 14:09:25.75735: Exiting server process.

# rasql attempts to execute the query in this timeframe

rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140767: rasserver: rasdaman server v9.3.1-gfc2a3ec on base DBMS sqlite.
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140852:  Copyright 2003, 2004, 2005, 2006, 2007, 2008, 2009 Peter Baumann rasdaman GmbH. 
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140893: To obtain a list of external packages used, please visit www.rasdaman.org.
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140927: Server 8b0d3189-00ae-48ee-af6a-da6342431950 of type 
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140962: RPC, registered with prognum 0x1b5a
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.140992: , connecting to sqlite as '/opt/rasdaman/data//RASBASE'
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141019: Verifying rasmgr host name: localhost...
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141224: ok
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141257: Tile size set to : 4194304
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141280: PCTMin set to    : 2048
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141300: PCTMax set to    : 4096
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141318: IndexSize set to : 0
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141342: Default Tile Conf: [0:1023,0:1023]
rasserver.8b0d3189-00ae-48ee-af6a-da6342431950.012144.log: [INFO] - 16/11/2016 14:09:25.141363: Default Tiling   : AlignedTiling

So one way to test it is to set a low countdown parameter in rasmgr.conf and run a lot of queries perhaps.

comment:3 Changed 8 months ago by dmisev

Just found out that the -countdown option doesn't have any effect in rasnet actually..

#	-countdown n	server will automatically restart after n requests
#			(beware of long-running transactions!)
...
define srv N2 -host hifi -type n -port 7003 -dbh rasdaman_host
change srv N2 -countdown 0 -autorestart on -xp --timeout 30

comment:4 Changed 8 months ago by dmisev

I submitted a patch that increases the countdown parameter in rasmgr.conf which will hopefully fix these issues with the systemtest.

comment:5 Changed 6 months ago by bphamhuu

  • Milestone changed from 9.3 to 9.4

comment:6 Changed 2 months ago by dmisev

  • Resolution set to fixed
  • Status changed from new to closed

I think we haven't had this issue anymore.

Note: See TracTickets for help on using tickets.