Opened 13 years ago

Closed 10 years ago

#133 closed defect (fixed)

Problem in Petascope to serve simultaneous request

Reported by: barboni@… Owned by: Nikolche Kolev
Priority: major Milestone: 9.0.x
Component: petascope Version: 8.3
Keywords: concurrence scalability Cc: abeccati, Peter Baumann, Piero Campalani, damiano, mantovani, sebrecht
Complexity: Medium

Description

Ticket raised after a discussion on earthserver-services mailing list

http://groups.google.com/group/earthserver-services/browse_thread/thread/30e9011406db4853/

I have only one client that perform a series of parallel request to
Petascope.

Rasdaman support parallel read access and the problem seems related to
Petascope that is able to serve only one request at a time.
The same WCS requests, converted in the equivalent psql query and
submitted simultaneously to rasdaman, works fine.

In the log file that I have posted on the mailing list there are a lot messages like
"Preparing request for new connect by resetting old connection"

If I post a series of simultaneous requests, only the last request
received by petascope are served and it seems that the others are
cancelled.

I tried to use the Petascope installed at http://www.earthlook.org/ and
it can serve up to 10 simultaneous requests, mine only one.

Damiano


OS info:
lsb_release -a
LSB
Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: Scientific
Description: Scientific Linux release 6.0 (Carbon)
Release: 6.0
Codename: Carbon

gcc —version
gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13)
Copyright © 2010 Free Software Foundation, Inc.

java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.6) (rhel-1.43.1.10.6.el6_2-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

tomcat6 version
Server version: Apache Tomcat/6.0.24
Server built: December 5 2011 1239
Server number: 6.0.24.0
OS Name: Linux
OS Version: 2.6.32-220.4.1.el6.x86_64
Architecture: amd64
JVM Version: 1.6.0_22-b22
JVM Vendor: Sun Microsystems Inc.

Attachments (7)

test_parallel_requests.py (5.5 KB ) - added by barboni@… 13 years ago.
rasjQuery.java (1.1 KB ) - added by Ernesto Rodriguez 13 years ago.
concurrentRasj.java (506 bytes ) - added by Ernesto Rodriguez 13 years ago.
catalina.out (38.7 KB ) - added by barboni@… 13 years ago.
5 request (4 OK and 1 KO)
test_parallel_requests.2.py (10.7 KB ) - added by barboni@… 13 years ago.
new version with wcps support
TestParallelRequests.py (4.5 KB ) - added by barboni@… 12 years ago.
New test script with python unittest library
0001-Fix-handling-of-simultaneous-requests-in-petascope-t.patch (9.3 KB ) - added by Dimitar Misev 12 years ago.

Download all attachments as: .zip

Change History (50)

by barboni@…, 13 years ago

Attachment: test_parallel_requests.py added

comment:1 by anonymous, 13 years ago

The script test_parallel_requests.py perform a series of simultaneous requests to petascope and save the receive xml files.

#Usage: ./test_parallel_request.py APP N
# APP can be "r" for RASDAMAN or "p" for PETASCOPE
# N is the number of requests generated

At the beginning of the file configure your host name, port, output file path, collection name ecc

Comment or change rows 124 and 125 to manage subset.

obj.setSubset("x", 0,200)
obj.setSubset("y", 0,200)

comment:2 by Dimitar Misev, 13 years ago

Owner: changed from Dimitar Misev to Ernesto Rodriguez
Status: newassigned

by Ernesto Rodriguez, 13 years ago

Attachment: rasjQuery.java added

by Ernesto Rodriguez, 13 years ago

Attachment: concurrentRasj.java added

comment:3 by Ernesto Rodriguez, 13 years ago

The bug resides in the rasj component of Rasdaman since it can't open multiple connections to the database. This is proven by the attached concurrentRasj.java program which tries to do simultaneous queries to a rasdaman server.

to compile:

javac -cp rasj.jar concurrentRasj.java rasjQuery.java

to run:

java -cp rasj.jar:./ concurrentRasj numThreads

If numThreads is set to one (one simultaneous query) the program runs without errors. But if numThreads is set to a number bigger than 1 (even 2) only one connection is opened and the rest of the connections fail to open, ie. the call to rasj.rnp.RasImplementation.openDB() fails and throws an exception.

comment:4 by anonymous, 13 years ago

That's not good, did you discover why and how it could be fixed? I guess there's some static connection object.

comment:5 by Ernesto Rodriguez, 13 years ago

Not yet, I'm working on that. But I believe it's a little more complex since even running simultaneous queries as separate processes fails:

java -cp rasj.jar:./ concurrentRasj & java -cp rasj.jar:./ concurrentRasj numThreads

(ie. one simultaneous query per process) and in that case static variables shouldn't matter since each process only has one thread. I will continue working and see what I find.

comment:6 by Ernesto Rodriguez, 13 years ago

Correction of the command above:

java -cp rasj.jar:./ concurrentRasj 1 & java -cp rasj.jar:./ concurrentRasj 1

comment:7 by Dimitar Misev, 13 years ago

Quite strange.. I thought it'd be a problem in petascope, as it seems RasUtil.executeRasqlQuery is a static method in petascope.

comment:8 by Ernesto Rodriguez, 13 years ago

Indeed, that was my initial guess as well, but making it non static didn't solve the problem, so I tracked it further and found out I couldn't open multiple connections with rasj. I tried many approaches (using different processes) using only one instance of implementation, using multiple instances of implementation, all resulted in the same error.

comment:9 by Ernesto Rodriguez, 13 years ago

Well, you guys you seem to know your code, the fix is indeed simple. The openDB method is calling the getFreeServer whith false as argument. This results in rasj allways using the same server (ie. the one running on port 7002). So concurrent queries fail since all use the same server. I will soon submit the patch. Calling getFreeServer with true as argument seems to fix the problem.

comment:10 by barboni@…, 13 years ago

Thanks rodriguez for your work

Today I found time to try your fix.
It seems that the problem related to the parallel request is partially solved but the system is unstable.

For example if I perform 5 simultaneous request to petascope i don't always get the same response. Some requests return an xml file with the error:

(1) "Error converting WCPS query to rasql query"

It's strange because the rasdaman logs don't show any error for the request failed and the error isn't systematic. Sometime all 5 requests work but often not.


I attached the catalina.log file for the case:
5 parallel request;

  • 4 response OK
  • 1 response with error (1)

Also considers that the probability of reproducing the error (1) increase with the number of requests. With 10 rasdaman server running and 15 simultaneous request i obtain very few correct results (one or two) and rarely the error is

(2) RasManager Error: System overloaded, please try again later.


In conclusion it seems that the rasj.jar component is duplicated in different folder so the bugfix need to be propagated

The "~/rasdaman/java/dist/rasj.jar" accept parallel request but the following rasj.jar not:

~/rasdaman/applications/petascope/build/web/WEB-INF/lib/rasj.jar
md5sum = 5cb532a97781d10818cdc2600a475fa2

~/rasdaman/applications/petascope/lib/rasj.jar
md5sum = 5cb532a97781d10818cdc2600a475fa2

To compile petascope i used the rasj.jar at "~/rasdaman/java/dist/rasj.jar"


Feel free to ask me for additional info or log messages

Ciao

Damiano

by barboni@…, 13 years ago

Attachment: catalina.out added

5 request (4 OK and 1 KO)

comment:11 by Ernesto Rodriguez, 13 years ago

Thanks for the kind feedback, I'm currently working on the second problem (server overloaded) since the fix should be easy and also on unifying the compilation process of Rasj and Petascope, both problems should be solved soon. I will then start testing the first problem, since there is some uncertainty at the moment regarding where the problem resides.

Regards,

Ernesto Rodriguez

comment:12 by Piero Campalani, 13 years ago

5 requests (3 OK and 2 KO).

A little patch on test_parallel_requests.py:

--- 97     xml = "ERROR " %(r_status)
+++ 97     xml = "ERROR %s" %(r_status)

--- 116    data_Out      =  commands.getstatusoutput(rasdaman_request)
+++ 116    data_Out      =  str(commands.getstatusoutput(rasdaman_request))

Piero

comment:13 by barboni@…, 13 years ago

I updated the script "test_parallel_requests.py" to execute also wcps requests (abstract syntax)

Change at the beginning of the file the global variables hostname, port, coverage_id ecc with yours

Ciao

Damiano

USAGE:
python test_parallel_requests.py APP N

APP can be:

"ras" for RASDAMAN,
"wcs" for PETASCOPE wcs2.0,
"wcps" for PETASCOPE wcps abstract syntax

N is the number of requests generated

by barboni@…, 13 years ago

Attachment: test_parallel_requests.2.py added

new version with wcps support

comment:14 by Dimitar Misev, 13 years ago

So what's the status, Ernesto? Is this ticket fixed?

comment:15 by Ernesto Rodriguez, 13 years ago

Resolution: fixed
Status: assignedclosed

comment:16 by abeccati, 12 years ago

Keywords: concurrence scalability added
Priority: majorblocker
Resolution: fixed
Status: closedreopened

Ticket reopened upon notification by barboni@…

comment:17 by Dimitar Misev, 12 years ago

Any more details? :)

comment:18 by abeccati, 12 years ago

Cc: a.beccati@… p.campalani@… added; beccati@… removed
Owner: changed from Ernesto Rodriguez to Dimitar Misev
Status: reopenedassigned

Damiano should provide an updated script to run the tests.

First we should integrate the provided test into the system tests (jin ?) to ensure no accidental reintroduction, then fix it.

comment:19 by Dimitar Misev, 12 years ago

There is a test, source:systemtest/testcases_open/test_rasjConcurrency but it's failing as far as I've noticed.

by barboni@…, 12 years ago

Attachment: TestParallelRequests.py added

New test script with python unittest library

comment:20 by barboni@…, 12 years ago

Hi Alan,

I loaded the test script also in this track. I saw that there are other scripts attached to this ticket, probably is better to use the last version (TestParallelRequests.py), it's more essential and clean and make use of the python unittest library.

As mentioned in the ES track, i have an old petascope.war that works but I don't remember what was the related git commit. Tell me if you know a way to retrieve the version from a compiled petascope, so it will be possible to define a date range where the bag was reintroduced.

Damiano

comment:21 by Piero Campalani, 12 years ago

Owner: changed from Dimitar Misev to Piero Campalani

comment:22 by abeccati, 12 years ago

Milestone: 8.4

comment:23 by Piero Campalani, 12 years ago

Up to now I can tell that the last working commit (I tried) is:

commit 071e0a0065d8781e07820ca13b80926e7ca6a39c
Author: Alireza Rezaei Mahdiraji <a.rezaeim@jacobs-university.de>
Date:   Wed Aug 15 14:54:17 2012 +0200

    Using constants in WCPS (part4)

comment:24 by Dimitar Misev, 12 years ago

Cc: Alireza added

Perhaps some constant has been misplaced?

comment:25 by Piero Campalani, 12 years ago

That commit is working, so the introduction of the constant should not be the problem.

I just listed this temporary result, I'll continue later on discovering the watershed commit.

comment:26 by Piero Campalani, 12 years ago

Cc: mrusu added

First non-working commit:

commit 42a68acc5a7c7a59c5c6ceb37ac825a3b7ba2f55
Author: Mihaela Rusu <m.rusu@jacobs-university.de>
Date:   Wed Oct 3 17:29:02 2012 +0200

    inserting metadata into wcs2 response

So, one problem seems to lie in the new getImageMetadata() method, which throws the SQLException ("Error retrieving metadata for coverage").

The other error message ("Metadata for coverage '%' is not valid") originates in the DbMetadataSource.read() method. It is called by WcsUtil.java which however hides the details so it is not clear which exception is thrown by the former: only coverageId is propagated, whereas the incoming exception message is ignored.

The method DbMetadataSource.getImageMetadata() in not called inside DbMetadataSource.read(), however no other modifications to DbMetadataSource.java where done in that patch, so I really guess the other error are anyway caused by the clash with the query for the metadata (ps_metadata).

I can have a look at this next week.

comment:27 by abeccati, 12 years ago

Very good, this bisect tool is really well thought ;)
Next week is ok and it would get straight into 8.4

comment:28 by Piero Campalani, 12 years ago

As a quick workaround, in case the gmlcov:metadata element is not needed in the WCS responses, one could comment out the body of the petascope.core.DbMetadataSource.getImageMetadata() method, forcing it to return an empty String "".

comment:29 by Dimitar Misev, 12 years ago

I can see the problem with the getImageMetadata, it's called and handled separately instead of being read into the GetCoverageMetadata.. Not sure if reading it in the same way as all the other coverage info (in DbMetadataSource.read(..)) will fix this problem, but we should anyway remove the getImageMetadata and move the code to the read() method.

comment:30 by Dimitar Misev, 12 years ago

This is what should be moved to the read() somewhere:

            ResultSet r = s.executeQuery("SELECT metadata FROM ps_metadata WHERE coverage in "
                     + "(SELECT id FROM ps_coverage WHERE name = '" + coverageId + "')");
            if (r.next()) {
                return r.getString("metadata");
            } else {
                return "";
            }

comment:31 by Dimitar Misev, 12 years ago

@Piero: you may be getting those errors because you haven't run update_petascopedb.sh on your database, i.e. you don't have the new metadata table? And perhaps that is also causing your multithreading test to fail, i.e. this commit may not be the problematic one for this particular case, especially because I don't see any code that could cause multithreading issues.

comment:32 by Piero Campalani, 12 years ago

Dimitar,
yes, my db is updated with ps_metadata (I don't think Petascope deploys otherwise).

Don't you get the errors with TestParallelRequests.py as well?

Anyway, the solution you proposed is probably what is meant to be to fix the ticket.

comment:33 by Dimitar Misev, 12 years ago

Seems to work for me, but all the ports are closed here and I can't pull and make the patch against the latest rasdaman, so attaching it here and Piero you can redo it on the latest code.

comment:34 by Dimitar Misev, 12 years ago

would be also nice to integrate the python test into the test_wcs systemtest, it seems to work really well (thanks Damiano!), or fix the test_rasjConcurrency.

comment:35 by Piero Campalani, 12 years ago

Tested your patch against on latest rasdaman: it works.

comment:36 by Dimitar Misev, 12 years ago

Resolution: fixed
Status: assignedclosed

comment:37 by barboni@…, 12 years ago

tested today
now petascope works

thanks and ciao

damiano

comment:38 by Dimitar Misev, 11 years ago

Cc: abeccati Peter Baumann Piero Campalani damiano mantovani added; a.beccati@… p.campalani@… Alireza mrusu removed
Complexity: Medium
Milestone: 8.49.0
Priority: blockermajor
Resolution: fixed
Status: closedreopened

comment:39 by Dimitar Misev, 11 years ago

Owner: changed from Piero Campalani to Nikolche Kolev
Status: reopenedassigned

Reopening as it seems the handling of simultaneous connections is quite unreliable in petascope, especially when there are more connections than there are rasservers available.

The test script to be used for testing is attachment:TestParallelRequests.py e.g. try with 10 rasservers and N_REQUEST = 30

comment:40 by Dimitar Misev, 11 years ago

Cc: sebrecht added

Furthermore rasserver dies (i.e. becomes unavailable), when there are too many connections exceeding the max_connections values in postgresql.conf. In the rasdaman log I see SQL error codes 402 and 220, they should be handled better instead of rasserver becoming unavailable.

Same can happen with petascope as well, depending on which one exceeds the threshold first I guess.

Last edited 11 years ago by Dimitar Misev (previous) (diff)

comment:41 by Piero Campalani, 10 years ago

UPDATE (see also #795)

Thread-safety seems damaged from v9.0.1 to v9.0.2.
Using attachment:TestParallelRequests.py (thanks damiano for the update for v9):

> grep ^WCS_request TestParallelRequests.py 
WCS_request    = "/rasdaman/ows/wcs2?service=WCS&Request=GetCoverage&version=2.0.1&coverageid=mr&subset=i(0,10)&subset=j(0,1)"
> $ grep ^N_REQUEST TestParallelRequests.py 
N_REQUEST      = 50
> python TestParallelRequests.py
> egrep "404|500" . -r | wc -l

At MEEO Srl v9.0.1 is stable.
Personally on my laptop v9.0.1 shoots a lot of 'RasdamanUnavailable: Unable to get a free rasdaman server', while v9.0.2 only fails < 10%.

comment:42 by Dimitar Misev, 10 years ago

Piero can you try to isolate the commit between 9.0.1 and 9.0.2 that causes this? But first I guess we have to reproduce the issue.

comment:43 by Dimitar Misev, 10 years ago

Resolution: fixed
Status: assignedclosed

I guess this is fixed.

Note: See TracTickets for help on using tickets.