Opened 13 years ago
Closed 11 years ago
#133 closed defect (fixed)
Problem in Petascope to serve simultaneous request
Reported by: | Owned by: | Nikolche Kolev | |
---|---|---|---|
Priority: | major | Milestone: | 9.0.x |
Component: | petascope | Version: | 8.3 |
Keywords: | concurrence scalability | Cc: | abeccati, Peter Baumann, Piero Campalani, damiano, mantovani, sebrecht |
Complexity: | Medium |
Description
Ticket raised after a discussion on earthserver-services mailing list
http://groups.google.com/group/earthserver-services/browse_thread/thread/30e9011406db4853/
I have only one client that perform a series of parallel request to
Petascope.
Rasdaman support parallel read access and the problem seems related to
Petascope that is able to serve only one request at a time.
The same WCS requests, converted in the equivalent psql query and
submitted simultaneously to rasdaman, works fine.
In the log file that I have posted on the mailing list there are a lot messages like
"Preparing request for new connect by resetting old connection"
If I post a series of simultaneous requests, only the last request
received by petascope are served and it seems that the others are
cancelled.
I tried to use the Petascope installed at http://www.earthlook.org/ and
it can serve up to 10 simultaneous requests, mine only one.
Damiano
OS info:
lsb_release -a
LSB
Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: Scientific
Description: Scientific Linux release 6.0 (Carbon)
Release: 6.0
Codename: Carbon
gcc —version
gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13)
Copyright © 2010 Free Software Foundation, Inc.
java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.6) (rhel-1.43.1.10.6.el6_2-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
tomcat6 version
Server version: Apache Tomcat/6.0.24
Server built: December 5 2011 1239
Server number: 6.0.24.0
OS Name: Linux
OS Version: 2.6.32-220.4.1.el6.x86_64
Architecture: amd64
JVM Version: 1.6.0_22-b22
JVM Vendor: Sun Microsystems Inc.
Attachments (7)
Change History (50)
by , 13 years ago
Attachment: | test_parallel_requests.py added |
---|
comment:1 by , 13 years ago
comment:2 by , 13 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
by , 13 years ago
Attachment: | rasjQuery.java added |
---|
by , 13 years ago
Attachment: | concurrentRasj.java added |
---|
comment:3 by , 13 years ago
The bug resides in the rasj component of Rasdaman since it can't open multiple connections to the database. This is proven by the attached concurrentRasj.java program which tries to do simultaneous queries to a rasdaman server.
to compile:
javac -cp rasj.jar concurrentRasj.java rasjQuery.java
to run:
java -cp rasj.jar:./ concurrentRasj numThreads
If numThreads is set to one (one simultaneous query) the program runs without errors. But if numThreads is set to a number bigger than 1 (even 2) only one connection is opened and the rest of the connections fail to open, ie. the call to rasj.rnp.RasImplementation.openDB() fails and throws an exception.
comment:4 by , 13 years ago
That's not good, did you discover why and how it could be fixed? I guess there's some static connection object.
comment:5 by , 13 years ago
Not yet, I'm working on that. But I believe it's a little more complex since even running simultaneous queries as separate processes fails:
java -cp rasj.jar:./ concurrentRasj & java -cp rasj.jar:./ concurrentRasj numThreads
(ie. one simultaneous query per process) and in that case static variables shouldn't matter since each process only has one thread. I will continue working and see what I find.
comment:6 by , 13 years ago
Correction of the command above:
java -cp rasj.jar:./ concurrentRasj 1 & java -cp rasj.jar:./ concurrentRasj 1
comment:7 by , 13 years ago
Quite strange.. I thought it'd be a problem in petascope, as it seems RasUtil.executeRasqlQuery is a static method in petascope.
comment:8 by , 13 years ago
Indeed, that was my initial guess as well, but making it non static didn't solve the problem, so I tracked it further and found out I couldn't open multiple connections with rasj. I tried many approaches (using different processes) using only one instance of implementation, using multiple instances of implementation, all resulted in the same error.
comment:9 by , 13 years ago
Well, you guys you seem to know your code, the fix is indeed simple. The openDB method is calling the getFreeServer whith false as argument. This results in rasj allways using the same server (ie. the one running on port 7002). So concurrent queries fail since all use the same server. I will soon submit the patch. Calling getFreeServer with true as argument seems to fix the problem.
comment:10 by , 13 years ago
Thanks rodriguez for your work
Today I found time to try your fix.
It seems that the problem related to the parallel request is partially solved but the system is unstable.
For example if I perform 5 simultaneous request to petascope i don't always get the same response. Some requests return an xml file with the error:
(1) "Error converting WCPS query to rasql query"
It's strange because the rasdaman logs don't show any error for the request failed and the error isn't systematic. Sometime all 5 requests work but often not.
I attached the catalina.log file for the case:
5 parallel request;
- 4 response OK
- 1 response with error (1)
Also considers that the probability of reproducing the error (1) increase with the number of requests. With 10 rasdaman server running and 15 simultaneous request i obtain very few correct results (one or two) and rarely the error is
(2) RasManager Error: System overloaded, please try again later.
In conclusion it seems that the rasj.jar component is duplicated in different folder so the bugfix need to be propagated
The "~/rasdaman/java/dist/rasj.jar" accept parallel request but the following rasj.jar not:
~/rasdaman/applications/petascope/build/web/WEB-INF/lib/rasj.jar
md5sum = 5cb532a97781d10818cdc2600a475fa2
~/rasdaman/applications/petascope/lib/rasj.jar
md5sum = 5cb532a97781d10818cdc2600a475fa2
To compile petascope i used the rasj.jar at "~/rasdaman/java/dist/rasj.jar"
Feel free to ask me for additional info or log messages
Ciao
Damiano
comment:11 by , 13 years ago
Thanks for the kind feedback, I'm currently working on the second problem (server overloaded) since the fix should be easy and also on unifying the compilation process of Rasj and Petascope, both problems should be solved soon. I will then start testing the first problem, since there is some uncertainty at the moment regarding where the problem resides.
Regards,
Ernesto Rodriguez
comment:12 by , 13 years ago
5 requests (3 OK and 2 KO).
A little patch on test_parallel_requests.py:
--- 97 xml = "ERROR " %(r_status) +++ 97 xml = "ERROR %s" %(r_status) --- 116 data_Out = commands.getstatusoutput(rasdaman_request) +++ 116 data_Out = str(commands.getstatusoutput(rasdaman_request))
Piero
comment:13 by , 13 years ago
I updated the script "test_parallel_requests.py" to execute also wcps requests (abstract syntax)
Change at the beginning of the file the global variables hostname, port, coverage_id ecc with yours
Ciao
Damiano
USAGE:
python test_parallel_requests.py APP N
APP can be:
"ras" for RASDAMAN,
"wcs" for PETASCOPE wcs2.0,
"wcps" for PETASCOPE wcps abstract syntax
N is the number of requests generated
comment:15 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:16 by , 12 years ago
Keywords: | concurrence scalability added |
---|---|
Priority: | major → blocker |
Resolution: | fixed |
Status: | closed → reopened |
Ticket reopened upon notification by barboni@…
comment:18 by , 12 years ago
Cc: | added; removed |
---|---|
Owner: | changed from | to
Status: | reopened → assigned |
Damiano should provide an updated script to run the tests.
First we should integrate the provided test into the system tests (jin ?) to ensure no accidental reintroduction, then fix it.
comment:19 by , 12 years ago
There is a test, source:systemtest/testcases_open/test_rasjConcurrency but it's failing as far as I've noticed.
by , 12 years ago
Attachment: | TestParallelRequests.py added |
---|
New test script with python unittest library
comment:20 by , 12 years ago
Hi Alan,
I loaded the test script also in this track. I saw that there are other scripts attached to this ticket, probably is better to use the last version (TestParallelRequests.py), it's more essential and clean and make use of the python unittest library.
As mentioned in the ES track, i have an old petascope.war that works but I don't remember what was the related git commit. Tell me if you know a way to retrieve the version from a compiled petascope, so it will be possible to define a date range where the bag was reintroduced.
Damiano
comment:21 by , 12 years ago
Owner: | changed from | to
---|
comment:22 by , 12 years ago
Milestone: | → 8.4 |
---|
comment:23 by , 12 years ago
Up to now I can tell that the last working commit (I tried) is:
commit 071e0a0065d8781e07820ca13b80926e7ca6a39c Author: Alireza Rezaei Mahdiraji <a.rezaeim@jacobs-university.de> Date: Wed Aug 15 14:54:17 2012 +0200 Using constants in WCPS (part4)
comment:25 by , 12 years ago
That commit is working, so the introduction of the constant should not be the problem.
I just listed this temporary result, I'll continue later on discovering the watershed commit.
comment:26 by , 12 years ago
Cc: | added |
---|
First non-working commit:
commit 42a68acc5a7c7a59c5c6ceb37ac825a3b7ba2f55 Author: Mihaela Rusu <m.rusu@jacobs-university.de> Date: Wed Oct 3 17:29:02 2012 +0200 inserting metadata into wcs2 response
So, one problem seems to lie in the new getImageMetadata()
method, which throws the SQLException
("Error retrieving metadata for coverage").
The other error message ("Metadata for coverage '%' is not valid") originates in the DbMetadataSource.read()
method. It is called by WcsUtil.java
which however hides the details so it is not clear which exception is thrown by the former: only coverageId is propagated, whereas the incoming exception message is ignored.
The method DbMetadataSource.getImageMetadata()
in not called inside DbMetadataSource.read()
, however no other modifications to DbMetadataSource.java
where done in that patch, so I really guess the other error are anyway caused by the clash with the query for the metadata (ps_metadata).
I can have a look at this next week.
comment:27 by , 12 years ago
Very good, this bisect tool is really well thought
Next week is ok and it would get straight into 8.4
comment:28 by , 12 years ago
As a quick workaround, in case the gmlcov:metadata
element is not needed in the WCS responses, one could comment out the body of the petascope.core.DbMetadataSource.getImageMetadata()
method, forcing it to return an empty String "".
comment:29 by , 12 years ago
I can see the problem with the getImageMetadata, it's called and handled separately instead of being read into the GetCoverageMetadata.. Not sure if reading it in the same way as all the other coverage info (in DbMetadataSource.read(..)) will fix this problem, but we should anyway remove the getImageMetadata and move the code to the read() method.
comment:30 by , 12 years ago
This is what should be moved to the read() somewhere:
ResultSet r = s.executeQuery("SELECT metadata FROM ps_metadata WHERE coverage in " + "(SELECT id FROM ps_coverage WHERE name = '" + coverageId + "')"); if (r.next()) { return r.getString("metadata"); } else { return ""; }
comment:31 by , 12 years ago
@Piero: you may be getting those errors because you haven't run update_petascopedb.sh on your database, i.e. you don't have the new metadata table? And perhaps that is also causing your multithreading test to fail, i.e. this commit may not be the problematic one for this particular case, especially because I don't see any code that could cause multithreading issues.
comment:32 by , 12 years ago
Dimitar,
yes, my db is updated with ps_metadata (I don't think Petascope deploys otherwise).
Don't you get the errors with TestParallelRequests.py
as well?
Anyway, the solution you proposed is probably what is meant to be to fix the ticket.
comment:33 by , 12 years ago
Seems to work for me, but all the ports are closed here and I can't pull and make the patch against the latest rasdaman, so attaching it here and Piero you can redo it on the latest code.
by , 12 years ago
Attachment: | 0001-Fix-handling-of-simultaneous-requests-in-petascope-t.patch added |
---|
comment:34 by , 12 years ago
would be also nice to integrate the python test into the test_wcs systemtest, it seems to work really well (thanks Damiano!), or fix the test_rasjConcurrency.
comment:36 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:38 by , 11 years ago
Cc: | added; removed |
---|---|
Complexity: | → Medium |
Milestone: | 8.4 → 9.0 |
Priority: | blocker → major |
Resolution: | fixed |
Status: | closed → reopened |
comment:39 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | reopened → assigned |
Reopening as it seems the handling of simultaneous connections is quite unreliable in petascope, especially when there are more connections than there are rasservers available.
The test script to be used for testing is attachment:TestParallelRequests.py e.g. try with 10 rasservers and N_REQUEST = 30
comment:40 by , 11 years ago
Cc: | added |
---|
Furthermore rasserver dies (i.e. becomes unavailable), when there are too many connections exceeding the max_connections values in postgresql.conf
. In the rasdaman log I see SQL error codes 402 and 220, they should be handled better instead of rasserver becoming unavailable.
Same can happen with petascope as well, depending on which one exceeds the threshold first I guess.
comment:41 by , 11 years ago
UPDATE (see also #795)
Thread-safety seems damaged from v9.0.1 to v9.0.2.
Using attachment:TestParallelRequests.py (thanks damiano for the update for v9):
> grep ^WCS_request TestParallelRequests.py WCS_request = "/rasdaman/ows/wcs2?service=WCS&Request=GetCoverage&version=2.0.1&coverageid=mr&subset=i(0,10)&subset=j(0,1)" > $ grep ^N_REQUEST TestParallelRequests.py N_REQUEST = 50 > python TestParallelRequests.py > egrep "404|500" . -r | wc -l
At MEEO Srl v9.0.1 is stable.
Personally on my laptop v9.0.1 shoots a lot of 'RasdamanUnavailable: Unable to get a free rasdaman server'
, while v9.0.2 only fails < 10%.
comment:42 by , 11 years ago
Piero can you try to isolate the commit between 9.0.1 and 9.0.2 that causes this? But first I guess we have to reproduce the issue.
The script test_parallel_requests.py perform a series of simultaneous requests to petascope and save the receive xml files.
#Usage: ./test_parallel_request.py APP N
# APP can be "r" for RASDAMAN or "p" for PETASCOPE
# N is the number of requests generated
At the beginning of the file configure your host name, port, output file path, collection name ecc
Comment or change rows 124 and 125 to manage subset.