Guidelines
OS optimizations
- filesystem - it's recommended to use the xfs filesystem, over ext4 or similar. Mount with noatime, e.g. in
/etc/fstab
:# / was on /dev/sda1 during installation UUID=212a6de3-515b-4e12-aa17-3e466f6966aa / ext4 noatime,errors=remount-ro 0 1
- scheduler - deadline for spinning disk, noop for ssd or high caching raid, e.g.
echo deadline > /sys/block/DEVICE/queue/scheduler
- read ahead - enable with below command, which can be made permanent by putting in
/etc/rc.local
for example:/sbin/blockdev --setra 4096 /dev/DEVICE
- write cache - put the below in
/etc/sysctl.conf
and load withsysctl -p
:vm.dirty_background_ratio = 5 vm.dirty_background_ratio = 10
- swappiness - set to a minimum, in order to force as little swapping as possible
vm.swappiness = 1
- shared memory - run attachment:calc_kernel_shm.sh and substituted the outputs into
/etc/sysctl.conf
:kernel.shmmax=VALUE kernel.shmall=VALUE
- extra security: force the OS to always keep a certain amount of RAM free for critical stuff (e.g. 64MB). This will prevent it from locking and crashing when an application has used up all the RAM and SWAP.
echo "vm.min_free_kbytes=65536" >> /etc/sysctl.conf sysctl -p
PostgreSQL
The default PostgreSQL configuration can be tuned for better performance with rasdaman. The parameters below should be adapted in the postgresql.conf
, typically found under /etc/postgresql
.
- max_connections - as long as a single user is using the database, this can be decreased to about 40. When many rasservers are started (e.g. 20+) and concurrent connections are expected this values should be increased though, or it may lead to random query failures as the threshold is exceeded (see comment:40:ticket:133)
- shared_buffers - should be 25%-33% of your total RAM
- work_mem -
(total RAM / connections) / 4..16
, but not lower then 128 MB. When you have 4 GB RAM and you are alone then 256 MB is fine. - maintenance_work_mem - total RAM / 16
- synchronous_commit - set to off when doing bulk ingestion
- wal_buffers - should be 16 MB; by default 1/32 from the shared buffer value is taken as long as it is lower then 16 MB otherwise it is 16 MB
- checkpoint_segments - 256 is a good start value when doing bulk ingestion
- checkpoint_completion_target - 0.0 (especially important for bulk ingestion)
- random_page_cost - 2.0 for spinning disk and 1.01 for ssd, high caching raid and when DB fits into RAM
- effective_cache_size - total RAM - shared_buffers value
- logging_collector - set to on if it isn't set already
- log_line_prefix - set to '%t ' to get timestamps in the logs
Benchmarks
Ingestion
Below are the results of ingesting one dataset using various tilings and postgres parameters.
Regular tiling
The dataset is 150 png rgb images of size 1024x608, so that each image is 1867776 bytes uncompressed, or 280 MB in total.
- Default postgres parameters:
Tiling scheme | Tile size | Import time | DB size before | DB size after | BLOBs size | BLOBs size2 |
0:31,0:31,0:149 | 460800 | 84 minutes | 184M | 19416M | 267 MB | 18 GB |
- Changed parameters in postgresql.conf: shared buffers = 1024 MB, temp_buffers = 8MB, fsync = off, synchronous_commit = off, wal_sync_method = fsync, full_page_writes = off, wal_buffers = 1MB, wal_writer_delay = 2000ms, checkpoint_segments = 32
Tiling scheme | Tile size | Import time | DB size before | DB size after | BLOBs size | BLOBs size2 |
0:31,0:31,0:149 | 460800 | 35 minutes | 184M | 19416M | 267 MB | 18 GB |
- Using tile cache
Tiling scheme | Tile size | Import time | DB size before | DB size after | BLOBs size | BLOBs size2 |
0:31,0:31,0:149 | 460800 | 48 seconds | 37M | 329M | 267 MB | 267 MB |
Regular tiling + VACUUM
The dataset is 50 png rgb images of size 1024x608, so that each image is 1867776 bytes uncompressed, or 89 MB in total.
- Changed parameters in postgresql.conf: shared buffers = 1024 MB, temp_buffers = 8MB, fsync = off, synchronous_commit = off, wal_sync_method = fsync, full_page_writes = off, wal_buffers = 8MB, wal_writer_delay = 10000ms, checkpoint_segments = 16
Tiling scheme | Tile size | Import time | DB size before | DB size after | BLOBs size | BLOBs size2 | BLOBs after VACUUM |
0:31,0:31,0:149 | 460800 | 7m 34s | 47M | 2464M | 267 MB | 1776 MB | 96MB |
PostgreSQL tips
Total size of BLOBs
BLOBs are stored in the pg_largeobject
in postgres. Each BLOB is divided into rows (pages) of 2048 bytes typically. More info here
Query below computes the space that all BLOBs take.
SELECT pg_size_pretty(count(loid) * 2048) FROM pg_largeobject;
Total size of BLOBs 2
SELECT tablename, pg_size_pretty(size) AS size_pretty, pg_size_pretty(total_size) AS total_size_pretty FROM (SELECT *, pg_relation_size(schemaname||'.'||tablename) AS size, pg_total_relation_size(schemaname||'.'||tablename) AS total_size FROM pg_tables) AS TABLES WHERE TABLES.tablename = 'pg_largeobject' ORDER BY total_size DESC;
Size of individual BLOBs
SELECT loid, pg_size_pretty(count(*) * 2048) FROM pg_catalog.pg_largeobject GROUP BY loid ORDER BY count(*) DESC;
PostgreSQL storage
First you need to get the object identifier for your database:
SELECT oid, datname FROM pg_database;
- When you look into the PostgreSQL data directory you will find the global directory.
- When you look into the global directory you will find one directory for each database named by the database oid.
- When you look into the database directory you will find one file for each table and index.
- If the table or index exceed 1 GB it will be split into more files and you will find one file for each GB.
- Table and index files are named by filenodes.
- Table and index filenodes can be different their object identifiers.
PostgreSQL use 8 kB pages for storing data in tables and indexes.
More informations about the size of your tables you can get by looking into pg_class.
Consider, that for getting correct data, ANALYZE should have run before.
SELECT relkind, relfilenode, oid, relname, reltuples, relpages, pg_size_pretty(relpages * 1024 * 8) as size FROM pg_class WHERE …;
- relkind i means index; r means relation which is table here.
- reltuple tells you how much tuple / rows exist in the table or index.
- relpages tells you how much pages are allocated for the table or index.
- In PostgreSQL version < 9.2 pg_size_pretty(INTEGER) will overflow when size is more then round about 2 GB because it exceed 4 byte INTEGER.
- In PostgreSQL version ≥ 9.2 you need to cast relpages to NUMERIC because the function changed to pg_size_pretty(NUMERIC).
- Tables and indexes starting with pg_ usually are tables created by the postgresql system
- pg_largeobject belongs to the tables created by postgresql
- oid for objects created by the postgresql system are lower then the database oid
- oid from your own tables and indexes always are bigger then the database oid
When you just want to list all your own tables and indexes plus the toast tables that were created for that tables / indexes:
... WHERE oid > DATABASE_OID;
When you want to get informations about pg_largobject:
... WHERE relname LIKE '%large%';
TOAST
- The Oversize Attribute Storage Technique
- Used on variable length data types (e.g. varchar, text, bytea)
- PostgreSQL creates pg_toast_ tables here which are empty as long as data won't exceed page size
- If tuple exceeds page size (8 kB by default)
- data will get compressed
- data will be broken into multiple rows
Size of RASBASE tables
SELECT tablename, pg_size_pretty(size) AS size_pretty, pg_size_pretty(total_size) AS total_size_pretty FROM (SELECT *, pg_relation_size(schemaname||'.'||tablename) AS size, pg_total_relation_size(schemaname||'.'||tablename) AS total_size FROM pg_tables) AS TABLES WHERE TABLES.schemaname = 'public' ORDER BY total_size DESC;
Reset WAL
Write-ahead logging is used to ensure data integrity. In postgres these logs are stored in pg_xlog and it's recommended to put them in a separate disk. To reset the logs:
service stop postgresql
sudo -u postgres pg_resetxlog $PGDATA
Remove orphaned BLOBs
Orphaned BLOBs are BLOBs that are stored but they are not referenced by oid from any table in the database. To clear them up use vacuumlo
vacuumlo RASBASE
Note, vacuumlo is an additional module made available through postgresql-contrib
VACUUM tables
Tuples that have been deleted or obsoleted are normally not physically deleted, and thus the space continues to be still used, until a VACUUM
is performed on the database or specific tables. VACUUM
alone doesn't free up the disk space for further use by the OS, VACUUM FULL
is necessary for this. In rasdaman it's most important to do periodic VACUUM on the BLOBs table:
- psql -d RASBASE -c "VACUUM FULL pg_largeobject"
or
- vacuumdb -d RASBASE -t "pg_largeobject" -f
Tile cache
A tile cache can be enabled in rasdaman during bulk ingestion, in order to work around inefficiencies of BLOB handling in postgres and for certain advanced, but expensive tiling schemes.
How to use
The cache can be enabled in rasmgr.conf by specifying a --cachelimit
parameter after the -xp
parameter:
--cachelimit <limit> (default: 0) specifies upper limit in bytes on using memory for caching
An example rasmgr.conf
configuration that allows the cache to use up to 1GB memory:
define dbh rasdaman_host -connect RASBASE define db RASBASE -dbh rasdaman_host define srv N1 -host HOST -type n -port 9000 -dbh rasdaman_host change srv N1 -countdown 20000000 -autorestart on -xp --timeout 30000000 --cachelimit 1073741824
When enabled, INSERT
and UPDATE
rasql statements can be issued in the normal way. At the end of the ingestion a COMMIT
command should be executed, to make sure that the cache is cleared and everything is flushed to disk. E.g.
rasql -q 'insert into coll ...' rasql -q 'update coll ...' rasql -q 'update coll ...' rasql -q 'update coll ...' # done with ingestion rasql -q 'commit'
To make most optimal use of the tile cache feature, it's important to understand how it works. The tile cache especially helps with regular tiling, when there are many partial updates at slices along some axis. The regular tiling section provides an introduction into this (_please read before continuing_), here we continue with the same example from that section and define a best practice for using the tile cache.
As we has been demonstrated by the regular tiling example, inserting a 1000x1000x1 data cube results in generating a 1000x1000x1000 cube, due to the specific tiling scheme used. If the tile cache is enabled and at least 1000x1000x1000 = 1GB of memory is allowed with --cachelimit
, this cube will fit in memory and will be cached. If less memory is allowed for the tile cache, then some of the 100x100x1000 tiles that don't fit in the memory will have to be written to disk. Assuming that the whole cube fits in main memory, then any slices updating the cube along the third dimension will be very fast, e.g.
update test as m set m[*:*,*:*,1] assign marray x in [0:999,0:999] values 1c update test as m set m[*:*,*:*,2] assign marray x in [0:999,0:999] values 1c ... update test as m set m[*:*,*:*,999] assign marray x in [0:999,0:999] values 1c
Before making an update at slice 1000, it is best to flush the current cube to disk with
rasql -q 'commit'
Because update test as m set m[*:*,*:*,1000] ..
will initiate creation of a completely new cube of tiles, namely [0:99,0:99,1000:1999]
, [0:99,100:199,1000:1999]
, etc.
Therefore, it is best to group updates by such tile cubes, manually commit when data in one tile cube is completely ingested, and avoid jumping from update in one cube to update in another cube (e.g. slice 1, then 1000, then 2, then 1001, etc with respect to the example above). It is best when --cachelimit
is set to at least the size of one such cube, but more is always better.
Important limitation
At the moment the tile cache should be used during bulk ingestion _ONLY_, and disabled afterwards. Leaving the tile cache on will cause selection queries to fail randomly. Furthermore, it is best to have only one server defined in rasmgr.conf, and -countdown
and --timeout
parameters set to very large values.
Attachments (1)
- calc_kernel_shm.sh (901 bytes ) - added by 11 years ago.
Download all attachments as: .zip