Opened 9 years ago

Last modified 8 years ago

#1073 closed defect

Filestorage - divide tile files into subdirectories — at Version 6

Reported by: Dimitar Misev
Owned by: Dimitar Misev
Priority: critical
Milestone: 9.1.x
Component: relblobif
Version: development
Keywords:
Cc: Peter Baumann, Vlad Merticariu, Alex Dumitru
Complexity: Medium

Description (last modified by Dimitar Misev)

The flat directory organization of the tile files in $RASDATA does not scale, because a single directory eventually runs into filesystem limits. Tiles should therefore be distributed into subdirectories.

Currently all data is stored in a single directory $RASDATA, i.e. we have

$RASDATA
 |_ RASBASE
 |_ 1
 |_ 2
 |_ 3
 |_ ..


Proposed restructuring

$RASDATA
 |_ RASBASE
 |_ TILES
      |_ ..

How should TILES be structured? For reference, the maximum number of subdirectories per directory on the most common filesystems:

  • ext3 : 32,000
  • ext4 : unlimited in theory, but may be set to 64,000 by default
  • xfs : tested to millions and performance is not impacted
  • btrfs: similar to xfs
  • ntfs : 2^32-1 theoretically (same limit as number of files in a directory)

Between 10,000 and 100,000 files per directory seems like a reasonable range that is well supported across filesystems. Taking 100,000 files per directory and the ext3 limit of roughly 32,000 subdirectories gives a lower limit of about 3 billion tiles.

Single-level nesting

Distributing tiles at 100,000 per directory gives this organization:

$RASDATA
 |_ RASBASE
 |_ TILES
      |_ 0
      |  |_ 1
      |  |_ 2
      |  |_ 3
      |  |_ ...
      |   
      |_ 1
      |  |_ 100,000
      |  |_ 100,001
      |  |_ 100,002
      |  |_ ...
      |  
      |_ ...

The subdirectory index in TILES is dir_index = tile_index / 100,000 (integer division). The 100,000 can be a compile-time constant that is adjusted as necessary. A default of 2^16 or 2^17 is probably better, so that dir_index can be computed with a fast bit shift.
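A minimal sketch of the single-level mapping, assuming 2^16 tiles per directory and tile files named by their decimal tile index (as the diagram suggests). The names `TILES_PER_DIR_SHIFT` and `single_level_tile_path` are hypothetical and not part of rasdaman:

```python
from pathlib import PurePosixPath

# Assumption: 2**16 tiles per directory, so the division becomes a bit shift.
TILES_PER_DIR_SHIFT = 16


def single_level_tile_path(rasdata: str, tile_index: int) -> PurePosixPath:
    """Map a tile index to $RASDATA/TILES/<dir_index>/<tile_index>."""
    if tile_index < 0:
        raise ValueError("tile_index must be non-negative")
    # Equivalent to tile_index // 65536 for non-negative integers.
    dir_index = tile_index >> TILES_PER_DIR_SHIFT
    return PurePosixPath(rasdata) / "TILES" / str(dir_index) / str(tile_index)


print(single_level_tile_path("/rasdata", 1))        # /rasdata/TILES/0/1
print(single_level_tile_path("/rasdata", 200_000))  # /rasdata/TILES/3/200000
```

With a power-of-two constant the directory boundaries no longer fall on round decimal numbers (directory 1 starts at tile 65,536 rather than 100,000), which only matters for someone browsing the directories by hand.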

With 30,000 subdirs this gives us a "lower" limit of ~12 PB (with 4MB tile size).

Two-level nesting

$RASDATA
 |_ RASBASE
 |_ TILES
      |  dir1_index
      |_ 0
      |  |  dir2_index
      |  |_ 0
      |  |  |_ 1
      |  |  |_ 2
      |  |  |_ 3
      |  |  |_ ..
      |  |  
      |  |_ 1
      |  |  |_ 100,000
      |  |  |_ 100,001
      |  |  |_ ..
      |  |  
      |  |_ 2
      |  |_ ...
      |  |_ 32,767
      |   
      |_ 1
      |  |_ 32,768
      |  |_ 32,769
      |  |_ ...
      |  
      |_ ...

The subdirectory indexes in TILES are computed with integer division:

  • dir2_index = tile_index / max_files (100,000)
  • dir1_index = dir2_index / max_dirs (32,768)
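The two formulas above can be sketched as follows. Note that, as in the diagram, dir2_index is not reduced modulo max_dirs: the second-level directories under dir1_index = 1 start at 32,768. The function names are hypothetical; the constants are the values given in this ticket:

```python
from pathlib import PurePosixPath

MAX_FILES = 100_000  # tiles per second-level directory
MAX_DIRS = 32_768    # second-level directories per first-level directory


def two_level_indices(tile_index: int) -> tuple[int, int]:
    """Return (dir1_index, dir2_index) for a non-negative tile index."""
    if tile_index < 0:
        raise ValueError("tile_index must be non-negative")
    dir2_index = tile_index // MAX_FILES
    dir1_index = dir2_index // MAX_DIRS
    return dir1_index, dir2_index


def two_level_tile_path(rasdata: str, tile_index: int) -> PurePosixPath:
    """Map a tile index to $RASDATA/TILES/<dir1>/<dir2>/<tile_index>."""
    dir1_index, dir2_index = two_level_indices(tile_index)
    return (PurePosixPath(rasdata) / "TILES"
            / str(dir1_index) / str(dir2_index) / str(tile_index))


print(two_level_tile_path("/rasdata", 100_000))        # /rasdata/TILES/0/1/100000
print(two_level_tile_path("/rasdata", 3_276_800_000))  # /rasdata/TILES/1/32768/3276800000
```

Because tile indexes grow sequentially, both levels fill up gradually: a new second-level directory appears every 100,000 tiles, and a new first-level directory every 32,768 second-level directories.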


This sets a "lower" limit of ~400 EB (with 4MB tiles).
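The capacity figures quoted in this ticket (~12 PB for one level, ~400 EB for two) can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes, 1 EB = 10^18 bytes); `capacity_bytes` is a hypothetical helper:

```python
PB = 10**15
EB = 10**18
TILE_BYTES = 4 * 10**6  # 4 MB tiles, decimal units assumed


def capacity_bytes(files_per_dir: int, dirs_per_level: int,
                   levels: int, tile_bytes: int) -> int:
    """Total bytes addressable with `levels` levels of subdirectories."""
    return files_per_dir * dirs_per_level**levels * tile_bytes


single = capacity_bytes(100_000, 30_000, 1, TILE_BYTES)
double = capacity_bytes(100_000, 32_768, 2, TILE_BYTES)

print(single / PB)  # 12.0
print(double / EB)  # roughly 429
```

With binary units (4 MiB tiles, EiB) the two-level figure comes out slightly below 400, so "~400 EB" holds under either convention.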

Backwards compatibility

Rasdaman could support both structures (old and new) with a simple check at startup; from v10.0 on, the new structure can be enforced. update_db.sh can be executed to migrate existing data to the new directory structure.
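As an illustration only, the startup check and the migration could look roughly like the sketch below. It is not how rasdaman or update_db.sh actually work: it assumes the single-level layout with 100,000 tiles per directory, and that tile files in the old layout are regular files named by their decimal tile index directly under $RASDATA. Both function names are hypothetical:

```python
import os
import shutil

TILES_PER_DIR = 100_000  # assumed single-level bucket size


def _is_tile_name(name: str) -> bool:
    # Assumption: tile files are named by their decimal tile index.
    return name.isascii() and name.isdigit()


def uses_flat_layout(rasdata: str) -> bool:
    """True if tile files still sit directly in the data directory."""
    with os.scandir(rasdata) as entries:
        return any(e.is_file() and _is_tile_name(e.name) for e in entries)


def migrate_to_subdirs(rasdata: str) -> int:
    """Move flat tile files into TILES/<dir_index>/; return how many moved."""
    with os.scandir(rasdata) as entries:
        tiles = [e.name for e in entries
                 if e.is_file() and _is_tile_name(e.name)]
    for name in tiles:
        target_dir = os.path.join(rasdata, "TILES",
                                  str(int(name) // TILES_PER_DIR))
        os.makedirs(target_dir, exist_ok=True)
        shutil.move(os.path.join(rasdata, name),
                    os.path.join(target_dir, name))
    return len(tiles)
```

Non-numeric entries such as RASBASE are left untouched, and running the migration a second time is a no-op because no flat tile files remain.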

Change History (6)

comment:1 by Dimitar Misev, 9 years ago

Description: modified (diff)

comment:2 by Dimitar Misev, 9 years ago

Description: modified (diff)

comment:3 by Vlad Merticariu, 9 years ago

100,000 tiles / directory, with a limit of 32,000 directories and a 4MB tile size means ~12 PB maximum size. If somebody chooses smaller tiles, like 1MB, then we would have a limit of 3 PB.

I agree that we should avoid complexity and keep it simple, but in order not to worry about this I guess the limit should be in the order of EB.

What about adding 1 extra level of nesting (so in the subdir 0 of TILES, you can have 32,000 directories), which increases the limit to more than 100 EB?

Please correct me if there's anything wrong with my math.

comment:4 by Dimitar Misev, 9 years ago

Yes, true, although 30,000 subdirs is really a lower limit (ext3 is seriously outdated, no one will put PB on ext3). Can you work out a simple bucketing scheme for two levels so that both levels get gradually filled up with subdirs?

I just thought of network filesystems like NFS btw, is anyone familiar with these? Probably they have quite some limitations.

in reply to:  4 comment:5 by Dimitar Misev, 9 years ago

Replying to dmisev:

I just thought of network filesystems like NFS btw, is anyone familiar with these? Probably they have quite some limitations.

Seems like this is up to the underlying filesystem, so we can ignore it.

comment:6 by Dimitar Misev, 9 years ago

Description: modified (diff)