Filesets
Filesets are an important part of Moa: they define the input and output files of Moa jobs. In principle, a fileset is not much more than a collection of files. There are three different types:
Types
Type “set”
A “set” fileset is given a filesystem glob, checks the filesystem and returns a list of files that conform to the glob pattern. Type “set” filesets are typically used to define input of a Moa job. A “set” fileset can (currently) contain only one * wildcard. A correct example would be:
/data/sequences/*.fasta
This glob does exactly what you expect. Let's assume that there are three sequences in this directory; the set would then contain three filenames:
/data/sequences/input_01.fasta
/data/sequences/input_02.fasta
/data/sequences/input_03.fasta
More complex patterns and wildcards other than * are not supported (yet). Each Moa job can have at most one “set” fileset.
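The behaviour described above can be approximated with Python's standard glob module. This is only a minimal sketch, not Moa's own implementation: the function name expand_set is hypothetical, and a temporary directory stands in for /data/sequences.

```python
import glob
import os
import tempfile


def expand_set(pattern):
    """Return the sorted files matching a glob with at most one '*'."""
    # Mirror the documented restriction: one '*' at most, no other wildcards.
    if pattern.count("*") > 1 or any(c in pattern for c in "?["):
        raise ValueError("a 'set' pattern may contain only one '*' wildcard")
    return sorted(glob.glob(pattern))


# Build a throwaway directory with three sequences and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("input_01.fasta", "input_02.fasta", "input_03.fasta", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    files = expand_set(os.path.join(tmp, "*.fasta"))
    names = [os.path.basename(f) for f in files]
    print(names)
```

Only the three .fasta files end up in the set; notes.txt does not match the pattern.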
Type “map”
A “map” fileset converts a “set” fileset (the source) to a related fileset, typically to calculate the output of a Moa job. A “map” fileset must be linked to a “set” fileset and uses a glob-like pattern to convert the input “set” fileset to the resulting fileset. For example, if we take the example fileset defined above, and apply the following pattern:
./*.output
we would end up with the following “map” fileset:
./input_01.output
./input_02.output
./input_03.output
A potential pitfall is the following situation, where we have a “set” fileset defined as follows:
/data/sequences/input_*.fasta
This would result in exactly the same fileset as above. But if we now apply the same “map” pattern, the resulting output fileset would be:
./01.output
./02.output
./03.output
This is because the text matched by the * in the “set” glob is substituted for the * in the “map” pattern; the rest of the source filename is omitted. This can be useful: in a Blast job, for example, you could specify the following “map” pattern:
./blast_*.out
which would result in the following output:
./blast_01.out
./blast_02.out
./blast_03.out
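The substitution rule can be sketched in a few lines of Python. This is an illustration of the rule as described here, not Moa's actual code: the function name map_fileset is hypothetical, and the sketch assumes the “set” pattern contains exactly one * wildcard.

```python
import re


def map_fileset(set_pattern, files, map_pattern):
    """Map each source file to an output name by carrying over the '*' match."""
    # Turn the single-wildcard glob into a regex with one capture group.
    prefix, suffix = set_pattern.split("*")
    regex = re.compile(re.escape(prefix) + "([^/]*)" + re.escape(suffix) + r"\Z")
    mapped = []
    for path in files:
        match = regex.match(path)
        if match is None:
            raise ValueError(f"{path} does not match {set_pattern}")
        # Only the text matched by '*' is kept; the rest of the name is dropped.
        mapped.append(map_pattern.replace("*", match.group(1), 1))
    return mapped


files = [f"/data/sequences/input_{n}.fasta" for n in ("01", "02", "03")]

print(map_fileset("/data/sequences/*.fasta", files, "./*.output"))
print(map_fileset("/data/sequences/input_*.fasta", files, "./*.output"))
print(map_fileset("/data/sequences/input_*.fasta", files, "./blast_*.out"))
```

The three calls reproduce the three listings above: the same files yield different output names depending on how much of the filename the “set” wildcard covers.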
A “map” fileset may use a second wildcard in its pattern, for example:
*/blast_*.out
in which case the first wildcard is replaced with the original path. In the above example this would result in:
/data/sequences//blast_01.out
/data/sequences//blast_02.out
/data/sequences//blast_03.out
(Note: you might not want to do this.)
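The doubled slash in the listing above is what you get if the original path is substituted together with its trailing slash. The sketch below works under that assumption (it is not taken from Moa's source, and the function name map_with_path is hypothetical) and shows how such a result could be cleaned up with posixpath.normpath.

```python
import posixpath


def map_with_path(source, set_pattern, map_pattern):
    """Fill a two-wildcard map pattern: first '*' = path, second '*' = match."""
    # Recover the text that the single '*' of the set pattern matched.
    prefix, suffix = set_pattern.split("*")
    stem = source[len(prefix):len(source) - len(suffix)]
    # Assumption: the directory is substituted including its trailing slash.
    directory = posixpath.dirname(source) + "/"
    with_path = map_pattern.replace("*", directory, 1)
    return with_path.replace("*", stem, 1)


raw = map_with_path(
    "/data/sequences/input_01.fasta",
    "/data/sequences/input_*.fasta",
    "*/blast_*.out",
)
print(raw)                      # path with the doubled slash
print(posixpath.normpath(raw))  # same path with the slash collapsed
```

Most filesystems treat the doubled slash as a single separator, so the unnormalized path usually still works; normalizing mainly keeps names consistent when they are compared as strings.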
Type “single”
A “single” fileset is a very simple fileset that points to a single file. No wildcards are allowed.
Categories
Moa has to keep track (using Ruffus) of the in- and output of a job - it does this by tracking filesets. The category defines whether a file(set) is considered “input”, “output” or a “prerequisite”. In- and output speak for themselves. A prerequisite is also considered input (i.e. if it changes, the job will be repeated), but is typically kept out of the one-to-one file mapping that takes place for in- and output files.
Defining filesets
If you are developing a template, there is a whole section devoted to filesets. The following example is taken from the Moa BLAST template, and contains almost everything that you will come across:
filesets:
  db:
    category: prerequisite
    help: Blast database
    optional: false
    pattern: '*/*'
    type: single
  input:
    category: input
    help: Directory with the input files for BLAST, in Fasta format
    optional: false
    pattern: '*/*.fasta'
    type: set
  outgff:
    category: output
    help: GFF output files
    optional: true
    pattern: gff/*.gff
    source: input
    type: map
  output:
    help: XML blast output files
    category: output
    optional: true
    pattern: out/*.out
    source: input
    type: map
Most of this speaks for itself. A few things to note are:
- Both “outgff” and “output” are filesets of category “output” and type “map” that map to the same “input” fileset of type “set”. This is common practice. The map22 template even contains an example of a category “input”, type “map” fileset.
- If a fileset has a reasonable default pattern (typically the case for output filesets), it can be made optional.
- Please specify a good help text.
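The rules on this page can be turned into a small consistency check for a template's fileset definitions. The sketch below is not part of Moa: the function name check_filesets is hypothetical, and the definitions are written as a plain Python dict (with the help texts left out) rather than loaded from the template file.

```python
VALID_TYPES = {"set", "map", "single"}
VALID_CATEGORIES = {"input", "output", "prerequisite"}


def check_filesets(filesets):
    """Return a list of problems found in a dict of fileset definitions."""
    problems = []
    for name, spec in filesets.items():
        if spec.get("type") not in VALID_TYPES:
            problems.append(f"{name}: unknown type {spec.get('type')!r}")
        if spec.get("category") not in VALID_CATEGORIES:
            problems.append(f"{name}: unknown category {spec.get('category')!r}")
        if spec.get("type") == "map":
            # A "map" fileset must be linked to a "set" fileset.
            source = filesets.get(spec.get("source"))
            if source is None or source.get("type") != "set":
                problems.append(f"{name}: 'source' must name a 'set' fileset")
    # Each job can have at most one "set" fileset.
    if sum(1 for s in filesets.values() if s.get("type") == "set") > 1:
        problems.append("at most one fileset of type 'set' is allowed")
    return problems


blast = {
    "db": {"category": "prerequisite", "pattern": "*/*", "type": "single"},
    "input": {"category": "input", "pattern": "*/*.fasta", "type": "set"},
    "outgff": {"category": "output", "pattern": "gff/*.gff",
               "source": "input", "type": "map"},
    "output": {"category": "output", "pattern": "out/*.out",
               "source": "input", "type": "map"},
}

print(check_filesets(blast))
```

For the BLAST definitions above the check reports no problems; removing the source key from “outgff”, for instance, would produce a message about its missing “set” fileset.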