Today I needed to extract the headers of multiple FASTA files in order to compare them. R isn't well suited to this, so I thought it would be a good job to test my Julia skills again. It turned out to be surprisingly straightforward with Julia’s FASTX package.
The script below does the following:
- Search a directory for files with the .faa extension
- Open each file
- Loop over each sequence and extract its header + description
- Write the file name, header and description to a delimited text file (the script uses ";" as the separator)
# Extract all fasta headers of faa files
# Define packages and functions
## Load FASTX
using FASTX
## Define function for directory search of a keyword
searchdir(path, key) = filter(x -> occursin(key, x), readdir(path))
# Define file locations
in_dir = "project/faa_files/"
outfile = open("faa_headers.csv", "w")
# Do the extraction
for file in searchdir(in_dir, ".faa")
    reader = open(FASTA.Reader, string(in_dir, file))
    for seq in reader
        identifier = FASTA.identifier(seq)
        description = FASTA.description(seq)
        write(outfile, string(file, ";", identifier, ";", description, "\n"))
    end
    close(reader)
end
close(outfile)
It worked well, but since I am still learning, I am not sure whether this is the idiomatic way to handle things in Julia.
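One possible tidier variant, offered as a sketch rather than a definitive answer: wrap the loop in a function, use `do` blocks so that files get closed even if something fails halfway, build paths with `joinpath`, and match the extension with `endswith` rather than `occursin` (which would also pick up a name like `x.faa.bak`). The function name `write_headers` and its `ext` keyword are hypothetical names I made up for illustration, and I am assuming that `open(FASTA.Reader, path)` can be used in the `do`-block form just like a regular file.

```julia
using FASTX

# Hypothetical helper (not part of the original script): writes one line per
# record, "file;identifier;description", for every file in `in_dir` whose
# name ends with `ext`. Returns the number of files processed.
function write_headers(in_dir, out_path; ext = ".faa")
    # endswith only matches the extension at the end of the file name
    files = filter(name -> endswith(name, ext), readdir(in_dir))
    open(out_path, "w") do out
        for file in files
            # Assumption: the do-block form closes the reader automatically,
            # even if an error is thrown while iterating.
            open(FASTA.Reader, joinpath(in_dir, file)) do reader
                for record in reader
                    println(out, file, ';', FASTA.identifier(record), ';',
                            FASTA.description(record))
                end
            end
        end
    end
    return length(files)
end
```

With the paths from the script above, this would be called as `write_headers("project/faa_files/", "faa_headers.csv")`. Because `joinpath` inserts the separator itself, the trailing slash on the directory no longer matters.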
As always, comments are very welcome!