2 changed files with 32 additions and 19 deletions
Binary file not shown.
@@ -1,51 +1,64 @@
|
||||
################### |
||||
### Task set: 1 ### |
||||
################### |
||||
|
||||
# Download raw read data from ENA based on previously extracted metadata into a dedicated directory. |
||||
# Search strategy (just FYI for documentation): |
||||
### Download raw read data from ENA based on previously extracted metadata (ena_search_output.txt) into a dedicated directory. |
||||
# The next line contains the search strategy that was used to retrieve the metadata from ENA (just FYI for documentation): |
||||
# curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(408172)%20AND%20geo_box1(53.5060%2C9.0508%2C66.0010%2C31.0703)%20AND%20country%3D%22Baltic%20Sea%22%20AND%20instrument_model%3D%22Illumina%20MiSeq%22%20AND%20library_layout%3D%22PAIRED%22%20AND%20library_selection%3D%22PCR%22%20AND%20library_strategy%3D%22AMPLICON%22%20AND%20library_source%3D%22METAGENOMIC%22&fields=accession%2Caltitude%2Cassembly_quality%2Cassembly_software%2Cbase_count%2Cbinning_software%2Cbio_material%2Cbroker_name%2Ccell_line%2Ccell_type%2Ccenter_name%2Cchecklist%2Ccollected_by%2Ccollection_date%2Ccollection_date_submitted%2Ccompleteness_score%2Ccontamination_score%2Ccountry%2Ccram_index_aspera%2Ccram_index_ftp%2Ccram_index_galaxy%2Ccultivar%2Cculture_collection%2Cdepth%2Cdescription%2Cdev_stage%2Cecotype%2Celevation%2Cenvironment_biome%2Cenvironment_feature%2Cenvironment_material%2Cenvironmental_package%2Cenvironmental_sample%2Cexperiment_accession%2Cexperiment_alias%2Cexperiment_title%2Cexperimental_factor%2Cfastq_aspera%2Cfastq_bytes%2Cfastq_ftp%2Cfastq_galaxy%2Cfastq_md5%2Cfirst_created%2Cfirst_public%2Cgermline%2Chost%2Chost_body_site%2Chost_genotype%2Chost_gravidity%2Chost_growth_conditions%2Chost_phenotype%2Chost_sex%2Chost_status%2Chost_tax_id%2Cidentified_by%2Cinstrument_model%2Cinstrument_platform%2Cinvestigation_type%2Cisolate%2Cisolation_source%2Clast_updated%2Clat%2Clibrary_construction_protocol%2Clibrary_layout%2Clibrary_name%2Clibrary_selection%2Clibrary_source%2Clibrary_strategy%2Clocation%2Clon%2Cmating_type%2Cnominal_length%2Cnominal_sdev%2Cparent_study%2Cph%2Cproject_name%2Cprotocol_label%2Cread_count%2Crun_accession%2Crun_alias%2Csalinity%2Csample_accession%2Csample_alias%2Csample_capture_status%2Csample_collection%2Csample_description%2Csample_material%2Csample_title%2Csampling_campaign%2Csampling_platform%2Csampling_site%2Cscientific_name%2Csecondary_sample_accession%2Csecondary_study_accession%2Csequencing_method%2Cserotype%2Cserovar%2Csex%2Cspecimen_voucher%2Csra_aspera%2Csra_bytes%2Csra_ftp%2Csra_galaxy%2Csra_md5%2Cstrain%2Cstudy_accession%2Cstudy_alias%2Cstudy_title%2Csub_species%2Csub_strain%2Csubmission_accession%2Csubmission_tool%2Csubmitted_aspera%2Csubmitted_bytes%2Csubmitted_format%2Csubmitted_ftp%2Csubmitted_galaxy%2Csubmitted_host_sex%2Csubmitted_md5%2Csubmitted_sex%2Ctarget_gene%2Ctax_id%2Ctaxonomic_classification%2Ctaxonomic_identity_marker%2Ctemperature%2Ctissue_lib%2Ctissue_type%2Cvariety&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" |
||||
# The download links are in the column "fastq_ftp". |
||||
|
||||
# Download the TARA Oceans prokaryotic bin set into a dedicated directory. |
||||
# https://www.genoscope.cns.fr/tara/localdata/data/BAC_ARC_MAGs-v1/FASTA_1888_MAGs_Bac_Arc.tar.gz |
||||
# To solve this task, you will need to combine several Linux commands. |
||||
# I suggest that you first break the task down into its smallest component steps and then find the corresponding command for each step. |
||||
# To get you started: The download links are in the column named "fastq_ftp". To extract only that column from the table, you will need to know its column number. How could you find this out using command line tools? |
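# One possible way to find the column number, sketched on a tiny made-up table.
# Assumptions (not from the original script): the file name demo.tsv, the example
# accession and links, and that paired-end links share one field separated by ';'.

```shell
# Hypothetical miniature table standing in for ena_search_output.txt (tab-separated).
tmpdir=$(mktemp -d)
printf 'run_accession\tfastq_ftp\tfastq_md5\n' > "$tmpdir/demo.tsv"
printf 'ERR000001\tftp.example.org/a_1.fastq.gz;ftp.example.org/a_2.fastq.gz\tabc;def\n' >> "$tmpdir/demo.tsv"

# Put each header field on its own line, then report the line number of the exact match.
col=$(head -n 1 "$tmpdir/demo.tsv" | tr '\t' '\n' | grep -n -x 'fastq_ftp' | cut -d ':' -f 1)
echo "fastq_ftp is column $col"

# Extract that column (skipping the header) and put each ';'-separated link on its own line.
links=$(tail -n +2 "$tmpdir/demo.tsv" | cut -f "$col" | tr ';' '\n')
echo "$links"
```

# The resulting list of links could then be passed to a download tool such as wget or curl.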
||||
|
||||
# Unzip the data. |
||||
### Download the TARA Oceans prokaryotic bin set into a dedicated directory. |
||||
# Here is the download link: https://www.genoscope.cns.fr/tara/localdata/data/BAC_ARC_MAGs-v1/FASTA_1888_MAGs_Bac_Arc.tar.gz |
||||
|
||||
### Unzip both the data from ENA and from TARA. |
||||
|
||||
|
||||
#################### |
||||
### Tasks set: 2 ### |
||||
#################### |
||||
|
||||
# List all fastq files in the ENA output directory and create a table that contains 3 columns: sample name (unique identifier) and absolute path to R1 and R2 file name. |
||||
### List all fastq files in the ENA output directory and create a table that contains 3 columns: sample name (unique sample identifier) and absolute path to R1 (*_1.fastq.gz) and R2 (*_2.fastq.gz) file names. |
||||
|
||||
# Move some R2 files to a different directory. |
||||
### Move some R2 files to a different directory. |
||||
|
||||
# Check which R2 files are now missing based on the sample names that are part of the R1 file names. |
||||
### Check which R2 files are now missing based on the sample names that are part of the R1 file names. |
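# A minimal sketch of one approach, using empty placeholder files.
# Assumptions (not from the original script): the sample names s1 and s2 and the
# temporary directory; only the *_1.fastq.gz / *_2.fastq.gz naming comes from the task.

```shell
# Hypothetical directory with two R1 files but only one matching R2 file.
tmpdir=$(mktemp -d)
touch "$tmpdir/s1_1.fastq.gz" "$tmpdir/s1_2.fastq.gz" "$tmpdir/s2_1.fastq.gz"

# Derive the sample name from each R1 file and test whether the expected R2 file exists.
missing=$(for r1 in "$tmpdir"/*_1.fastq.gz; do
  sample=$(basename "$r1" _1.fastq.gz)
  [ -e "$tmpdir/${sample}_2.fastq.gz" ] || echo "$sample"
done)
echo "$missing"
```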
||||
|
||||
# Count the number of sequences per sample. |
||||
### Count the number of sequences per sample. Remember that a fastq file contains 4 lines per sequence entry: sequence name (header), sequence, +, quality |
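# The 4-lines-per-entry rule can be turned into a count roughly as follows.
# Assumptions (not from the original script): the demo file sampleA_1.fastq.gz and
# its two reads; the sketch also assumes that no entry is wrapped over extra lines.

```shell
# Hypothetical gzipped fastq file with two entries (8 lines).
tmpdir=$(mktemp -d)
printf '@seq1\nACGT\n+\nIIII\n@seq2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/sampleA_1.fastq.gz"

# For each R1 file: decompress to stdout, count lines, divide by 4.
counts=$(for f in "$tmpdir"/*_1.fastq.gz; do
  sample=$(basename "$f" _1.fastq.gz)
  nlines=$(gzip -dc "$f" | wc -l)
  printf '%s\t%s\n' "$sample" "$((nlines / 4))"
done)
echo "$counts"
```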
||||
|
||||
|
||||
################### |
||||
### Task set: 3 ### |
||||
################### |
||||
|
||||
# Concatenate all TARA Oceans fasta files in the directory, but prepend the filename (without file extension) in the header of each sequence (space separated). |
||||
### Concatenate all TARA Oceans fasta files in the directory, but prepend the filename (without file extension) in the header of each sequence (space separated). Reminder: fasta files have 2 lines per entry (header identified by a '>' at the beginning of the line, followed by the sequence). Hint: remember loops, variables, and sed. |
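# A sketch of the loop/variable/sed idea on two made-up files.
# Assumptions (not from the original script): the file names binA.fa and binB.fa,
# the .fa extension, and the output name all_bins.fasta. File names containing
# '/' or '&' would need escaping before being used in the sed replacement.

```shell
# Hypothetical fasta files standing in for the TARA Oceans bins.
tmpdir=$(mktemp -d)
printf '>contig1\nACGT\n>contig2\nGGCC\n' > "$tmpdir/binA.fa"
printf '>contig1\nTTAA\n' > "$tmpdir/binB.fa"

# Loop over the files, strip directory and extension, and insert the name after '>'.
for f in "$tmpdir"/*.fa; do
  name=$(basename "$f" .fa)
  sed "s/^>/>${name} /" "$f"
done > "$tmpdir/all_bins.fasta"
cat "$tmpdir/all_bins.fasta"
```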
||||
|
||||
# In the above output, change the order of information in the fasta header: sequence name first, then file name of origin. Hint: google using back references with regular expressions. |
||||
### In the above output, change the order of information in the fasta header: sequence name first, then file name of origin. Hint: google using back references with regular expressions. |
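# Back references can be illustrated on a single made-up header.
# Assumptions (not from the original script): the header '>binA contig1' and the
# choice of sed -E (extended regular expressions); other tools would work as well.

```shell
# \1 captures the file name (everything up to the first space), \2 the rest of the header.
# Lines that do not start with '>' do not match and pass through unchanged.
swapped=$(printf '>binA contig1\nACGT\n' | sed -E 's/^>([^ ]+) (.+)$/>\2 \1/')
echo "$swapped"
```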
||||
|
||||
# Check for unusual characters in sequence (not ATCG). Get the header of those sequences if there are any. If not, manually modify a few sequences (try vi for that on the smallest file) for the exercise. |
||||
### Check for unusual characters in the sequence (not ATCG). Get the header of those sequences if there are any. If not, manually modify a few sequences (try vi for that on the smallest file) for the exercise. |
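# One way to report the headers of such sequences, sketched with awk on made-up data.
# Assumptions (not from the original script): the three demo sequences, and that only
# uppercase A, C, G, T count as usual. With sequences wrapped over several lines,
# a header could be printed more than once.

```shell
# Remember the most recent header; print it when a sequence line contains any other character.
flagged=$(printf '>seq1\nACGT\n>seq2\nACNNGT\n>seq3\nTTGA\n' |
  awk '/^>/ { header = $0; next } /[^ACGT]/ { print header }')
echo "$flagged"
```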
||||
|
||||
# Remove those unusual sequences from the multifasta file based on their name. |
||||
### Remove those unusual sequences from the multifasta file based on their name. |
||||
|
||||
# Remove everything after the first space in the fasta header. |
||||
### Remove everything after the first space in the fasta header. |
||||
|
||||
# Create a frequency table of sequence lengths from one of the fasta files. |
||||
### Create a frequency table of sequence lengths from one of the fasta files. |
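# A possible pipeline for the frequency table, shown on made-up sequences.
# Assumptions (not from the original script): the demo sequences and the output
# layout (length, then count, tab-separated); each sequence sits on a single line.

```shell
# Print the length of every non-header line, sort numerically, count identical lengths,
# then reorder the uniq -c output to "length<TAB>count".
freq=$(printf '>a\nACGT\n>b\nAC\n>c\nTTGA\n' |
  awk '!/^>/ { print length($0) }' | sort -n | uniq -c |
  awk '{ print $2 "\t" $1 }')
echo "$freq"
```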
||||
|
||||
|
||||
################### |
||||
### Task set: 4 ### |
||||
################### |
||||
|
||||
# Filter blast output based on percentage identity and query coverage: minimum similarity of 97% and full query coverage. |
||||
### Filter the example blast output (blastout.txt) based on percentage identity (column 3) and query coverage (column 13): minimum similarity of 97% and full query coverage. |
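# The filter can be sketched with awk on a made-up table with 13 columns.
# Assumptions (not from the original script): the demo rows, the '.' placeholders,
# tab-separated input, and that full query coverage is stored as 100 in column 13.

```shell
# Hypothetical blast-like table: only columns 3 (identity) and 13 (coverage) carry real values.
tmpdir=$(mktemp -d)
{
  printf 'asv1\tacc1\t99.2\t.\t.\t.\t.\t.\t.\t.\t.\t.\t100\n'
  printf 'asv2\tacc2\t95.0\t.\t.\t.\t.\t.\t.\t.\t.\t.\t100\n'
  printf 'asv3\tacc3\t98.5\t.\t.\t.\t.\t.\t.\t.\t.\t.\t80\n'
} > "$tmpdir/demo_blast.txt"

# Keep rows with at least 97% identity and 100% query coverage.
filtered=$(awk -F '\t' '$3 >= 97 && $13 == 100' "$tmpdir/demo_blast.txt")
echo "$filtered"
```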
||||
|
||||
# Subset the ASV table to only those sequences classified in the filtered blast output. |
||||
### Subset the ASV table (asvtab.txt) to only those sequences classified in the filtered blast output. |
||||
|
||||
# Retrieve taxonomy based on the reference sequence accession numbers in the SILVA blast output. |
||||
### Retrieve taxonomy based on the reference sequence accession numbers (see silva_138.1_nr99_acc2tax.txt) in the filtered SILVA blast output. |
||||
|
||||
|
||||
############## |
||||
### Extra: ### |
||||
############## |
||||
|
||||
# Increased difficulty: add solutions to these tasks, i.e. modify the script, using vi. |