
updated linux intro info

master
chassenr 1 year ago
parent
commit
aca3b9a85a
  1. 14
      Linux_intro_2022/README.md
  2. 51
      Linux_intro_2022/tasks.sh
  3. 4
      Microbio_practical_2021/R_intro.Rmd
  4. 294
      Microbio_practical_2021/R_intro.html
  5. BIN
      Microbio_practical_2021/R_intro.pdf

14
Linux_intro_2022/README.md

@ -20,16 +20,17 @@ The workshop will consist of 2 sessions: The first session will cover the basics
* Finding a suitable terminal to work on
* User environment: default shell, bash (```.bashrc```), environment variables (```$PATH```)
* Basic bash commands: ```cd```, ```htop```, ```ls``` (file permissions), ```cp```, ```scp```, ```mv```, ```less```, ```cut```, ```paste```, ```sort```, ```uniq```, ```mkdir```, ```rm```, ```echo```, ```cat```, ```hexdump```, ```wc```, ```dos2unix```, ```touch```, ```man```, etc.
* Basic linux commands: ```cd``` (file paths), ```pwd```, ```man```, ```ls``` (file permissions), ```echo``` (variables), ```cp```, ```mv```, ```mkdir```, ```rm```, ```ssh```, ```scp```, ```htop```, ```du```, ```df```, ```time```, ```less```, ```cut```, ```paste```, ```sort```, ```uniq```, ```diff```, ```cat```, ```hexdump```, ```wc```, ```tr```, ```dos2unix```, ```touch```, ```grep```, ```awk```, ```sed``` (regular expressions), ```vi```, ```xargs```, ```basename```, ```ln```, ```wget```, ```md5sum```, ```gzip```, ```bzip2```, ```tar```, ```zless```, ```zcat```, ```zgrep```, ```screen```, etc.
* Autocomplete, copy/paste and recall previous commands
* Keyboard shortcuts
* Standard input (stdin), standard output (stdout), standard error (stderr), redirection, pipes
* More basic bash commands: ```grep```, ```awk```, ```sed``` (regular expressions), ```vi```, ```xargs```, ```basename```, ```tr```, ```du```, ```df```, ```ln```, ```time```, ```wget```, ```md5sum```, ```gzip```, ```bzip2```, ```zless```, ```zcat```, ```zgrep```, etc.
* Variables
* Loops for sequential processing (for and while loops)
* Tasks for individual work (break-out rooms)
### Session 2:
* Solution to individual tasks (only if things are unclear)
* if/else statements
* ssh keys
* Software installation, conda, modules
* Writing an executable bash script
@ -38,7 +39,6 @@ The workshop will consist of 2 sessions: The first session will cover the basics
## Course material:
* [Slides]()
* [Examples session 1]()
* [Examples session 1]()
* [Tasks script]()
* [Slides](https://git.io-warnemuende.de/bio_inf/tutorials_collection/src/branch/master/Linux_intro_2022/linux_intro_slides.pdf)
* [Example data](https://owncloud.io-warnemuende.de/index.php/s/MTi0e2nuP2slZ3J)
* [Tasks script](https://git.io-warnemuende.de/bio_inf/tutorials_collection/src/branch/master/Linux_intro_2022/tasks.sh)

51
Linux_intro_2022/tasks.sh

@ -0,0 +1,51 @@
### Task set: 1 ###
# Download raw read data from ENA based on previously extracted metadata into a dedicated directory.
# Search strategy (just FYI for documentation):
# curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(408172)%20AND%20geo_box1(53.5060%2C9.0508%2C66.0010%2C31.0703)%20AND%20country%3D%22Baltic%20Sea%22%20AND%20instrument_model%3D%22Illumina%20MiSeq%22%20AND%20library_layout%3D%22PAIRED%22%20AND%20library_selection%3D%22PCR%22%20AND%20library_strategy%3D%22AMPLICON%22%20AND%20library_source%3D%22METAGENOMIC%22&fields=accession%2Caltitude%2Cassembly_quality%2Cassembly_software%2Cbase_count%2Cbinning_software%2Cbio_material%2Cbroker_name%2Ccell_line%2Ccell_type%2Ccenter_name%2Cchecklist%2Ccollected_by%2Ccollection_date%2Ccollection_date_submitted%2Ccompleteness_score%2Ccontamination_score%2Ccountry%2Ccram_index_aspera%2Ccram_index_ftp%2Ccram_index_galaxy%2Ccultivar%2Cculture_collection%2Cdepth%2Cdescription%2Cdev_stage%2Cecotype%2Celevation%2Cenvironment_biome%2Cenvironment_feature%2Cenvironment_material%2Cenvironmental_package%2Cenvironmental_sample%2Cexperiment_accession%2Cexperiment_alias%2Cexperiment_title%2Cexperimental_factor%2Cfastq_aspera%2Cfastq_bytes%2Cfastq_ftp%2Cfastq_galaxy%2Cfastq_md5%2Cfirst_created%2Cfirst_public%2Cgermline%2Chost%2Chost_body_site%2Chost_genotype%2Chost_gravidity%2Chost_growth_conditions%2Chost_phenotype%2Chost_sex%2Chost_status%2Chost_tax_id%2Cidentified_by%2Cinstrument_model%2Cinstrument_platform%2Cinvestigation_type%2Cisolate%2Cisolation_source%2Clast_updated%2Clat%2Clibrary_construction_protocol%2Clibrary_layout%2Clibrary_name%2Clibrary_selection%2Clibrary_source%2Clibrary_strategy%2Clocation%2Clon%2Cmating_type%2Cnominal_length%2Cnominal_sdev%2Cparent_study%2Cph%2Cproject_name%2Cprotocol_label%2Cread_count%2Crun_accession%2Crun_alias%2Csalinity%2Csample_accession%2Csample_alias%2Csample_capture_status%2Csample_collection%2Csample_description%2Csample_material%2Csample_title%2Csampling_campaign%2Csampling_platform%2Csampling_site%2Cscientific_name%2Csecondary_sample_accession%2Csecondary_study_accession%2Csequencing_method%2Cserotype%2Cserovar%2Csex%2Cspecimen_voucher%2Csra_aspera%2Csra_bytes%2Csra_ftp%2Csra_galaxy%2Csra_md5%2Cstrain%2Cstudy_accession%2Cstudy_alias%2Cstudy_title%2Csub_species%2Csub_strain%2Csubmission_accession%2Csubmission_tool%2Csubmitted_aspera%2Csubmitted_bytes%2Csubmitted_format%2Csubmitted_ftp%2Csubmitted_galaxy%2Csubmitted_host_sex%2Csubmitted_md5%2Csubmitted_sex%2Ctarget_gene%2Ctax_id%2Ctaxonomic_classification%2Ctaxonomic_identity_marker%2Ctemperature%2Ctissue_lib%2Ctissue_type%2Cvariety&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
# The download links are in the column "fastq_ftp".
# Download the TARA Oceans prokaryotic bin set into a dedicated directory.
# https://www.genoscope.cns.fr/tara/localdata/data/BAC_ARC_MAGs-v1/FASTA_1888_MAGs_Bac_Arc.tar.gz
# Unzip the data.
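One possible starting point for the ENA download in this task set is sketched here. It is not part of the committed script. It assumes the search result was saved as a tab-separated file with a header row, that paired-end runs list both files in `fastq_ftp` separated by `;`, and that the links carry no protocol prefix. The function name `list_fastq_urls` and all file and directory names are hypothetical.

```shell
#!/usr/bin/env bash
# Print one download URL per line from an ENA metadata table (TSV).
# Assumptions: header row with a column named "fastq_ftp"; multiple files
# per run are separated by ";"; links are given without "ftp://".
list_fastq_urls() {
  awk -F'\t' '
    NR == 1 {                       # locate the fastq_ftp column by name
      for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
      next
    }
    col && $col != "" {             # skip runs without fastq files
      n = split($col, links, ";")
      for (j = 1; j <= n; j++) print "ftp://" links[j]
    }
  ' "$1"
}

# Hypothetical usage (placeholder file and directory names):
# mkdir -p ena_reads
# list_fastq_urls ena_metadata.tsv | xargs -n 1 wget -nc -P ena_reads
```

Looking up the column by name rather than by position keeps the sketch working if the field list in the query changes.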
### Task set: 2 ###
# List all fastq files in the ENA output directory and create a table that contains 3 columns: sample name (unique identifier) and absolute path to R1 and R2 file name.
# Move some R2 files to a different directory.
# Check which R2 files are now missing based on the sample names that are part of the R1 file names.
# Count the number of sequences per sample.
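A minimal sketch of how the sample table, the missing-R2 check and the read count could be approached. It is not the course solution. It assumes the files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz` and that each fastq record spans exactly 4 lines. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build a 3-column table: sample name, R1 path, R2 path ("NA" if R2 is missing).
# Assumption: files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
make_sample_table() {
  local dir r1 r2 sample
  dir=$(cd "$1" && pwd)             # absolute path of the input directory
  for r1 in "$dir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue        # the glob matched nothing
    sample=$(basename "$r1" _1.fastq.gz)
    r2="$dir/${sample}_2.fastq.gz"
    [ -e "$r2" ] || r2="NA"
    printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
  done
}

# Count sequences in a gzipped fastq file, assuming 4 lines per record.
count_reads() {
  echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}
```

With this layout, samples whose R2 file was moved away can be listed with `make_sample_table some_dir | awk -F'\t' '$3 == "NA" {print $1}'`, where `some_dir` is a placeholder.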
### Task set: 3 ###
# Concatenate all TARA Oceans fasta files in the directory, but prepend the filename (without file extension) in the header of each sequence (space separated).
# In the above output, change the order of information in the fasta header: sequence name first, then file name of origin. Hint: google using back references with regular expressions.
# Check for unusual characters in sequence (not ATCG). Get the header of those sequences if there are any. If not, manually modify a few sequences (try vi for that on the smallest file) for the exercise.
# Remove those unusual sequences from the multifasta file based on their name.
# Remove everything after the first space in the fasta header.
# Create a frequency table of sequence lengths from one of the fasta files.
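The header manipulation and the length table in this task set could be sketched as follows. This is an illustration only: it assumes the fasta files end in `.fa` and that file names contain no characters that are special in a `sed` replacement (such as `&`). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Concatenate fasta files, prepending the file name (without extension)
# to every header. Assumption: input files end in ".fa".
concat_with_filename() {
  local f name
  for f in "$@"; do
    name=$(basename "$f" .fa)
    sed "s/^>/>${name} /" "$f"
  done
}

# Swap the first two space-separated header fields using back references:
# \1 is the file name, \2 is the sequence name.
swap_header_fields() {
  sed -E 's/^>([^ ]+) ([^ ]+)/>\2 \1/'
}

# Frequency table of sequence lengths (length, count).
# Sequence lines are summed per record, so wrapped sequences are handled.
length_table() {
  awk '
    /^>/ { if (seen) print len; len = 0; seen = 1; next }
    { len += length($0) }
    END { if (seen) print len }
  ' "$1" | sort -n | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

Because the first two functions read and write plain text streams, they can be chained with a pipe, for example `concat_with_filename *.fa | swap_header_fields > all_bins.fasta` (output name is a placeholder).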
### Task set: 4 ###
# Filter blast output based on percentage identity and query coverage: minimum similarity of 97% and full query coverage.
# Subset the ASV table to only those sequences classified in the filtered blast output.
# Retrieve taxonomy based on the reference sequence accession numbers in the SILVA blast output.
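A hedged sketch of the first two filtering steps. The layout of the blast output is not shown in the script, so this assumes a tab-separated table with percentage identity in column 3 and query coverage in column 13 (as produced, for example, by BLAST+ with `-outfmt "6 std qcovs"`), and an ASV table with a header row and ASV names in the first column. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Keep hits with at least 97 % identity and 100 % query coverage.
# Assumption: percent identity in column 3, query coverage in column 13.
filter_blast() {
  awk -F'\t' -v min_id=97 '$3 >= min_id && $13 == 100' "$1"
}

# Subset the ASV table to rows whose first column occurs among the query IDs
# (column 1) of the filtered blast output; the header row is kept.
# Note: the filtered blast file must not be empty for NR == FNR to work.
# usage: subset_asv_table filtered_blast.tsv asv_table.tsv
subset_asv_table() {
  awk -F'\t' 'NR == FNR { keep[$1]; next } FNR == 1 || ($1 in keep)' "$1" "$2"
}
```

The two-file `awk` idiom first loads the IDs to keep into an array and then tests each row of the second file against it, which avoids sorting either input.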
### Extra: ###
# Increased difficulty: add solutions to these tasks, i.e. modify the script, using vi.

4
Microbio_practical_2021/R_intro.Rmd

@ -1,7 +1,7 @@
---
output:
html_document: default
pdf_document: default
html_document: default
---
### Installing and starting R
@ -279,7 +279,7 @@ str(L$int)
While the above overview of data and object types may have been a bit dry, it will help you to understand and diagnose the majority of errors you are likely to get when starting to work with R. I would say about 80% of errors are related either to typos or to the wrong data or object type in the input for a function.
Now, let's take a closer look at the actual example data that you will be using during the practical.
The data set consists of 3 tables: an integer matrix (those are the sequence counts per sample), a character matrix (this is the taxonomy assigned to each amplicon sequence variant (ASV)), and a data frame (this contains metadata, experimental conditions, and environmental parameters). They can be downloaded [here](https://owncloud.io-warnemuende.de/index.php/s/nZb3M2tIywPIDoR) (password: practical2021).
The data set consists of 3 tables: an integer matrix (those are the sequence counts per sample), a character matrix (this is the taxonomy assigned to each amplicon sequence variant (ASV)), and a data frame (this contains metadata, experimental conditions, and environmental parameters). Please contact christiane.hassenrueck@io-warnemuende.de to get access.
The data set that you will work with was generated during an experiment investigating the effects of salinity on the composition and activity of microbial communities in the Warnow.
Water was collected from the Warnow and incubated at 4 different salinity levels for 21 days.

294
Microbio_practical_2021/R_intro.html

File diff suppressed because one or more lines are too long

BIN
Microbio_practical_2021/R_intro.pdf

Binary file not shown.