# The Poseidon Standard

Poseidon is a solution for genotype data organisation established within the Department of Archaeogenetics at the Max Planck Institute for the Science of Human History (MPI-SHH) in Jena.

Detailed documentation for Poseidon can be found at [https://poseidon-framework.github.io](https://poseidon-framework.github.io).

## The Poseidon package structure

All ancient and modern data in Poseidon are distributed into so-called packages, which are directories containing a dedicated set of files. Packages correspond to published sets of genomes or, in the case of unpublished projects, to ongoing (and growing) sets of samples currently being analysed. All text files in the package are UTF-8 encoded.

Every package should have the following files:

- A `POSEIDON.yml` file to formally define the package
- Genotype data in EIGENSTRAT or PLINK format
- A `.janno` file to store context information
- A `.bib` file for literature references

It can also contain the following files:

- A `README.txt` file for arbitrary context information
- A `CHANGELOG.txt` file to document changes to the package

Example:

```
Switzerland_LNBA_Roswita/POSEIDON.yml
Switzerland_LNBA_Roswita/Switzerland_LNBA.plink.bed
Switzerland_LNBA_Roswita/Switzerland_LNBA.plink.bim
Switzerland_LNBA_Roswita/Switzerland_LNBA.plink.fam
Switzerland_LNBA_Roswita/Switzerland_LNBA.janno
Switzerland_LNBA_Roswita/Switzerland_LNBA.bib
Switzerland_LNBA_Roswita/README.txt
Switzerland_LNBA_Roswita/CHANGELOG.txt
```

## The `POSEIDON.yml` file

The `POSEIDON.yml` file lists relative file paths and metainformation in a standardized, machine-readable format.

- It must be a valid [YAML file](https://yaml.org/).
- Its fields are documented in the [POSEIDON_yml_fields.tsv file](https://github.com/poseidon-framework/poseidon2-schema/blob/master/POSEIDON_yml_fields.tsv) in this repository.

Example:

```
poseidonVersion: 2.3.1
title: Switzerland_LNBA_Roswita
description: LNBA Switzerland genetic data not yet published # optional
contributor:
  - name: Roswita Malone
    email: roswita.malone@example.org
  - name: Paul Panther
    email: paul.panther@example.edu
packageVersion: 1.1.2
lastModified: 2021-01-28
genotypeData:
  format: PLINK
  genoFile: Switzerland_LNBA_Roswita.bed
  genoFileChkSum: 95b093eefacc1d6499afcfe89b15d56c # optional
  snpFile: Switzerland_LNBA_Roswita.bim
  snpFileChkSum: 6771d7c873219039ba3d5bdd96031ce3 # optional
  indFile: Switzerland_LNBA_Roswita.fam
  indFileChkSum: f77dc756666dbfef3bb35191ae15a167 # optional
  snpSet: 1240K
jannoFile: Switzerland_LNBA_Roswita.janno
jannoFileChkSum: 555d7733135ebcabd032d581381c5d6f # optional
bibFile: sources.bib
bibFileChkSum: 70cd3d5801cee8a93fc2eb40a99c63fa # optional
readmeFile: README.txt # optional
changelogFile: CHANGELOG.txt # optional
```

When a package is modified in any way (e.g. by updating the context information in the `.janno` file), the `packageVersion` field should be incremented and the `lastModified` field updated to the current date.

## Genotype data

Genotype data in Poseidon packages is stored either in PLINK (binary) or EIGENSTRAT format.

| | PLINK (binary) | EIGENSTRAT |
|---|---|---|
| genotype file | [`.bed` (binary biallelic genotype table)](https://www.cog-genomics.org/plink/1.9/formats#bed) | [`.geno` (genotype file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
| SNP file | [`.bim` (extended MAP file)](https://www.cog-genomics.org/plink/1.9/formats#bim) | [`.snp` (snp file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
| individual file | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |

In addition to these files and their checksums, you should also provide a `snpSet` entry, which determines the shape of the genotype file. Currently, only `1240K`, `HumanOrigins` or `Other` are allowed. For backwards compatibility, this field is technically not mandatory, but we encourage users to set it, and our software also encourages this through required user input.

## The `.janno` file

The `.janno` file is a tab-separated text file with a header line. It holds context information (variables/columns) for each sample (objects/rows) in a package.

- A set of strictly defined core variables (defined by column name) and their possible content are documented here: [janno_columns.tsv](https://github.com/poseidon-framework/poseidon2-schema/blob/master/janno_columns.tsv)
- A `.janno` file can have all of these core variables, or only a subset of them.
- Only three columns are mandatory to make the file valid: **Individual_ID**, **Group_Name** and **Genetic_Sex**
- Arbitrary columns not defined here can be added as long as their column names do not clash with the defined ones.
- The column order is irrelevant.
- If information is unknown or a variable does not apply to a certain sample, then the respective cell(s) can be filled with the NULL value `n/a` or simply an empty string.
- The order of the samples (rows) in the `.janno` file must be equal to the order in the genetic data files (`.ind`, `.fam`).
- The values in the columns **Individual_ID**, **Group_Name** and **Genetic_Sex** must be equal to the terms used in the genetic data files (`.ind`, `.fam`).
- Several pre-defined columns of the `.janno` file are list columns that hold multiple values (either strings or numerics) separated by `;`.
- The decimal separator for all floating point numbers is `.`.

For more extensive documentation of the columns and their interaction see [https://poseidon-framework.github.io/#/janno_details](https://poseidon-framework.github.io/#/janno_details).

## The `.bib` file

A [BibTeX](http://www.bibtex.org/) file with all references listed in the `.janno` file. The BibTeX keys must match the ones used in the `.janno` file.

Example:

```
@article{CassidyPNAS2015,
  doi = {10.1073/pnas.1518445113},
  url = {https://doi.org/10.1073%2Fpnas.1518445113},
  year = 2015,
  month = {dec},
  publisher = {Proceedings of the National Academy of Sciences},
  volume = {113},
  number = {2},
  pages = {368--373},
  author = {Lara M. Cassidy and Rui Martiniano and Eileen M. Murphy and Matthew D. Teasdale and James Mallory and Barrie Hartwell and Daniel G. Bradley},
  title = {Neolithic and Bronze Age migration to Ireland and establishment of the insular Atlantic genome},
  journal = {Proceedings of the National Academy of Sciences}
}
```

## The `README.txt` file

Informal information accompanying the package.

Example:

```
This package contains a rather interesting set of samples relevant for the peopling of the Territory of Christmas Island in the Indian Ocean.
We consider this especially relevant, because ...
```

## The `CHANGELOG.txt` file

Documentation of important changes in the history of a package.

Example:

```
## 1.2.0
- Fixed a spelling mistake in the site name "Hosenacker"->"Rosenacker".
## 1.1.1
- Added mtDNA contamination estimation to .janno file
## 1.1.0
- The authors of @Gassenhauer_2021 made some previously restricted samples for their publication available later and we added them.
## 1.0.0
- Creation of the package.
```
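
The consistency rules for the `.janno` file (three mandatory columns, same sample order and same values as in the genetic data files) lend themselves to an automated check. The following is only a minimal sketch of such a check, not part of the Poseidon software: the function name `check_janno` and the sample values are hypothetical, and it assumes an EIGENSTRAT `.ind` file with the usual three whitespace-separated columns (sample ID, sex, group label).

```python
import csv
import io

# The three mandatory .janno columns named in this document.
MANDATORY = ("Individual_ID", "Group_Name", "Genetic_Sex")


def check_janno(janno_text, ind_text):
    """Return a list of problem descriptions; an empty list means none were found.

    Hypothetical helper: compares the mandatory .janno columns, row by row,
    with an EIGENSTRAT .ind file (assumed layout: ID, sex, group).
    """
    reader = csv.DictReader(io.StringIO(janno_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in MANDATORY if col not in header]
    if missing:
        return ["missing mandatory column(s): " + ", ".join(missing)]

    # Bring the .janno values into the assumed .ind column order: ID, sex, group.
    janno_rows = [
        (row["Individual_ID"], row["Genetic_Sex"], row["Group_Name"])
        for row in reader
    ]
    ind_rows = [
        tuple(line.split()[:3]) for line in ind_text.splitlines() if line.strip()
    ]

    problems = []
    if len(janno_rows) != len(ind_rows):
        problems.append(
            f"{len(janno_rows)} .janno rows but {len(ind_rows)} .ind rows"
        )
    # Comparing position by position also enforces the identical sample order.
    for number, (from_janno, from_ind) in enumerate(zip(janno_rows, ind_rows), start=1):
        if from_janno != from_ind:
            problems.append(f"sample {number}: .janno {from_janno} != .ind {from_ind}")
    return problems


# Made-up example data; the extra column "Country" uses both NULL spellings.
janno = (
    "Individual_ID\tGroup_Name\tGenetic_Sex\tCountry\n"
    "S1\tPopA\tF\tn/a\n"
    "S2\tPopB\tM\t\n"
)
ind = "S1 F PopA\nS2 M PopB\n"
print(check_janno(janno, ind))
```

For a PLINK package the same comparison would read the `.fam` file instead; which `.fam` columns carry the individual and group names is not specified here, so the sketch deliberately stays with `.ind`.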