http://stato-ontology.org/ Alejandra Gonzalez-Beltran (http://orcid.org/0000-0003-3499-8262) STATO: the statistical methods ontology Camille Maumet (http://orcid.org/0000-0002-6290-553X) STATO is the statistical methods ontology. It contains concepts and properties related to statistical methods, probability distributions and other concepts related to statistical analysis, including relationships to study designs and plots. stat-ontology@googlegroups.com This ontology (release RC1.4) is distributed under a Creative Commons Attribution License: http://creativecommons.org/licenses/by/3.0/ Philippe Rocca-Serra (http://orcid.org/0000-0001-9853-5668) Thomas Nichols (http://orcid.org/0000-0002-4516-5103) Chris Mungall (http://orcid.org/0000-0002-6601-2165) Orlaith Burke Statistical Method, Design of Experiment, Plots, Statistical Model Nolan Nichols (http://orcid.org/0000-0003-1099-3328) Hanna Cwiek (https://orcid.org/0000-0001-9113-567X) https://github.com/ISA-tools/stato/issues Relates an entity in the ontology to the name of the variable that is used to represent it in the code that generates the BFO OWL file from the lispy specification. Really of interest to developers only. BFO OWL specification label Relates an entity in the ontology to the term that is used to represent it in the CLIF specification of BFO2 Person:Alan Ruttenberg Really of interest to developers only. BFO CLIF specification label editor preferred term~editor preferred label The concise, meaningful, and human-friendly name for a class or property preferred by the ontology developers. (US-English) PERSON:Daniel Schober GROUP:OBI:<http://purl.obolibrary.org/obo/obi> example A phrase describing how a term should be used and/or a citation to a work which uses it.
May also include other kinds of examples that facilitate immediate understanding, such as widely known prototypes or instances of a class, or cases where a relation is said to hold. PERSON:Daniel Schober GROUP:OBI:<http://purl.obolibrary.org/obo/obi> example of usage has curation status PERSON:Alan Ruttenberg PERSON:Bill Bug PERSON:Melanie Courtot OBI_0000281 has curation status definition textual definition The official OBI definition, explaining the meaning of a class or property. Shall be Aristotelian, formalized and normalized. Can be augmented with colloquial definitions. 2012-04-05: Barry Smith The official OBI definition, explaining the meaning of a class or property: 'Shall be Aristotelian, formalized and normalized. Can be augmented with colloquial definitions' is terrible. Can you fix to something like: A statement of necessary and sufficient conditions explaining the meaning of an expression referring to a class or property. Alan Ruttenberg Your proposed definition is a reasonable candidate, except that it is very common that necessary and sufficient conditions are not given. Mostly they are necessary, occasionally they are necessary and sufficient or just sufficient. Often they use terms that are not themselves defined and so they effectively can't be evaluated by those criteria. On the specifics of the proposed definition: We don't have definitions of 'meaning' or 'expression' or 'property'. For 'reference' in the intended sense I think we use the term 'denotation'. For 'expression', I think you mean symbol, or identifier. For 'meaning' it differs for class and property. For class we want documentation that lets the intended reader determine whether an entity is an instance of the class, or not.
For property we want documentation that lets the intended reader determine, given a pair of potential relata, whether the assertion that the relation holds is true. The 'intended reader' part suggests that we also specify who, we expect, would be able to understand the definition, and also generalizes over human and computer reader to include textual and logical definition. Personally, I am more comfortable weakening definition to documentation, with instructions as to what is desirable. We also have the outstanding issue of how to aim different definitions to different audiences. A clinical audience reading ChEBI wants a different sort of documentation/definition from a chemistry-trained audience, and similarly there is a need for a definition that is adequate for an ontologist to work with. PERSON:Daniel Schober GROUP:OBI:<http://purl.obolibrary.org/obo/obi> definition textual definition editor note An administrative note intended for its editor. It may not be included in the publication version of the ontology, so it should contain nothing necessary for end users to understand the ontology. PERSON:Daniel Schober GROUP:OBI:<http://purl.obolibrary.org/obo/obi> editor note term editor Name of the editor entering the term in the file. The term editor is a point of contact for information regarding the term. The term editor may be, but is not always, the author of the definition, which may have been worked upon by several people. 20110707, MC: label update to term editor and definition modified accordingly. See https://github.com/information-artifact-ontology/IAO/issues/115. PERSON:Daniel Schober GROUP:OBI:<http://purl.obolibrary.org/obo/obi> term editor alternative term An alternative name for a class or property which means the same thing as the preferred name (semantically equivalent). PERSON:Daniel Schober GROUP:OBI:<http://purl.obolibrary.org/obo/obi> alternative term definition source formal citation, e.g.
identifier in external database to indicate / attribute source(s) for the definition. Free text indicating / attributing source(s) for the definition. EXAMPLE: Author Name, URI, MeSH Term C04, PUBMED ID, Wiki uri on 31.01.2007 PERSON:Daniel Schober Discussion on obo-discuss mailing-list, see http://bit.ly/hgm99w GROUP:OBI:<http://purl.obolibrary.org/obo/obi> definition source curator note An administrative note of use for a curator but of no use for a user. PERSON:Alan Ruttenberg curator note term tracker item the URI for an OBI Terms ticket at SourceForge, such as https://sourceforge.net/p/obi/obi-terms/772/ An IRI or similar locator for a request or discussion of an ontology term. Person: Jie Zheng, Chris Stoeckert, Alan Ruttenberg The 'tracker item' can associate a tracker with a specific ontology term. term tracker item imported from For external terms/classes, the ontology from which the term was imported. PERSON:Alan Ruttenberg PERSON:Melanie Courtot GROUP:OBI:<http://purl.obolibrary.org/obo/obi> imported from OBO foundry unique label An alternative name for a class or property which is unique across the OBO Foundry. The intended usage of that property is as follows: OBO foundry unique labels are automatically generated based on regular expressions provided by each ontology, so that SO could specify unique label = 'sequence ' + [label], MA could specify unique label = 'mouse ' + [label], etc. Upon importing terms, ontology developers can choose to use the 'OBO foundry unique label' for an imported term or not. The same applies to tools. PERSON:Alan Ruttenberg PERSON:Bjoern Peters PERSON:Chris Mungall PERSON:Melanie Courtot GROUP:OBO Foundry <http://obofoundry.org/> OBO foundry unique label elucidation person:Alan Ruttenberg Person:Barry Smith Primitive terms in a highest-level ontology such as BFO are terms which are so basic to our understanding of reality that there is no way of defining them in a non-circular fashion.
For these, therefore, we can provide only elucidations, supplemented by examples and by axioms. elucidation has associated axiom(nl) Person:Alan Ruttenberg An axiom associated with a term expressed using natural language has associated axiom(nl) has associated axiom(fol) Person:Alan Ruttenberg An axiom expressed in first order logic using CLIF syntax has associated axiom(fol) ISA alternative term An alternative term used by the ISA tools project (http://isa-tools.org). Requested by Alejandra Gonzalez-Beltran https://sourceforge.net/tracker/?func=detail&aid=3603413&group_id=177891&atid=886178 Person: Alejandra Gonzalez-Beltran Person: Philippe Rocca-Serra ISA tools project (http://isa-tools.org) ISA alternative term IEDB alternative term An alternative term used by the IEDB. PERSON:Randi Vita, Jason Greenbaum, Bjoern Peters IEDB IEDB alternative term temporal interpretation https://github.com/oborel/obo-relations/wiki/ROAndTime an alternative term used by the STATO statistical ontology and the ISA team Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO alternative term an R command syntax or a link to R documentation in support of Statistical Ontology Classes or Data Transformations Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra R command an annotation property to provide a canonical command to invoke a method implementation using the Python programming language Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Python command the most common series or system of written mathematical symbols used to represent the entity AGB preferred mathematical notation Examples of a Contributor include a person, an organisation, or a service. Typically, the name of a Contributor should be used to indicate the entity. An entity responsible for making contributions to the content of the resource. Contributor Examples of a Creator include a person, an organisation, or a service.
Typically, the name of a Creator should be used to indicate the entity. An entity primarily responsible for making the content of the resource. Creator Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format. A date associated with an event in the life cycle of the resource. Date Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content. An account of the content of the resource. Description Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [MIME] defining computer media formats). The physical or digital manifestation of the resource. Format The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system. A reference to a resource from which the present resource is derived. Source Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme. The topic of the content of the resource.
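The Date element above recommends the W3CDTF profile of ISO 8601, i.e. the YYYY-MM-DD format. A minimal sketch of encoding such a value in Python (the helper name `dc_date` is hypothetical, not part of any Dublin Core library):

```python
from datetime import date

def dc_date(d: date) -> str:
    """Encode a Dublin Core Date value in the W3CDTF
    YYYY-MM-DD profile of ISO 8601 (hypothetical helper)."""
    return d.strftime("%Y-%m-%d")

# e.g. a resource dated 5 April 2012:
encoded = dc_date(date(2012, 4, 5))  # -> '2012-04-05'
```

Note that `%Y` zero-pads the year to four digits, so the output always conforms to the fixed-width W3CDTF form.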
Subject and Keywords Mark Miller 2018-05-11T13:47:29Z label is part of my brain is part of my body (continuant parthood, two material entities) my stomach cavity is part of my stomach (continuant parthood, immaterial entity is part of material entity) this day is part of this year (occurrent parthood) a core relation that holds between a part and its whole Everything is part of itself. Any part of any part of a thing is itself part of that thing. Two distinct things cannot be part of each other. Occurrents are not subject to change and so parthood between occurrents holds for all the times that the part exists. Many continuants are subject to change, so parthood between continuants will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime Parthood requires the part and the whole to have compatible classes: only an occurrent can be part of an occurrent; only a process can be part of a process; only a continuant can be part of a continuant; only an independent continuant can be part of an independent continuant; only an immaterial entity can be part of an immaterial entity; only a specifically dependent continuant can be part of a specifically dependent continuant; only a generically dependent continuant can be part of a generically dependent continuant. (This list is not exhaustive.) A continuant cannot be part of an occurrent: use 'participates in'. An occurrent cannot be part of a continuant: use 'has participant'. A material entity cannot be part of an immaterial entity: use 'has location'. A specifically dependent continuant cannot be part of an independent continuant: use 'inheres in'. An independent continuant cannot be part of a specifically dependent continuant: use 'bearer of'.
part_of part of http://www.obofoundry.org/ro/#OBO_REL:part_of has part my body has part my brain (continuant parthood, two material entities) my stomach has part my stomach cavity (continuant parthood, material entity has part immaterial entity) this year has part this day (occurrent parthood) a core relation that holds between a whole and its part Everything has itself as a part. Any part of any part of a thing is itself part of that thing. Two distinct things cannot have each other as a part. Occurrents are not subject to change and so parthood between occurrents holds for all the times that the part exists. Many continuants are subject to change, so parthood between continuants will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime Parthood requires the part and the whole to have compatible classes: only an occurrent can have an occurrent as part; only a process can have a process as part; only a continuant can have a continuant as part; only an independent continuant can have an independent continuant as part; only a specifically dependent continuant can have a specifically dependent continuant as part; only a generically dependent continuant can have a generically dependent continuant as part. (This list is not exhaustive.) A continuant cannot have an occurrent as part: use 'participates in'. An occurrent cannot have a continuant as part: use 'has participant'. An immaterial entity cannot have a material entity as part: use 'location of'. An independent continuant cannot have a specifically dependent continuant as part: use 'bearer of'. A specifically dependent continuant cannot have an independent continuant as part: use 'inheres in'.
has_part has part realized in this disease is realized in this disease course this fragility is realized in this shattering this investigator role is realized in this investigation is realized by realized_in [copied from inverse property 'realizes'] to say that b realizes c at t is to assert that there is some material entity d & b is a process which has participant d at t & c is a disposition or role of which d is bearer_of at t & the type instantiated by b is correlated with the type instantiated by c. (axiom label in BFO2 Reference: [059-003]) Paraphrase of elucidation: a relation between a realizable entity and a process, where there is some material entity that is bearer of the realizable entity and participates in the process, and the realizable entity comes to be realized in the course of the process realized in realizes this disease course realizes this disease this investigation realizes this investigator role this shattering realizes this fragility to say that b realizes c at t is to assert that there is some material entity d & b is a process which has participant d at t & c is a disposition or role of which d is bearer_of at t & the type instantiated by b is correlated with the type instantiated by c. (axiom label in BFO2 Reference: [059-003]) Paraphrase of elucidation: a relation between a process and a realizable entity, where there is some material entity that is bearer of the realizable entity and participates in the process, and the realizable entity comes to be realized in the course of the process realizes preceded by An example is: translation preceded_by transcription; aging preceded_by development (not however death preceded_by aging). Where derives_from links classes of continuants, preceded_by links classes of processes. Clearly, however, these two relations are not independent of each other.
Thus if cells of type C1 derive_from cells of type C, then any cell division involving an instance of C1 in a given lineage is preceded_by cellular processes involving an instance of C. The assertion P preceded_by P1 tells us something about Ps in general: that is, it tells us something about what happened earlier, given what we know about what happened later. Thus it does not provide information pointing in the opposite direction, concerning instances of P1 in general; that is, that each is such as to be succeeded by some instance of P. Note that an assertion to the effect that P preceded_by P1 is rather weak; it tells us little about the relations between the underlying instances in virtue of which the preceded_by relation obtains. Typically we will be interested in stronger relations, for example in the relation immediately_preceded_by, or in relations which combine preceded_by with a condition to the effect that the corresponding instances of P and P1 share participants, or that their participants are connected by relations of derivation, or (as a first step along the road to a treatment of causality) that the one process in some way affects (for example, initiates or regulates) the other. is preceded by preceded_by http://www.obofoundry.org/ro/#OBO_REL:preceded_by preceded by precedes precedes has measurement unit label This document is about information artifacts and their representations is_about is a (currently) primitive relation that relates an information artifact to an entity. 7/6/2009 Alan Ruttenberg. Following discussion with Jonathan Rees, and introduction of "mentions" relation. Weaken the is_about relationship to be primitive. We will try to build it back up by elaborating the various subproperties that are more precisely defined. 
Some currently missing phenomena that should be considered "about" are predications - "The only person who knows the answer is sitting beside me", Allegory, Satire, and other literary forms that can be topical without explicitly mentioning the topic. person:Alan Ruttenberg Smith, Ceusters, Ruttenberg, 2000 years of philosophy is about A person's name denotes the person. A variable name in a computer program denotes some piece of memory. Lexically equivalent strings can denote different things, for instance "Alan" can denote different people. In each case of use, there is a case of the denotation relation obtaining, between "Alan" and the person that is being named. denotes is a primitive, instance-level, relation obtaining between an information content entity and some portion of reality. Denotation is what happens when someone creates an information content entity E in order to specifically refer to something. The only relation between E and the thing is that E can be used to 'pick out' the thing. This relation connects those two together. Freedictionary.com sense 3: To signify directly; refer to specifically 2009-11-10 Alan Ruttenberg. Old definition said the following to emphasize the generic nature of this relation. We no longer have 'specifically denotes', which would have been primitive, so make this relation primitive. g denotes r =def: r is a portion of reality; there is some c that is a concretization of g; every c that is a concretization of g specifically denotes r. person:Alan Ruttenberg Conversations with Barry Smith, Werner Ceusters, Bjoern Peters, Michel Dumontier, Melanie Courtot, James Malone, Bill Hogan denotes m is a quality measurement of q at t when q is a quality and there is a measurement process p that has specified output m, a measurement datum, that is about q 8/6/2009 Alan Ruttenberg: The strategy is to be rather specific with this relationship. There are other kinds of measurements that are not of qualities, such as those that measure time.
We will add these as separate properties for the moment and see about generalizing later. From the second IAO workshop [Alan Ruttenberg 8/6/2009: not completely current, though bringing in comparison is probably important] This one is the one we are struggling with at the moment. The issue is what a measurement measures. On the one hand saying that it measures the quality would include it "measuring" the bearer = referring to the bearer in the measurement. However this makes comparisons of two different things not possible. On the other hand not having it inhere in the bearer, on the face of it, breaks the audit trail. Werner suggests a solution based on "Magnitudes" a proposal for which we are awaiting details. -- From the second IAO workshop, various comments, [commented on by Alan Ruttenberg 8/6/2009] unit of measure is a quality, e.g. the length of a ruler. [We decided to hedge on what units of measure are, instead talking about measurement unit labels, which are the information content entities that are about whatever measurement units are. For IAO we need that information entity in any case. See the term measurement unit label] [Some struggling with the various subflavors of is_about. We subsequently removed the relation represents, and describes until and only when we have a better theory] a represents b means either a denotes b or a describes b. describe: a describes b means a is about b and a allows an inference of at least one quality of b. We have had a long discussion about denotes versus describes. From the second IAO workshop: An attempt at tying the quality to the measurement datum more carefully. a is a magnitude means a is a determinate quality particular inhering in some bearer b existing at a time t that can be represented/denoted by an information content entity e that has parts denoting a unit of measure, a number, and b. The unit of measure is an instance of the determinable quality.
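The note above describes a measurement datum as an information content entity with parts denoting a number, a unit of measure, and the bearer b. A rough structural sketch (all class and field names here are hypothetical illustrations, not IAO identifiers):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MeasurementDatum:
    """Sketch of the structure described in the workshop note:
    an information content entity whose parts denote a number,
    a measurement unit label, and the bearer of the measured quality.
    Hypothetical names, for illustration only."""
    value: float       # the number
    unit_label: str    # the measurement unit label, e.g. "centimeter"
    bearer: str        # identifier of the quality's bearer

    def __str__(self) -> str:
        return f"{self.value} {self.unit_label} (of {self.bearer})"

length = MeasurementDatum(30.0, "centimeter", "this ruler")
```

Keeping the bearer as a part of the datum is what the note calls the "audit trail"; comparing two data then amounts to comparing their value/unit parts while the bearer parts differ.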
From the second meeting on IAO: An attempt at defining assay using Barry's "reliability" wording assay: process and has_input some material entity and has_output some information content entity and which is such that instances of this process type reliably generate outputs that describe the input. Alan Ruttenberg is quality measurement of relating a Cartesian spatial coordinate datum to a unit label that together with the values represent a point has coordinate unit label relates a process to a time-measurement-datum that represents the duration of the process Person:Alan Ruttenberg is duration of inverse of the relation 'is quality measurement of' 2009/10/19 Alan Ruttenberg. Named 'junk' relation useful in restrictions, but not a real instance relationship Person:Alan Ruttenberg is quality measured as relates a time stamped measurement datum to the time measurement datum that denotes the time when the measurement was taken Alan Ruttenberg has time stamp relates a time stamped measurement datum to the measurement datum that was measured Alan Ruttenberg has measurement datum is_supported_by_data The relation between the conclusion "Gene tpbA is involved in EPS production" and the data items produced using two sets of organisms, one being a tpbA knockout, the other being tpbA wildtype tested in polysaccharide production assays and analyzed using an ANOVA.
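The tpbA example mentions two groups compared with an ANOVA. As an illustration of what such an analysis computes, here is a one-way ANOVA F statistic in plain Python; the function name and the sample values are invented for this sketch, not taken from the cited experiment:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic:
    F = (between-group mean square) / (within-group mean square)."""
    all_vals = [x for g in groups for x in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    # sum of squares between groups, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # sum of squared deviations from each group's own mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# e.g. polysaccharide production, knockout vs. wildtype (made-up values):
f_stat = one_way_anova_f([[2.1, 2.4, 2.2], [3.0, 3.3, 3.1]])
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) is the data item that would then support the conclusion in the sense of is_supported_by_data.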
The relation between a data item and a conclusion where the conclusion is the output of a data interpreting process and the data item is used as an input to that process OBI OBI Philly 2011 workshop is_supported_by_data has_specified_input see is_input_of example_of_usage A relation between a planned process and a continuant participating in that process that is not created during the process. The presence of the continuant during the process is explicitly specified in the plan specification which the process realizes the concretization of. 8/17/09: specified inputs of one process are not necessarily specified inputs of a larger process that it is part of. This is in contrast to how 'has participant' works. PERSON: Alan Ruttenberg PERSON: Bjoern Peters PERSON: Larry Hunter PERSON: Melanie Courtot has_specified_input is_specified_input_of some Autologous EBV (Epstein-Barr virus)-transformed B-LCL (B lymphocyte cell line) is_input_for an instance of the Chromium Release Assay described at https://wiki.cbil.upenn.edu/obiwiki/index.php/Chromium_Release_assay A relation between a planned process and a continuant participating in that process that is not created during the process. The presence of the continuant during the process is explicitly specified in the plan specification which the process realizes the concretization of. Alan Ruttenberg PERSON:Bjoern Peters is_specified_input_of has_specified_output A relation between a planned process and a continuant participating in that process. The presence of the continuant at the end of the process is explicitly specified in the objective specification which the process realizes the concretization of. PERSON: Alan Ruttenberg PERSON: Bjoern Peters PERSON: Larry Hunter PERSON: Melanie Courtot has_specified_output is_specified_output_of A relation between a planned process and a continuant participating in that process.
The presence of the continuant at the end of the process is explicitly specified in the objective specification which the process realizes the concretization of. Alan Ruttenberg PERSON:Bjoern Peters is_specified_output_of is_proxy_for position on a gel is_proxy_for mass and charge of molecule in a western blot. Fluorescent intensity is_proxy_for amount of protein labeled with GFP. Examples: A260/A280 (of a DNA sample) is_proxy_for DNA-purity. NMR Sample scan is a proxy for sample quality. Within the assay mentioned here: https://wiki.cbil.upenn.edu/obiwiki/index.php/Chromium_Release_assay level of radioactivity is_proxy_for level of toxicity A relation between continuant instances c1 and c2 where within an experiment/protocol application, measurement of c1 is used to determine what a measurement of c2 would be. A relation between continuant instances c1 and c2 where within a protocol application, measurement of c1 is related to what would be the measurement of c2. (another definition) Alan Ruttenberg is_proxy_for achieves_planned_objective A cell sorting process achieves the objective specification 'material separation objective' This relation obtains between a planned process and an objective specification when the criteria specified in the objective specification are met at the end of the planned process. BP, AR, PPPB branch PPPB branch derived modified according to email thread from 1/23/09 in accordance with DT and PPPB branch achieves_planned_objective has grain the relation of the cells in the finger of the skin to the finger, in which an indeterminate number of grains are parts of the whole by virtue of being grains in a collective that is part of the whole, and in which removing one granular part does not necessarily damage or diminish the whole. Ontological: Whether there is a fixed, or nearly fixed number of parts - e.g.
fingers of the hand, chambers of the heart, or wheels of a car - such that there can be a notion of a single one being missing, or whether, by contrast, the number of parts is indeterminate - e.g., cells in the skin of the hand, red cells in blood, or rubber molecules in the tread of the tire of the wheel of the car. Discussion in Karlsruhe with, among others, Alan Rector, Stefan Schulz, Marijke Keet, Melanie Courtot, and Alan Ruttenberg. Definition taken from the definition of granular parthood in the cited paper. Needs work to put into standard form. PERSON: Alan Ruttenberg PAPER: Granularity, scale and collectivity: When size does and does not matter, Alan Rector, Jeremy Rogers, Thomas Bittner, Journal of Biomedical Informatics 39 (2006) 333-349 has grain objective_achieved_by This relation obtains between an objective specification and a planned process when the criteria specified in the objective specification are met at the end of the planned process. OBI OBI objective_achieved_by is member of organization Relating a legal person or organization to an organization in the case where the legal person or organization has a role as member of the organization. 2009/10/01 Alan Ruttenberg. Barry prefers generic is-member-of. Question of what the range should be. For now organization. Is organization a population? Would the same relation be used to record members of a population? JZ: Discussed on May 7, 2012 OBI dev call. Bjoern points out that we need to allow for organizations to be members of organizations. And agreed by the other OBI developers. So, human and organization were specified in 'Domains'. The textual definition was updated based on it. Person:Alan Ruttenberg Person:Helen Parkinson 2009/09/28 Alan Ruttenberg. Fucoidan-use-case is member of organization has organization member Relating an organization to a legal person or organization.
See tracker: https://sourceforge.net/tracker/index.php?func=detail&aid=3512902&group_id=177891&atid=886178 Person: Jie Zheng has organization member specifies value of A relation between a value specification and an entity which the specification is about. specifies value of has value specification A relation between an information content entity and a value specification that specifies its value. PERSON: James A. Overton OBI has value specification inheres in this fragility inheres in this vase this red color inheres in this apple a relation between a specifically dependent continuant (the dependent) and an independent continuant (the bearer), in which the dependent specifically depends on the bearer for its existence A dependent inheres in its bearer at all times for which the dependent exists. inheres_in inheres in bearer of this apple is bearer of this red color this vase is bearer of this fragility a relation between an independent continuant (the bearer) and a specifically dependent continuant (the dependent), in which the dependent specifically depends on the bearer for its existence A bearer can have many dependents, and its dependents can exist for different periods of time, but none of its dependents can exist when the bearer does not exist. 
bearer_of is bearer of bearer of participates in this blood clot participates in this blood coagulation this input material (or this output material) participates in this process this investigator participates in this investigation a relation between a continuant and a process, in which the continuant is somehow involved in the process participates_in participates in has participant this blood coagulation has participant this blood clot this investigation has participant this investigator this process has participant this input material (or this output material) a relation between a process and a continuant, in which the continuant is somehow involved in the process Has_participant is a primitive instance-level relation between a process, a continuant, and a time at which the continuant participates in some way in the process. The relation obtains, for example, when this particular process of oxygen exchange across this particular alveolar membrane has_participant this particular sample of hemoglobin at this particular time. has_participant http://www.obofoundry.org/ro/#OBO_REL:has_participant has participant A journal article is an information artifact that inheres in some number of printed journals. For each copy of the printed journal there is some quality that carries the journal article, such as a pattern of ink. The journal article (a generically dependent continuant) is concretized as the quality (a specifically dependent continuant), and both depend on that copy of the printed journal (an independent continuant). An investigator reads a protocol and forms a plan to carry out an assay. The plan is a realizable entity (a specifically dependent continuant) that concretizes the protocol (a generically dependent continuant), and both depend on the investigator (an independent continuant). The plan is then realized by the assay (a process). 
A relationship between a generically dependent continuant and a specifically dependent continuant, in which the generically dependent continuant depends on some independent continuant in virtue of the fact that the specifically dependent continuant also depends on that same independent continuant. A generically dependent continuant may be concretized as multiple specifically dependent continuants. is concretized as A journal article is an information artifact that inheres in some number of printed journals. For each copy of the printed journal there is some quality that carries the journal article, such as a pattern of ink. The quality (a specifically dependent continuant) concretizes the journal article (a generically dependent continuant), and both depend on that copy of the printed journal (an independent continuant). An investigator reads a protocol and forms a plan to carry out an assay. The plan is a realizable entity (a specifically dependent continuant) that concretizes the protocol (a generically dependent continuant), and both depend on the investigator (an independent continuant). The plan is then realized by the assay (a process). A relationship between a specifically dependent continuant and a generically dependent continuant, in which the generically dependent continuant depends on some independent continuant in virtue of the fact that the specifically dependent continuant also depends on that same independent continuant. Multiple specifically dependent continuants can concretize the same generically dependent continuant. concretizes this catalysis function is a function of this enzyme a relation between a function and an independent continuant (the bearer), in which the function specifically depends on the bearer for its existence A function inheres in its bearer at all times for which the function exists, however the function need not be realized at all the times that the function exists. 
function_of is function of function of this red color is a quality of this apple a relation between a quality and an independent continuant (the bearer), in which the quality specifically depends on the bearer for its existence A quality inheres in its bearer at all times for which the quality exists. is quality of quality_of quality of this investigator role is a role of this person a relation between a role and an independent continuant (the bearer), in which the role specifically depends on the bearer for its existence A role inheres in its bearer at all times for which the role exists, however the role need not be realized at all the times that the role exists. is role of role_of role of this enzyme has function this catalysis function (more colloquially: this enzyme has this catalysis function) a relation between an independent continuant (the bearer) and a function, in which the function specifically depends on the bearer for its existence A bearer can have many functions, and its functions can exist for different periods of time, but none of its functions can exist when the bearer does not exist. A function need not be realized at all the times that the function exists. has_function has function this apple has quality this red color a relation between an independent continuant (the bearer) and a quality, in which the quality specifically depends on the bearer for its existence A bearer can have many qualities, and its qualities can exist for different periods of time, but none of its qualities can exist when the bearer does not exist. has_quality has quality this person has role this investigator role (more colloquially: this person has this role of investigator) a relation between an independent continuant (the bearer) and a role, in which the role specifically depends on the bearer for its existence A bearer can have many roles, and its roles can exist for different periods of time, but none of its roles can exist when the bearer does not exist. 
A role need not be realized at all the times that the role exists. has_role has role derives from this cell derives from this parent cell (cell division) this nucleus derives from this parent nucleus (nuclear division) a relation between two distinct material entities, the new entity and the old entity, in which the new entity begins to exist when the old entity ceases to exist, and the new entity inherits the significant portion of the matter of the old entity This is a very general relation. More specific relations are preferred when applicable, such as 'directly develops from'. derives_from derives from this parent cell derives into this cell (cell division) this parent nucleus derives into this nucleus (nuclear division) a relation between two distinct material entities, the old entity and the new entity, in which the new entity begins to exist when the old entity ceases to exist, and the new entity inherits the significant portion of the matter of the old entity This is a very general relation. More specific relations are preferred when applicable, such as 'directly develops into'. To avoid making statements about a future that may not come to pass, it is often better to use the backward-looking 'derives from' rather than the forward-looking 'derives into'. derives_into derives into is location of my head is the location of my brain this cage is the location of this rat a relation between two independent continuants, the location and the target, in which the target is entirely within the location Most location relations will only hold at certain times, but this is difficult to specify in OWL. 
See https://code.google.com/p/obo-relations/wiki/ROAndTime location_of location of located in my brain is located in my head this rat is located in this cage a relation between two independent continuants, the target and the location, in which the target is entirely within the location Location as a relation between instances: The primitive instance-level relation c located_in r at t reflects the fact that each continuant is at any given time associated with exactly one spatial region, namely its exact location. Following this, we can use this relation to define a further instance-level location relation - not between a continuant and the region which it exactly occupies, but rather between one continuant and another. c is located in c1, in this sense, whenever the spatial region occupied by c is part_of the spatial region occupied by c1. Note that this relation comprehends both the relation of exact location between one continuant and another which obtains when r and r1 are identical (for example, when a portion of fluid exactly fills a cavity), as well as those sorts of inexact location relations which obtain, for example, between brain and head or between ovum and uterus. Most location relations will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime located_in http://www.obofoundry.org/ro/#OBO_REL:located_in located in move to BFO? Allen A relation that holds between two occurrents. This is a grouping relation that collects together all the Allen relations.
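The instance-level definition quoted above (c is located in c1 whenever the region c occupies is part_of the region c1 occupies) can be sketched directly. Here regions are toy one-dimensional intervals and all names are illustrative; real BFO regions are not intervals, but the part_of-based definition carries over.

```python
# Illustrative sketch of located_in defined via spatial-region parthood.

def part_of(region_a, region_b):
    """Toy 1-D regions as (start, end) intervals: a is part of b if contained.
    Containment includes equality, matching the 'exact location' case where
    r and r1 are identical."""
    return region_b[0] <= region_a[0] and region_a[1] <= region_b[1]

# Invented exact-location assignments for the examples in the text.
occupies = {
    "my brain": (2, 4),
    "my head": (1, 5),
    "this rat": (10, 11),
    "this cage": (9, 12),
}

def located_in(c, c1):
    # c located_in c1 iff the region c occupies is part_of the region c1 occupies.
    return part_of(occupies[c], occupies[c1])
```

Because `part_of` admits equality, every continuant is located in itself at its exact location, and the inexact cases (brain in head, rat in cage) fall out of strict containment.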
temporal relation property to indicate that a design declares a variable; the inverse property is 'is declared by' Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra declares property to indicate the variables declared by a design; the inverse property is 'declares' Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra is declared by the relationship between a fraction and the number above the line Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB has numerator relationship between a planned process and the plan specification that it carries out; it is defined as equivalent to the composed relationship (realizes o concretizes) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB executes This is the inverse of 'specifies value of' and it is intended to say things such as 'compound' 'assumes values specified by' 'independent variable specification' A relation between an entity and a value specification, where the value specification is about the entity. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB assumes values specified by relationship between an element and a set it belongs to Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB is member of relationship between a set and one of its elements Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB has member Inverse relation of 'denotes', where denotation is what happens when someone creates an information content entity E in order to specifically refer to something (from 'denotes' definition). 
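The 'executes' property above is defined as the composed relationship (realizes o concretizes). A minimal sketch of that property chain over explicit relation sets; the pairs are illustrative data, not STATO instances:

```python
# Illustrative sketch of an OWL-style property chain as relation composition.

def compose(rel1, rel2):
    """Relation composition: (a, c) holds whenever (a, b) is in rel1
    and (b, c) is in rel2 for some shared b."""
    return {(a, c) for (a, b1) in rel1 for (b2, c) in rel2 if b1 == b2}

realizes = {("assay run", "plan")}       # a planned process realizes a plan
concretizes = {("plan", "protocol")}     # the plan concretizes a plan specification

# executes = realizes o concretizes: the planned process executes the protocol.
executes = compose(realizes, concretizes)
```

This mirrors how an OWL 2 property chain axiom lets a reasoner infer the composed property from the two asserted ones.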
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra is denoted by the relationship between a fraction and the number below the line (or divisor) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB has denominator has effect on has fixed effect on has interaction effect on has random effect on has order in sequence Relationship between a parameter of a model and the estimate produced by an estimation process, as used in statistical modeling. estimate of computed_from is a relation between two information content entities denoting how one is derived from another through the application of a data transformation or computation process. computed from is model for is modeled by has measurement value has x coordinate value has y coordinate value has specified numeric value A relation between a value specification and a number that quantifies it. A range of 'real' might be better than 'float'. For now we follow 'has measurement value' until we can consider technical issues with SPARQL queries and reasoning. PERSON: James A. Overton OBI has specified numeric value has specified value A relation between a value specification and a literal. This is not an RDF/OWL object property. It is intended to link a value found in e.g. a database column of 'M' (the literal) to an instance of a value specification class, which can then be linked to indicate that this is about the biological gender of a human subject. OBI has specified value A relationship (data property) between an entity and its value. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra has value entity Entity Julius Caesar Verdi’s Requiem the Second World War your body mass index BFO 2 Reference: In all areas of empirical inquiry we encounter general terms of two sorts.
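The 'has specified value' pattern described above links a raw literal such as a database cell 'M' to a value specification instance, which in turn records what the value is about via 'specifies value of'. A minimal sketch, with field names that are illustrative stand-ins for the OBI/STATO properties rather than actual IRIs:

```python
# Illustrative sketch of the value-specification pattern; names are not IRIs.
from dataclasses import dataclass

@dataclass
class ValueSpecification:
    has_specified_value: str   # the literal, e.g. a database column value
    specifies_value_of: str    # the entity the value is about

# A database cell 'M' lifted into a value specification about a subject's
# biological gender (the worked example from the OBI note above).
gender_vs = ValueSpecification(
    has_specified_value="M",
    specifies_value_of="biological gender of subject 42",
)
```

The point of the indirection is that the bare literal 'M' carries no semantics on its own; the value specification instance is what gets typed and linked in the ontology.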
First are general terms which refer to universals or types: animal, tuberculosis, surgical procedure, disease. Second are general terms used to refer to groups of entities which instantiate a given universal but do not correspond to the extension of any subuniversal of that universal because there is nothing intrinsic to the entities in question by virtue of which they – and only they – are counted as belonging to the given group. Examples are: animal purchased by the Emperor, tuberculosis diagnosed on a Wednesday, surgical procedure performed on a patient from Stockholm, person identified as candidate for clinical trial #2056-555, person who is signatory of Form 656-PPV, painting by Leonardo da Vinci. Such terms, which represent what are called ‘specializations’ in [81]. Entity doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. For example Werner Ceusters' 'portions of reality' include 4 sorts, entities (as BFO construes them), universals, configurations, and relations. It is an open question as to whether entities as construed in BFO will at some point also include these other portions of reality. See, for example, 'How to track absolutely everything' at http://www.referent-tracking.com/_RTU/papers/CeustersICbookRevised.pdf An entity is anything that exists or has existed or will exist. (axiom label in BFO2 Reference: [001-001]) entity continuant Continuant An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts. BFO 2 Reference: Continuant entities are entities which can be sliced to yield parts only along the spatial dimension, yielding for example the parts of your table which we call its legs, its top, its nails. ‘My desk stretches from the window to the door. It has spatial parts, and can be sliced (in space) in two. With respect to time, however, a thing is a continuant.’ [60, p.
240] Continuant doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. For example, in an expansion involving bringing in some of Ceusters' other portions of reality, questions are raised as to whether universals are continuants. A continuant is an entity that persists, endures, or continues to exist through time while maintaining its identity. (axiom label in BFO2 Reference: [008-002]) if b is a continuant and if, for some t, c has_continuant_part b at t, then c is a continuant. (axiom label in BFO2 Reference: [126-001]) if b is a continuant and if, for some t, c is continuant_part of b at t, then c is a continuant. (axiom label in BFO2 Reference: [009-002]) if b is a material entity, then there is some temporal interval (referred to below as a one-dimensional temporal region) during which b exists. (axiom label in BFO2 Reference: [011-002]) (forall (x y) (if (and (Continuant x) (exists (t) (continuantPartOfAt y x t))) (Continuant y))) // axiom label in BFO2 CLIF: [009-002] (forall (x y) (if (and (Continuant x) (exists (t) (hasContinuantPartOfAt y x t))) (Continuant y))) // axiom label in BFO2 CLIF: [126-001] (forall (x) (if (Continuant x) (Entity x))) // axiom label in BFO2 CLIF: [008-002] (forall (x) (if (MaterialEntity x) (exists (t) (and (TemporalRegion t) (existsAt x t))))) // axiom label in BFO2 CLIF: [011-002] continuant occurrent Occurrent An entity that has temporal parts and that happens, unfolds or develops through time. BFO 2 Reference: every occurrent that is not a temporal or spatiotemporal region is s-dependent on some independent continuant that is not a spatial region BFO 2 Reference: s-dependence obtains between every process and its participants in the sense that, as a matter of necessity, this process could not have existed unless these or those participants existed also. A process may have a succession of participants at different phases of its unfolding.
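The two continuant-part axioms quoted above ([009-002] and [126-001]) say that continuanthood propagates along continuant parthood in both directions: the parts of a continuant are continuants, and so is anything that has a continuant as part. A toy consistency checker over an explicit part-of graph, with illustrative data:

```python
# Illustrative sketch: checking BFO axiom [009-002] over a small instance set.

continuants = {"my table", "table leg", "table top"}

# Pairs (y, x) meaning: y is continuant_part_of x (at some shared time).
continuant_part_of = {
    ("table leg", "my table"),
    ("table top", "my table"),
}

def violations_009_002():
    """[009-002]: if y is continuant_part_of a continuant x,
    then y must itself be a continuant. Returns offending pairs."""
    return [(y, x) for (y, x) in continuant_part_of
            if x in continuants and y not in continuants]
```

An OWL reasoner enforces this for free via the domain/range and subclass axioms; the explicit check just makes the propagation rule concrete.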
Thus there may be different players on the field at different times during the course of a football game; but the process which is the entire game s-depends_on all of these players nonetheless. Some temporal parts of this process will s-depend_on only some of the players. Occurrent doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. An example would be the sum of a process and the process boundary of another process. Simons uses different terminology for relations of occurrents to regions: Denote the spatio-temporal location of a given occurrent e by 'spn[e]' and call this region its span. We may say an occurrent is at its span, in any larger region, and covers any smaller region. Now suppose we have fixed a frame of reference so that we can speak not merely of spatio-temporal but also of spatial regions (places) and temporal regions (times). The spread of an occurrent (relative to a frame of reference) is the space it exactly occupies, and its spell is likewise the time it exactly occupies. We write 'spr[e]' and 'spl[e]' respectively for the spread and spell of e, omitting mention of the frame. An occurrent is an entity that unfolds itself in time or it is the instantaneous boundary of such an entity (for example a beginning or an ending) or it is a temporal or spatiotemporal region which such an entity occupies_temporal_region or occupies_spatiotemporal_region. (axiom label in BFO2 Reference: [077-002]) Every occurrent occupies_spatiotemporal_region some spatiotemporal region. (axiom label in BFO2 Reference: [108-001]) b is an occurrent entity iff b is an entity that has temporal parts.
(axiom label in BFO2 Reference: [079-001]) (forall (x) (if (Occurrent x) (exists (r) (and (SpatioTemporalRegion r) (occupiesSpatioTemporalRegion x r))))) // axiom label in BFO2 CLIF: [108-001] (forall (x) (iff (Occurrent x) (and (Entity x) (exists (y) (temporalPartOf y x))))) // axiom label in BFO2 CLIF: [079-001] occurrent ic IndependentContinuant a chair a heart a leg a molecule a spatial region an atom an orchestra an organism the bottom right portion of a human torso the interior of your mouth A continuant that is a bearer of qualities and realizable entities, in which other entities inhere and which itself cannot inhere in anything. b is an independent continuant = Def. b is a continuant which is such that there is no c and no t such that b s-depends_on c at t. (axiom label in BFO2 Reference: [017-002]) For any independent continuant b and any time t there is some spatial region r such that b is located_in r at t. (axiom label in BFO2 Reference: [134-001]) For every independent continuant b and time t during the region of time spanned by its life, there are entities which s-depends_on b during t. (axiom label in BFO2 Reference: [018-002]) (forall (x t) (if (IndependentContinuant x) (exists (r) (and (SpatialRegion r) (locatedInAt x r t))))) // axiom label in BFO2 CLIF: [134-001] (forall (x t) (if (and (IndependentContinuant x) (existsAt x t)) (exists (y) (and (Entity y) (specificallyDependsOnAt y x t))))) // axiom label in BFO2 CLIF: [018-002] (iff (IndependentContinuant a) (and (Continuant a) (not (exists (b t) (specificallyDependsOnAt a b t))))) // axiom label in BFO2 CLIF: [017-002] independent continuant s-region SpatialRegion BFO 2 Reference: Spatial regions do not participate in processes. Spatial region doesn't have a closure axiom because the subclasses don't exhaust all possibilities. An example would be the union of a spatial point and a spatial line that doesn't overlap the point, or two spatial lines that intersect at a single point.
In both cases the resultant spatial region is neither 0-dimensional, 1-dimensional, 2-dimensional, nor 3-dimensional. A spatial region is a continuant entity that is a continuant_part_of spaceR as defined relative to some frame R. (axiom label in BFO2 Reference: [035-001]) All continuant parts of spatial regions are spatial regions. (axiom label in BFO2 Reference: [036-001]) (forall (x y t) (if (and (SpatialRegion x) (continuantPartOfAt y x t)) (SpatialRegion y))) // axiom label in BFO2 CLIF: [036-001] (forall (x) (if (SpatialRegion x) (Continuant x))) // axiom label in BFO2 CLIF: [035-001] spatial region 2d-s-region TwoDimensionalSpatialRegion an infinitely thin plane in space. the surface of a sphere-shaped part of space A two-dimensional spatial region is a spatial region that is of two dimensions. (axiom label in BFO2 Reference: [039-001]) (forall (x) (if (TwoDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [039-001] two-dimensional spatial region process Process a process of cell-division, a beating of the heart a process of meiosis a process of sleeping the course of a disease the flight of a bird the life of an organism your process of aging. An occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. p is a process = Def. p is an occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. (axiom label in BFO2 Reference: [083-003]) BFO 2 Reference: The realm of occurrents is less pervasively marked by the presence of natural units than is the case in the realm of independent continuants. Thus there is here no counterpart of ‘object’. In BFO 1.0 ‘process’ served as such a counterpart. In BFO 2.0 ‘process’ is, rather, the occurrent counterpart of ‘material entity’.
Those natural – as contrasted with engineered, which here means: deliberately executed – units which do exist in the realm of occurrents are typically either parasitic on the existence of natural units on the continuant side, or they are fiat in nature. Thus we can count lives; we can count football games; we can count chemical reactions performed in experiments or in chemical manufacturing. We cannot count the processes taking place, for instance, in an episode of insect mating behavior. Even where natural units are identifiable, for example cycles in a cyclical process such as the beating of a heart or an organism’s sleep/wake cycle, the processes in question form a sequence with no discontinuities (temporal gaps) of the sort that we find for instance where billiard balls or zebrafish or planets are separated by clear spatial gaps. Lives of organisms are process units, but they too unfold in a continuous series from other, prior processes such as fertilization, and they unfold in turn in continuous series of post-life processes such as post-mortem decay. Clear examples of boundaries of processes are almost always of the fiat sort (midnight, a time of death as declared in an operating theater or on a death certificate, the initiation of a state of war). (iff (Process a) (and (Occurrent a) (exists (b) (properTemporalPartOf b a)) (exists (c t) (and (MaterialEntity c) (specificallyDependsOnAt a c t))))) // axiom label in BFO2 CLIF: [083-003] process disposition Disposition an atom of element X has the disposition to decay to an atom of element Y certain people have a predisposition to colon cancer children are innately disposed to categorize objects in certain ways. the cell wall is disposed to filter chemicals in endocytosis and exocytosis BFO 2 Reference: Dispositions exist along a strength continuum. Weaker forms of disposition are realized in only a fraction of triggering cases. These forms occur in a significant number of cases of a similar type.
b is a disposition means: b is a realizable entity & b’s bearer is some material entity & b is such that if it ceases to exist, then its bearer is physically changed, & b’s realization occurs when and because this bearer is in some special physical circumstances, & this realization occurs in virtue of the bearer’s physical make-up. (axiom label in BFO2 Reference: [062-002]) If b is a realizable entity then for all t at which b exists, b s-depends_on some material entity at t. (axiom label in BFO2 Reference: [063-002]) (forall (x t) (if (and (RealizableEntity x) (existsAt x t)) (exists (y) (and (MaterialEntity y) (specificallyDepends x y t))))) // axiom label in BFO2 CLIF: [063-002] (forall (x) (if (Disposition x) (and (RealizableEntity x) (exists (y) (and (MaterialEntity y) (bearerOfAt x y t)))))) // axiom label in BFO2 CLIF: [062-002] disposition realizable RealizableEntity the disposition of this piece of metal to conduct electricity. the disposition of your blood to coagulate the function of your reproductive organs the role of being a doctor the role of this boundary to delineate where Utah and Colorado meet A specifically dependent continuant that inheres in continuant entities and is not exhibited in full at every time in which it inheres in an entity or group of entities. The exhibition or actualization of a realizable entity is a particular manifestation, functioning or process that occurs under certain circumstances. To say that b is a realizable entity is to say that b is a specifically dependent continuant that inheres in some independent continuant which is not a spatial region and is of a type instances of which are realized in processes of a correlated type. (axiom label in BFO2 Reference: [058-002]) All realizable dependent continuants have independent continuants that are not spatial regions as their bearers.
(axiom label in BFO2 Reference: [060-002]) (forall (x t) (if (RealizableEntity x) (exists (y) (and (IndependentContinuant y) (not (SpatialRegion y)) (bearerOfAt y x t))))) // axiom label in BFO2 CLIF: [060-002] (forall (x) (if (RealizableEntity x) (and (SpecificallyDependentContinuant x) (exists (y) (and (IndependentContinuant y) (not (SpatialRegion y)) (inheresIn x y)))))) // axiom label in BFO2 CLIF: [058-002] realizable entity 0d-s-region ZeroDimensionalSpatialRegion A zero-dimensional spatial region is a point in space. (axiom label in BFO2 Reference: [037-001]) (forall (x) (if (ZeroDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [037-001] zero-dimensional spatial region quality Quality the ambient temperature of this portion of air the color of a tomato the length of the circumference of your waist the mass of this piece of gold. the shape of your nose the shape of your nostril a quality is a specifically dependent continuant that, in contrast to roles and dispositions, does not require any further process in order to be realized. (axiom label in BFO2 Reference: [055-001]) If an entity is a quality at any time that it exists, then it is a quality at every time that it exists. 
(axiom label in BFO2 Reference: [105-001]) (forall (x) (if (Quality x) (SpecificallyDependentContinuant x))) // axiom label in BFO2 CLIF: [055-001] (forall (x) (if (exists (t) (and (existsAt x t) (Quality x))) (forall (t_1) (if (existsAt x t_1) (Quality x))))) // axiom label in BFO2 CLIF: [105-001] quality sdc SpecificallyDependentContinuant Reciprocal specifically dependent continuants: the function of this key to open this lock and the mutually dependent disposition of this lock: to be opened by this key; of one-sided specifically dependent continuants: the mass of this tomato; of relational dependent continuants (multiple bearers): John’s love for Mary, the ownership relation between John and this statue, the relation of authority between John and his subordinates. the disposition of this fish to decay the function of this heart: to pump blood the mutual dependence of proton donors and acceptors in chemical reactions [79] the mutual dependence of the role predator and the role prey as played by two organisms in a given interaction the pink color of a medium rare piece of grilled filet mignon at its center the role of being a doctor the shape of this hole. the smell of this portion of mozzarella A continuant that inheres in or is borne by other entities. Every instance of A requires some specific instance of B which must always be the same. b is a relational specifically dependent continuant = Def. b is a specifically dependent continuant and there are n > 1 independent continuants c1, …, cn which are not spatial regions and are such that for all 1 ≤ i < j ≤ n, ci and cj share no common parts, and are such that for each 1 ≤ i ≤ n, b s-depends_on ci at every time t during the course of b’s existence (axiom label in BFO2 Reference: [131-004]) b is a specifically dependent continuant = Def. b is a continuant & there is some independent continuant c which is not a spatial region and which is such that b s-depends_on c at every time t during the course of b’s existence.
(axiom label in BFO2 Reference: [050-003]) Specifically dependent continuant doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. We're not sure what else will develop here, but for example there are questions such as what are promises, obligations, etc. (iff (RelationalSpecificallyDependentContinuant a) (and (SpecificallyDependentContinuant a) (forall (t) (exists (b c) (and (not (SpatialRegion b)) (not (SpatialRegion c)) (not (= b c)) (not (exists (d) (and (continuantPartOfAt d b t) (continuantPartOfAt d c t)))) (specificallyDependsOnAt a b t) (specificallyDependsOnAt a c t)))))) // axiom label in BFO2 CLIF: [131-004] (iff (SpecificallyDependentContinuant a) (and (Continuant a) (forall (t) (if (existsAt a t) (exists (b) (and (IndependentContinuant b) (not (SpatialRegion b)) (specificallyDependsOnAt a b t))))))) // axiom label in BFO2 CLIF: [050-003] specifically dependent continuant role Role John’s role of husband to Mary is dependent on Mary’s role of wife to John, and both are dependent on the object aggregate comprising John and Mary as member parts joined together through the relational quality of being married. the priest role the role of a boundary to demarcate two neighboring administrative territories the role of a building in serving as a military target the role of a stone in marking a property boundary the role of subject in a clinical trial the student role A realizable entity the manifestation of which brings about some result or end that is not essential to a continuant in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant in some kinds of natural, social or institutional contexts. BFO 2 Reference: One major family of examples of non-rigid universals involves roles, and ontologies developed for corresponding administrative purposes may consist entirely of representatives of entities of this sort.
Thus ‘professor’, defined as follows: b instance_of professor at t =Def. there is some c, c instance_of professor role & c inheres_in b at t, denotes a non-rigid universal and so also do ‘nurse’, ‘student’, ‘colonel’, ‘taxpayer’, and so forth. (These terms are all, in the jargon of philosophy, phase sortals.) By using role terms in definitions, we can create a BFO conformant treatment of such entities drawing on the fact that, while an instance of professor may be simultaneously an instance of trade union member, no instance of the type professor role is also (at any time) an instance of the type trade union member role (any more than any instance of the type color is at any time an instance of the type length). If an ontology of employment positions should be defined in terms of roles following the above pattern, this enables the ontology to do justice to the fact that individuals instantiate the corresponding universals – professor, sergeant, nurse – only during certain phases in their lives. b is a role means: b is a realizable entity & b exists because there is some single bearer that is in some special physical, social, or institutional set of circumstances in which this bearer does not have to be & b is not such that, if it ceases to exist, then the physical make-up of the bearer is thereby changed. (axiom label in BFO2 Reference: [061-001]) (forall (x) (if (Role x) (RealizableEntity x))) // axiom label in BFO2 CLIF: [061-001] role 1d-s-region OneDimensionalSpatialRegion an edge of a cube-shaped portion of space. A one-dimensional spatial region is a line or aggregate of lines stretching from one point in space to another.
(axiom label in BFO2 Reference: [038-001]) (forall (x) (if (OneDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [038-001] one-dimensional spatial region 3d-s-region ThreeDimensionalSpatialRegion a cube-shaped region of space a sphere-shaped region of space A three-dimensional spatial region is a spatial region that is of three dimensions. (axiom label in BFO2 Reference: [040-001]) (forall (x) (if (ThreeDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [040-001] three-dimensional spatial region gdc GenericallyDependentContinuant The entries in your database are patterns instantiated as quality instances in your hard drive. The database itself is an aggregate of such patterns. When you create the database you create a particular instance of the generically dependent continuant type database. Each entry in the database is an instance of the generically dependent continuant type IAO: information content entity. the pdf file on your laptop, the pdf file that is a copy thereof on my laptop the sequence of this protein molecule; the sequence that is a copy thereof in that protein molecule. A continuant that is dependent on one or another independent continuant bearer. Every instance of A requires some instance of (an independent continuant type) B, but which instance of B serves can change from time to time. b is a generically dependent continuant = Def. b is a continuant that g-depends_on one or more other entities.
(axiom label in BFO2 Reference: [074-001]) (iff (GenericallyDependentContinuant a) (and (Continuant a) (exists (b t) (genericallyDependsOnAt a b t)))) // axiom label in BFO2 CLIF: [074-001] generically dependent continuant function Function the function of a hammer to drive in nails the function of a heart pacemaker to regulate the beating of a heart through electricity the function of amylase in saliva to break down starch into sugar BFO 2 Reference: In the past, we have distinguished two varieties of function, artifactual function and biological function. These are not asserted subtypes of BFO:function however, since the same function – for example: to pump, to transport – can exist both in artifacts and in biological entities. The asserted subtypes of function that would be needed in order to yield a separate monohierarchy are not artifactual function, biological function, etc., but rather transporting function, pumping function, etc. A function is a disposition that exists in virtue of the bearer’s physical make-up and this physical make-up is something the bearer possesses because it came into being, either through evolution (in the case of natural biological entities) or through intentional design (in the case of artifacts), in order to realize processes of a certain sort. (axiom label in BFO2 Reference: [064-001]) (forall (x) (if (Function x) (Disposition x))) // axiom label in BFO2 CLIF: [064-001] function material MaterialEntity a flame a forest fire a human being a hurricane a photon a puff of smoke a sea wave a tornado an aggregate of human beings an energy wave an epidemic the undetached arm of a human being An independent continuant that is spatially extended whose identity is independent of that of other entities and can be maintained through time. BFO 2 Reference: Material entities (continuants) can preserve their identity even while gaining and losing material parts.
Continuants are contrasted with occurrents, which unfold themselves in successive temporal parts or phases [60]. BFO 2 Reference: Object, Fiat Object Part and Object Aggregate are not intended to be exhaustive of Material Entity. Users are invited to propose new subcategories of Material Entity. BFO 2 Reference: ‘Matter’ is intended to encompass both mass and energy (we will address the ontological treatment of portions of energy in a later version of BFO). A portion of matter is anything that includes elementary particles among its proper or improper parts: quarks and leptons, including electrons, as the smallest particles thus far discovered; baryons (including protons and neutrons) at a higher level of granularity; atoms and molecules at still higher levels, forming the cells, organs, organisms and other material entities studied by biologists, the portions of rock studied by geologists, the fossils studied by paleontologists, and so on. Material entities are three-dimensional entities (entities extended in three spatial dimensions), as contrasted with the processes in which they participate, which are four-dimensional entities (entities extended also along the dimension of time). According to the FMA, material entities may have immaterial entities as parts – including the entities identified below as sites; for example the interior (or ‘lumen’) of your small intestine is a part of your body. BFO 2.0 embodies a decision to follow the FMA here. A material entity is an independent continuant that has some portion of matter as proper or improper continuant part. (axiom label in BFO2 Reference: [019-002]) Every entity which has a material entity as continuant part is a material entity. (axiom label in BFO2 Reference: [020-002]) Every entity of which a material entity is continuant part is also a material entity.
(axiom label in BFO2 Reference: [021-002]) (forall (x) (if (MaterialEntity x) (IndependentContinuant x))) // axiom label in BFO2 CLIF: [019-002] (forall (x) (if (and (Entity x) (exists (y t) (and (MaterialEntity y) (continuantPartOfAt x y t)))) (MaterialEntity x))) // axiom label in BFO2 CLIF: [021-002] (forall (x) (if (and (Entity x) (exists (y t) (and (MaterialEntity y) (continuantPartOfAt y x t)))) (MaterialEntity x))) // axiom label in BFO2 CLIF: [020-002] material entity immaterial ImmaterialEntity BFO 2 Reference: Immaterial entities are divided into two subgroups: boundaries and sites, which bound, or are demarcated in relation to, material entities, and which can thus change location, shape and size as their material hosts move or change shape or size (for example: your nasal passage; the hold of a ship; the boundary of Wales (which moves with the rotation of the Earth)) [38, 7, 10] immaterial entity peptide Amide derived from two or more amino carboxylic acid molecules (the same or different) by formation of a covalent bond from the carbonyl carbon of one to the nitrogen atom of another with formal loss of water. The term is usually applied to structures formed from alpha-amino acids, but it includes those derived from any amino carboxylic acid. X = OH, OR, NH2, NHR, etc. peptide deoxyribonucleic acid High molecular weight, linear polymers, composed of nucleotides containing deoxyribose and linked by phosphodiester bonds; DNA contains the genetic information of organisms. deoxyribonucleic acid molecular entity Any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity. We are assuming that every molecular entity has to be completely connected by chemical bonds. This excludes protein complexes, which are comprised of minimally two separate molecular entities.
We will follow up with ChEBI to ensure this is their understanding as well molecular entity atom A chemical entity constituting the smallest component of an element having the chemical properties of the element. atom nucleic acid A macromolecule made up of nucleotide units and hydrolysable into certain pyrimidine or purine bases (usually adenine, cytosine, guanine, thymine, uracil), D-ribose or 2-deoxy-D-ribose and phosphoric acid. nucleic acid ribonucleic acid High molecular weight, linear polymers, composed of nucleotides containing ribose and linked by phosphodiester bonds; RNA is central to the synthesis of proteins. ribonucleic acid macromolecule A macromolecule is a molecule of high relative molecular mass, the structure of which essentially comprises the multiple repetition of units derived, actually or conceptually, from molecules of low relative molecular mass. polymer macromolecule cell cell PMID:18089833. Cancer Res. 2007 Dec 15;67(24):12018-25. "...Epithelial cells were harvested from histologically confirmed adenocarcinomas .." A material entity of anatomical origin (part of or deriving from an organism) that has as its parts a maximally connected cell compartment surrounded by a plasma membrane. cell cell cultured cell A cell in vitro that is or has been maintained or propagated as part of a cell culture. cultured cell experimentally modified cell in vitro A cell in vitro that has undergone physical changes as a consequence of a deliberate and specific experimental procedure. experimentally modified cell in vitro molecular_function A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process.
GO:molecular_function catalytic activity Catalysis of a biochemical reaction at physiological temperatures. In biologically catalyzed reactions, the reactants are known as substrates, and the catalysts are naturally occurring macromolecular substances known as enzymes. Enzymes possess specific binding sites for substrates, and are usually composed wholly or largely of protein, but RNA that has catalytic activity (ribozyme) is often also regarded as enzymatic. catalytic activity biological_process A biological process represents a specific objective that the organism is genetically programmed to achieve. Biological processes are often described by their outcome or ending state, e.g., the biological process of cell division results in the creation of two daughter cells (a divided cell) from a single parent cell. A biological process is accomplished by a particular set of molecular functions carried out by specific gene products (or macromolecular complexes), often in a highly regulated manner and in a particular temporal sequence. biological_process gene expression The process in which a gene's sequence is converted into a mature gene product or products (proteins or RNA). This includes the production of an RNA transcript as well as any processing to produce a mature RNA product or an mRNA or circRNA (for protein-coding genes) and the translation of that mRNA or circRNA into protein. Protein maturation is included when required to form an active form of a product from an inactive precursor form. gene expression protein complex A ribosome is a protein complex A stable macromolecular complex composed (only) of two or more polypeptide subunits along with any covalently attached molecules (such as lipid anchors or oligosaccharide) or non-protein prosthetic groups (such as nucleotides or metal ions). Prosthetic group in this context refers to a tightly bound cofactor. The component polypeptide subunits may be identical. 
protein complex conditional specification a directive information entity that specifies what should happen if the trigger condition is fulfilled PlanAndPlannedProcess Branch OBI branch derived OBI_0000349 conditional specification measurement unit label Examples of measurement unit labels are liters, inches, weight per volume. A measurement unit label is a label that is part of a scalar measurement datum and denotes a unit of measure. 2009-03-16: provenance: a term measurement unit was proposed for OBI (OBI_0000176), edited by Chris Stoeckert and Cristian Cocos, and subsequently moved to IAO where the objective for which the original term was defined was satisfied with the definition of this, different, term. 2009-03-16: review of this term done during the OBI workshop winter 2009 and the current definition was considered acceptable for use in OBI. If there is a need to modify this definition please notify OBI. PERSON: Alan Ruttenberg PERSON: Melanie Courtot measurement unit label objective specification In the protocol of a ChIP assay the objective specification says to identify protein and DNA interaction. a directive information entity that describes an intended process endpoint. When part of a plan specification the concretization is realized in a planned process in which the bearer tries to effect the world so that the process endpoint is achieved. 2009-03-16: original definition when imported from OBI read: "objective is an non realizable information entity which can serve as that proper part of a plan towards which the realization of the plan is directed." 2014-03-31: In the example of usage ("In the protocol of a ChIP assay the objective specification says to identify protein and DNA interaction") there is a protocol which is the ChIP assay protocol. In addition to being concretized on paper, the protocol can be concretized as a realizable entity, such as a plan that inheres in a person.
The objective specification is the part that says that some protein and DNA interactions are identified. This is a specification of a process endpoint: the boundary in the process before which they are not identified and after which they are. During the realization of the plan, the goal is to get to the point of having the interactions, and participants in the realization of the plan try to do that. Answers the question, why did you do this experiment? PERSON: Alan Ruttenberg PERSON: Barry Smith PERSON: Bjoern Peters PERSON: Jennifer Fostel goal specification OBI Plan and Planned Process/Roles Branch OBI_0000217 objective specification Pour the contents of flask 1 into flask 2 a directive information entity that describes an action the bearer will take Alan Ruttenberg OBI Plan and Planned Process branch action specification datum label A label is a symbol that is part of some other datum and is used to either partially define the denotation of that datum or to provide a means for identifying the datum as a member of the set of data with the same label http://www.golovchenko.org/cgi-bin/wnsearch?q=label#4n GROUP: IAO 9/22/11 BP: changed the rdfs:label for this class from 'label' to 'datum label' to convey that this class is not intended to cover all kinds of labels (stickers, radiolabels, etc.), and not even all kinds of textual labels, but rather the kind of labels occurring in a datum. datum label information carrier In the case of a printed paperback novel the physicality of the ink and of the paper form part of the information bearer. The qualities of appearing black and having a certain pattern for the ink and appearing white for the paper form part of the information carrier in this case. A quality of an information bearer that imparts the information content 12/15/09: There is a concern that some ways that carry information may be processes rather than qualities, such as in a 'delayed wave carrier'.
2014-03-10: We are not certain that all information carriers are qualities. There was a discussion of dropping it. PERSON: Alan Ruttenberg Smith, Ceusters, Ruttenberg, 2000 years of philosophy information carrier data item Data items include counts of things, analyte concentrations, and statistical summaries. a data item is an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements. 2/2/2009 Alan and Bjoern discussing FACS run output data. This is a data item because it is about the cell population. Each element records an event and is typically further composed of a set of measurement data items that record the fluorescent intensity stimulated by one of the lasers. 2009-03-16: data item deliberately ambiguous: we merged data set and datum to be one entity, not knowing how to define singular versus plural. So data item is more general than datum. 2009-03-16: removed datum as alternative term as datum specifically refers to singular form, and is thus not an exact synonym. 2014-03-31: See discussion at http://odontomachus.wordpress.com/2014/03/30/aboutness-objects-propositions/ JAR: datum -- well, this will be very tricky to define, but maybe some information-like stuff that might be put into a computer and that is meant, by someone, to denote and/or to be interpreted by some process... I would include lists, tables, sentences... I think I might defer to Barry, or to Brian Cantwell Smith JAR: A data item is an approximately justified approximately true approximate belief PERSON: Alan Ruttenberg PERSON: Chris Stoeckert PERSON: Jonathan Rees data data item symbol a serial number such as "12324X" a stop sign a written proper name such as "OBI" An information content entity that is a mark(s) or character(s) used as a conventional representation of another entity.
20091104, MC: this needs work and will most probably change 2014-03-31: We would like to have a deeper analysis of 'mark' and 'sign' in the future (see https://github.com/information-artifact-ontology/IAO/issues/154). PERSON: James A. Overton PERSON: Jonathan Rees based on Oxford English Dictionary symbol information content entity Examples of information content entities include journal articles, data, graphical layouts, and graphs. A generically dependent continuant that is about some thing. 2014-03-10: The use of "thing" is intended to be general enough to include universals and configurations (see https://groups.google.com/d/msg/information-ontology/GBxvYZCk1oc/-L6B5fSBBTQJ). information_content_entity 'is_encoded_in' some digital_entity in obi before split (040907). information_content_entity 'is_encoded_in' some physical_document in obi before split (040907). Previous. An information content entity is a non-realizable information entity that 'is encoded in' some digital or physical entity. PERSON: Chris Stoeckert OBI_0000142 information content entity 1 1 10 feet. 3 ml. a scalar measurement datum is a measurement datum that is composed of two parts, numerals and a unit label. 2009-03-16: we decided to keep datum singular in scalar measurement datum, as in this case we explicitly refer to the singular form Would write this as: has_part some 'measurement unit label' and has_part some numeral and has_part exactly 2, except for the fact that this won't let us take advantage of OWL reasoning over the numbers. Instead use the has measurement value property to represent the same. Use has measurement unit label (subproperty of has_part) so we can easily say that there is only one of them. PERSON: Alan Ruttenberg PERSON: Melanie Courtot scalar measurement datum An information content entity whose concretizations indicate to their bearer how to realize them in a process.
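The two-part composition of a scalar measurement datum described above (a numeric value plus a measurement unit label, linked via 'has measurement value' and 'has measurement unit label') can be sketched in Python. This is a minimal illustrative sketch only; the class and attribute names below are hypothetical, not STATO/IAO identifiers:

```python
from dataclasses import dataclass

# Illustrative sketch: a scalar measurement datum has exactly two parts,
# a numeric value ('has measurement value') and a measurement unit label
# ('has measurement unit label'). All names here are hypothetical.
@dataclass(frozen=True)
class ScalarMeasurementDatum:
    has_measurement_value: float
    has_measurement_unit_label: str

    def __str__(self) -> str:
        return f"{self.has_measurement_value} {self.has_measurement_unit_label}"

# e.g. the examples given above: 10 feet, 3 ml
height = ScalarMeasurementDatum(10.0, "feet")
volume = ScalarMeasurementDatum(3.0, "ml")
```

Keeping the unit label as a separate part (rather than folding it into a string like "10 feet") mirrors the ontology's intent that the unit label is itself a distinct information content entity.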
2009-03-16: provenance: a term realizable information entity was proposed for OBI (OBI_0000337), edited by the PlanAndPlannedProcess branch. Original definition was "is the specification of a process that can be concretized and realized by an actor" with alternative term "instruction". It has been subsequently moved to IAO where the objective for which the original term was defined was satisfied with the definition of this, different, term. 2013-05-30 Alan Ruttenberg: What differentiates a directive information entity from an information concretization is that it can have concretizations that are either qualities or realizable entities. The concretizations that are realizable entities are created when an individual chooses to take up the direction, i.e. has the intention to (try to) realize it. 8/6/2009 Alan Ruttenberg: Changed label from "information entity about a realizable" after discussions at ICBO. Werner pushed back on calling it realizable information entity as it isn't realizable. However this name isn't right either. An example would be a recipe. The realizable entity would be a plan, but the information entity isn't about the plan, it, once concretized, *is* the plan. -Alan PERSON: Alan Ruttenberg PERSON: Bjoern Peters directive information entity dot plot Dot plot of SSC-H and FSC-H. A dot plot is a report graph which is a graphical representation of data where each data point is represented by a single dot placed on coordinates corresponding to data point values in particular dimensions. person:Allyson Lister person:Chris Stoeckert OBI_0000123 group:OBI dot plot graph A diagram that presents one or more tuples of information by mapping those tuples into a two dimensional space in a non arbitrary way. PERSON: Lawrence Hunter person:Alan Ruttenberg person:Allyson Lister OBI_0000240 group:OBI graph rule example to be added a rule is an executable which guides, defines, restricts actions MSI PRS OBI_0500021 PRS rule algorithm PMID: 18378114. Genomics.
2008 Mar 28. LINKGEN: A new algorithm to process data in genetic linkage studies. A plan specification which describes the inputs and output of mathematical functions as well as workflow of execution for achieving a predefined objective. Algorithms are realized usually by means of implementation as computer programs for execution by automata. Philippe Rocca-Serra PlanAndPlannedProcess Branch OBI_0000270 adapted from discussion on OBI list (Matthew Pocock, Christian Cocos, Alan Ruttenberg) algorithm curation status specification The curation status of the term. The allowed values come from an enumerated list of predefined terms. See the specification of these instances for more detailed definitions of each enumerated value. Better to represent curation as a process with parts and then relate labels to that process (in IAO meeting) PERSON:Bill Bug GROUP:OBI:<http://purl.obolibrary.org/obo/obi> OBI_0000266 curation status specification data set Intensity values in a CEL file or from multiple CEL files comprise a data set (as opposed to the CEL files themselves). A data item that is an aggregate of other data items of the same type that have something in common. Averages and distributions can be determined for data sets. 2009/10/23 Alan Ruttenberg. The intention is that this term represent collections of like data. So this isn't for, e.g. the whole contents of a cel file, which includes parameters, metadata etc. This is more like Java arrays of a certain rather specific type. 2014-05-05: Data sets are aggregates and thus must include two or more data items. We have chosen not to add logical axioms to make this restriction. person:Allyson Lister person:Chris Stoeckert OBI_0000042 group:OBI data set image An image is an affine projection to a two dimensional surface, of measurements of some quality of an entity or entities repeated at regular intervals across a spatial range, where the measurements are represented as color and luminosity on the projected surface.
person:Alan Ruttenberg person:Allyson person:Chris Stoeckert OBI_0000030 group:OBI image data about an ontology part is a data item about a part of an ontology, for example a term Person:Alan Ruttenberg data about an ontology part plan specification PMID: 18323827. Nat Med. 2008 Mar;14(3):226. New plan proposed to help resolve conflicting medical advice. A directive information entity with action specifications and objective specifications as parts that, when concretized, is realized in a process in which the bearer tries to achieve the objectives by taking the actions specified. 2009-03-16: provenance: a term a plan was proposed for OBI (OBI_0000344), edited by the PlanAndPlannedProcess branch. Original definition was " a plan is a specification of a process that is realized by an actor to achieve the objective specified as part of the plan". It has been subsequently moved to IAO where the objective for which the original term was defined was satisfied with the definition of this, different, term. 2014-03-31: A plan specification can have other parts, such as conditional specifications. Alternative previous definition: a plan is a set of instructions that specify how an objective should be achieved Alan Ruttenberg OBI Plan and Planned Process branch OBI_0000344 2/3/2009 Comment from OBI review. Action specification not well enough specified. Conditional specification not well enough specified. Question whether all plan specifications have objective specifications. Request that IAO either clarify these or change definitions not to use them plan specification measurement datum Examples of measurement data are the recording of the weight of a mouse as {40,mass,"grams"}, the recording of an observation of the behavior of the mouse {,process,"agitated"}, the recording of the expression level of a gene as measured through the process of microarray experiment {3.4,luminosity,}.
A measurement datum is an information content entity that is a recording of the output of a measurement such as produced by a device. 2/2/2009 is_specified_output of some assay? person:Chris Stoeckert OBI_0000305 group:OBI measurement datum version number A version number is an information content entity which is a sequence of characters borne by part of each of a class of manufactured products or its packaging and indicates its order within a set of other products having the same name. Note: we feel that at the moment we are happy with a general version number, and that we will subclass as needed in the future. For example, see 7. genome sequence version GROUP: IAO version number conclusion textual entity that fucoidan has a small statistically significant effect on AT3 level but no useful clinical effect as in-vivo anticoagulant, a paraphrase of part of the last paragraph of the discussion section of the paper 'Pilot clinical study to evaluate the anticoagulant activity of fucoidan', by Lowenthal et al. PMID:19696660 A textual entity that expresses the results of reasoning about a problem, for instance as typically found towards the end of scientific papers. 2009/09/28 Alan Ruttenberg. Fucoidan-use-case 2009/10/23 Alan Ruttenberg: We need to work on the definition still Person:Alan Ruttenberg conclusion textual entity scatter plot Comparison of gene expression values in two samples can be displayed in a scatter plot. A scatterplot is a graph which uses Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
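The point mapping just described for a scatter plot (the i-th value of one variable gives the horizontal position, the i-th value of the other gives the vertical position) can be sketched in plain Python without a plotting library. The function name is hypothetical, for illustration only:

```python
# Illustrative sketch: a scatter plot displays each data point as a
# Cartesian coordinate pair, one variable on the horizontal axis and
# the other on the vertical axis. 'scatter_points' is a hypothetical name.
def scatter_points(horizontal, vertical):
    """Pair the i-th values of the two variables into (x, y) points."""
    if len(horizontal) != len(vertical):
        raise ValueError("both variables need one value per data point")
    return list(zip(horizontal, vertical))

# e.g. expression values for the same genes measured in two samples
points = scatter_points([1.0, 2.5, 4.0], [1.2, 2.4, 3.9])
```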
PERSON:Chris Stoeckert PERSON:James Malone PERSON:Melanie Courtot scattergraph WEB: http://en.wikipedia.org/wiki/Scatterplot scatter plot textual entity Words, sentences, paragraphs, and the written (non-figure) parts of publications are all textual entities A textual entity is a part of a manifestation (FRBR sense), a generically dependent continuant whose concretizations are patterns of glyphs intended to be interpreted as words, formulas, etc. AR, (IAO call 2009-09-01): a document as a whole is not typically a textual entity, because it has pictures in it - rather there are parts of it that are textual entities. Examples: The title, paragraph 2 sentence 7, etc. MC, 2009-09-14 (following IAO call 2009-09-01): textual entities live at the FRBR (http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records) manifestation level. Everything is significant: line break, pdf and html versions of same document are different textual entities. PERSON: Lawrence Hunter text textual entity table
  | T F
--+-----
T | T F
F | F F
A textual entity that contains a two-dimensional arrangement of texts repeated at regular intervals across a spatial range, such that the spatial relationships among the constituent texts express propositions PERSON: Lawrence Hunter table figure Any picture, diagram or table An information content entity consisting of a two dimensional arrangement of information content entities such that the arrangement itself is about something. PERSON: Lawrence Hunter figure diagram A molecular structure ribbon cartoon showing helices, turns and sheets and their relations to each other in space.
A figure that expresses one or more propositions PERSON: Lawrence Hunter diagram document A journal article, patent application, laboratory notebook, or a book A collection of information content entities intended to be understood together as a whole PERSON: Lawrence Hunter document 1 A cartesian spatial coordinate datum is a representation of a point in a spatial region, in which equal changes in the magnitude of a coordinate value denote length qualities with the same magnitude 2009-08-18 Alan Ruttenberg - question to BFO list about whether the BFO sense of the lower dimensional regions is that they are always part of actual space (the three dimensional sort) http://groups.google.com/group/bfo-discuss/browse_thread/thread/9d04e717e39fb617 Alan Ruttenberg AR notes: We need to discuss whether it should include site. cartesian spatial coordinate datum http://groups.google.com/group/bfo-discuss/browse_thread/thread/9d04e717e39fb617 1 A cartesian spatial coordinate datum that uses one value to specify a position along a one dimensional spatial region Alan Ruttenberg one dimensional cartesian spatial coordinate datum 1 1 A cartesian spatial coordinate datum that uses two values to specify a position within a two dimensional spatial region Alan Ruttenberg two dimensional cartesian spatial coordinate datum A scalar measurement datum that is the result of measurement of mass quality 2009/09/28 Alan Ruttenberg. Fucoidan-use-case Person:Alan Ruttenberg mass measurement datum A scalar measurement datum that is the result of measuring a temporal interval 2009/09/28 Alan Ruttenberg. Fucoidan-use-case Person:Alan Ruttenberg time measurement datum Recording the current temperature in a laboratory notebook. Writing a journal article. Updating a patient record in a database. a planned process in which a document is created or added to by including the specified input in it. 6/11/9: Edited at OBI workshop.
We need to be able to identify a child form of information artifact which corresponds to something enduring (not brain like). This used to be restricted to physical document or digital entity as the output, but that excludes e.g. an audio cassette tape Bjoern Peters wikipedia http://en.wikipedia.org/wiki/Documenting documenting line graph A line graph is a type of graph created by connecting a series of data points together with a line. PERSON:Chris Stoeckert PERSON:Melanie Courtot line chart GROUP:OBI WEB: http://en.wikipedia.org/wiki/Line_chart line graph The sentence "The article has Pubmed ID 12345." contains a CRID that has two parts: one part is the CRID symbol, which is '12345'; the other part denotes the CRID registry, which is Pubmed. A symbol that is part of a CRID and that is sufficient to look up a record from the CRID's registry. PERSON: Alan Ruttenberg PERSON: Bill Hogan PERSON: Bjoern Peters PERSON: Melanie Courtot CRID symbol Original proposal from Bjoern, discussions at IAO calls centrally registered identifier symbol The sentence "The article has Pubmed ID 12345." contains a CRID that has two parts: one part is the CRID symbol, which is '12345'; the other part denotes the CRID registry, which is Pubmed. An information content entity that consists of a CRID symbol and additional information about the CRID registry to which it belongs. 2014-05-05: In defining this term we take no position on what the CRID denotes. In particular, do not assume it denotes a *record* in the CRID registry (since the registry might not have 'records'). Alan, IAO call 20101124: potentially the CRID denotes the instance it was associated with during creation. Note, IAO call 20101124: URIs are not always CRID, as not centrally registered. We acknowledge that CRID is a subset of a larger identifier class, but this subset fulfills our current needs. OBI PURLs are CRID as they are registered with OCLC.
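The two-part structure of a CRID described above (a CRID symbol plus a reference to the registry that assigned it) can be sketched as follows. This is a hypothetical illustration; the class and field names are not IAO identifiers:

```python
from dataclasses import dataclass

# Illustrative sketch: a centrally registered identifier (CRID) consists
# of a CRID symbol and information about the CRID registry it belongs to.
# Class and field names are hypothetical.
@dataclass(frozen=True)
class CentrallyRegisteredIdentifier:
    symbol: str    # e.g. '12345', sufficient to look up a record
    registry: str  # e.g. 'PubMed', the registry the symbol belongs to

    def __str__(self) -> str:
        return f"{self.registry}:{self.symbol}"

# e.g. the example above: "The article has Pubmed ID 12345."
pubmed_crid = CentrallyRegisteredIdentifier("12345", "PubMed")
```

Note that, as the editor notes caution, the symbol alone ('12345') is not globally meaningful; it is only interpretable together with its registry.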
UPCs (Universal Product Codes from AC Nielsen) are not CRID as they are not centrally registered. PERSON: Alan Ruttenberg PERSON: Bill Hogan PERSON: Bjoern Peters PERSON: Melanie Courtot CRID Original proposal from Bjoern, discussions at IAO calls centrally registered identifier PubMed is a CRID registry. It has a dataset of PubMed identifiers associated with journal articles. A CRID registry is a dataset of CRID records, each consisting of a CRID symbol and additional information which was recorded in the dataset through an 'assigning a centrally registered identifier' process. PERSON: Alan Ruttenberg PERSON: Bill Hogan PERSON: Bjoern Peters PERSON: Melanie Courtot CRID registry Original proposal from Bjoern, discussions at IAO calls centrally registered identifier registry time stamped measurement datum pmid:20604925 - time-lapse live cell microscopy A data set that is an aggregate of data recording some measurement at a number of time points. The time series data set is an ordered list of pairs of time measurement data and the corresponding measurement data acquired at that time. Alan Ruttenberg experimental time series time sampled measurement data set Viruses Viruses Euteleostomi bony vertebrates Euteleostomi Bacteria eubacteria Bacteria Archaea Archaea Eukaryota eucaryotes eukaryotes Eukaryota Euarchontoglires Euarchontoglires Tetrapoda tetrapods Tetrapoda Amniota amniotes Amniota Opisthokonta Opisthokonta Bilateria Bilateria Mammalia mammals Mammalia Vertebrata <Metazoa> Vertebrata vertebrates Vertebrata <Metazoa> Homo sapiens human human being man Homo sapiens fluorescent reporter intensity A measurement datum that represents the output of a scanner measuring the intensity value for each fluorescent reporter. person:Chris Stoeckert group:OBI From the DT branch: This term and definition were originally submitted by the community to our branch, but we thought they best fit DENRIE. However we see several issues with this.
First of all, the name 'probe' might not be used in OBI. Instead we have a 'reporter' role. Also, although the term 'probe intensity' is often used in communities such as the microarray one, the name 'probe' is ambiguous (some use it to refer to what's on the array, some use it to refer to what's hybridized to the array). Furthermore, this concept could possibly be encompassed by combining different OBI terms, such as the roles of analyte, detector and reporter (you need something hybridized to a probe on the array to get an intensity) and maybe a more general term for 'measuring intensities'. We need to find the right balance between what is consistent with OBI and combinations of its terms and what is user-friendly. Finally, note that 'intensity' is already in the OBI .owl file and is also in PATO. Why didn't OBI import it from PATO? This might be a problem. fluorescent reporter intensity planned process planned process Injecting mice with a vaccine in order to test its efficacy A processual entity that realizes a plan which is the concretization of a plan specification. 'Plan' includes a future direction sense. That can be problematic if plans are changed during their execution. There are however implicit contingencies for protocols that an agent has in his mind that can be considered part of the plan, even if the agent didn't have them in mind before. Therefore, a planned process can diverge from what the agent would have said the plan was before executing it, by adjusting to problems encountered during execution (e.g. choosing another reagent with equivalent properties, if the originally planned one has run out.) We are only considering successfully completed planned processes. A plan may be modified, and details added during execution. For a given planned process, the associated realized plan specification is the one encompassing all changes made during execution. This means that any process in which an agent acts towards achieving some objective is a planned process.
Bjoern Peters branch derived 6/11/9: Edited at workshop. Used to include: is initiated by an agent This class merges the previously separated objective driven process and planned process, as the separation proved hard to maintain. (1/22/09, branch call) planned process biological feature identification objective Biological_feature_identification_objective is an objective role carried out by the proposition defining the aim of a study designed to examine or characterize a particular biological feature. Jennifer Fostel biological feature identification objective processed material Examples include gel matrices, filter paper, parafilm and buffer solutions, mass spectrometer, tissue samples A material entity that is created or changed during material processing. PERSON: Alan Ruttenberg processed material investigation Lung cancer investigation using expression profiling, a stem cell transplant investigation, biobanking is not an investigation, though it may be part of an investigation a planned process that consists of parts: planning, study design execution, documentation, and which produces conclusion(s). Bjoern Peters OBI branch derived Could add specific objective specification Following OBI call November 26th, 2012: it was decided there was no need for adding "achieves objective of drawing conclusion" as existing relations were providing equivalent ability. this note closes the issue and validates the class definition to be part of the OBI core editor = PRS study investigation evaluant role When a specimen of blood is assayed for glucose concentration, the blood has the evaluant role. When measuring the mass of a mouse, the evaluant is the mouse. When measuring the time of DNA replication, the evaluant is the DNA. When measuring the intensity of light on a surface, the evaluant is the light source.
a role that inheres in a material entity that is realized in an assay in which data is generated about the bearer of the evaluant role Role call - 17nov-08: JF and MC think an evaluant role is always specified input of a process. Even in the case where we have an assay taking blood as evaluant and outputting blood, the blood is not the specified output at the end of the assay (the concentration of glucose in the blood is) examples of features that could be described in an evaluant: a quality, e.g. "contains 10 pg/ml IL2", or "no glucose detected" GROUP: Role Branch OBI Feb 10, 2009. changes after discussion at OBI Consortium Workshop Feb 2-6, 2009. accepted as core term. evaluant role assay Assay the wavelength of light emitted by excited neon atoms. Count of geese flying over a house. A planned process with the objective to produce information about the material entity that is the evaluant, by physically examining it or its proxies. 12/3/12: BP: the reference to the 'physical examination' is included to point out that a prediction is not an assay, as that does not require physical examination. PlanAndPlannedProcess Branch measuring scientific observation OBI branch derived study assay any method assay quantitative confidence value A data item which is used to indicate the degree of uncertainty about a measurement. person:Chris Stoeckert group:OBI quantitative confidence value culture medium A growth medium or culture medium is a substance in which microorganisms or cells can grow. Wikipedia, growth medium, Feb 29, 2008 a processed material that provides the needed nourishment for microorganisms or cells grown in vitro. changed from a role to a processed material based on the Aug 22, 2011 dev call. Details see the tracker item: http://sourceforge.net/tracker/?func=detail&aid=3325270&group_id=177891&atid=886178 Modification made by JZ. Person: Jennifer Fostel, Jie Zheng OBI culture medium reagent role Buffer, dye, a catalyst, a solvating agent.
A role inhering in a biological or chemical entity that is intended to be applied in a scientific technique to participate (or have molecular components that participate) in a chemical reaction that facilitates the generation of data about some entity distinct from the bearer, or the generation of some specified material output distinct from the bearer. PERSON:Matthew Brush reagent PERSON:Matthew Brush Feb 10, 2009. changes after discussion at OBI Consortium Workshop Feb 2-6, 2009. accepted as core term. May 28 2013. Updated definition taken from ReO based on discussions initiated in Philly 2011 workshop. Former definition described a narrower view of reagents in chemistry that restricts bearers of the role to be chemical entities ("a role played by a molecular entity used to produce a chemical reaction to detect, measure, or produce other substances"). Updated definition allows for broader view of reagents in the domain of biomedical research to include larger materials that have parts that participate chemically in a molecular reaction or interaction. (copied from ReO) Reagents are distinguished from instruments or devices that also participate in scientific techniques by the fact that reagents are chemical or biological in nature and necessarily participate in or have parts that participate in some chemical interaction or reaction during their intended participation in some technique. By contrast, instruments do not participate in a chemical reaction/interaction during the technique. Reagents are distinguished from study subjects/evaluants in that study subjects and evaluants are that about which conclusions are drawn and knowledge is sought in an investigation - while reagents, by definition, are not. It should be noted, however, that reagent and study subject/evaluant roles can be borne by instances of the same type of material entity - but a given instance will realize only one of these roles in the execution of a given assay or technique.
For example, taq polymerase can bear a reagent role or an evaluant role. In a DNA sequencing assay aimed at generating sequence data about some plasmid, the reagent role of the taq polymerase is realized. In an assay to evaluate the quality of the taq polymerase itself, the evaluant/study subject role of the taq is realized, but not the reagent role since the taq is the subject about which data is generated. In regard to the statement that reagents are 'distinct' from the specified outputs of a technique, note that a reagent may be incorporated into a material output of a technique, as long as the IDENTITY of this output is distinct from that of the bearer of the reagent role. For example, dNTPs input into a PCR are reagents that become part of the material output of this technique, but this output has a new identity (i.e. that of a 'nucleic acid molecule') that is distinct from the identity of the dNTPs that comprise it. Similarly, a biotin molecule input into a cell labeling technique is a reagent that becomes part of the specified output, but the identity of the output is that of some modified cell specimen which shares identity with the input unmodified cell specimen, and not with the biotin label. Thus, we see that an important criterion of 'reagent-ness' is that it is a facilitator, and not the primary focus of an investigation or material processing technique (i.e. not the specified subject/evaluant about which knowledge is sought, or the specified output material of the technique). reagent role material processing A cell lysis, production of a cloning vector, creating a buffer.
A planned process which results in physical changes in a specified input material PERSON: Bjoern Peters PERSON: Frank Gibson PERSON: Jennifer Fostel PERSON: Melanie Courtot PERSON: Philippe Rocca Serra material transformation OBI branch derived material processing study subject role Human subjects in a clinical trial, rats in a toxicogenomics study, tissue cultures subjected to drug tests, fish observed in an ecotoxicology study. Parasite example: people are infected with a parasite which is then extracted; the participant under investigation could be the parasite, the people, or a population of which the people are members, depending on the nature of the study. Lake example: a lake could realize this role in an investigation that assays pollution levels in samples of water taken from the lake. A role that is realized through the execution of a study design in which the bearer of the role participates and in which data about that bearer is collected. A participant can realize both "specimen role" and "participant under investigation role" at the same time. However "participant under investigation role" is distinct from "specimen role", since a specimen could somehow be involved in an investigation without being the thing that is under investigation. GROUP: Role Branch OBI Following OBI call November 26th, 2012: 1. it was decided there was no need for moving the children class and making them siblings of study subject role. 2. it also settles the disambiguation about 'study subject'. This is about the individual participating in the investigation/study, Not the 'topic' (as in 'toxicity study') of the investigation/study This note closes the issue and validates the class definition to be part of the OBI core editor = PRS participant under investigation role specimen role liver section; a portion of a culture of cells; a nematode or other animal once no longer a subject (generally killed); portion of blood from a patient.
a role borne by a material entity that is gained during a specimen collection process and that can be realized by use of the specimen in an investigation 22Jun09. The definition includes whole organisms, and can include a human. The link between specimen role and study subject role has been removed. A specimen taken as part of a case study is not considered to be a population representative, while a specimen taken as representing a population (e.g. a person taken from a cohort, a blood specimen taken from an animal) would be considered a population representative and would also bear material sample role. Note: definition is in specimen creation objective which is defined as an objective to obtain and store a material entity for potential use as an input during an investigation. blood taken from animal: animal continues in study, whereas blood has role specimen. something taken from study subject, leaves the study and becomes the specimen. parasite example - when parasite in people we study people, people are subjects and parasites are specimen - when parasite extracted, they become subject in the following study specimen can later be subject. GROUP: Role Branch OBI specimen role sequence feature identification objective Sequence_feature_identification_objective is a biological_feature_identification_objective role describing a study designed to examine or characterize molecular features exhibited at the level of a macromolecular sequence, e.g. nucleic acid, protein, polysaccharide. Jennifer Fostel sequence feature identification objective intervention design PMID: 18208636. Br J Nutr. 2008 Jan 22;:1-11. Effect of vitamin D supplementation on bone and vitamin D status among Pakistani immigrants in Denmark: a randomised double-blinded placebo-controlled intervention study. An intervention design is a study design in which a controlled process applied to the subjects (the intervention) serves as the independent variable manipulated by the experimentalist.
The treatment (perturbation or intervention) can be defined as a combination of values taken by the independent variables manipulated by the experimentalists; these treatments are applied to the recruited subjects, who are assigned (possibly by applying specific methods) to treatment groups. The specificity of intervention design is the fact that independent variables are being manipulated and a response of the biological system is evaluated via response variables, possibly monitored by a series of assays. Philippe Rocca-Serra OBI branch derived intervention design gene list Gene lists may arise from analysis to determine differentially expressed genes, may be collected from the literature for involvement in a particular process or pathway (e.g., inflammation), or may be the input for gene set enrichment analysis. A data set of the names or identifiers of genes that are the outcome of an analysis or have been put together for the purpose of an analysis. person:Chris Stoeckert group:OBI kind of report. (alan) need to be careful to distinguish from output of a data transformation or calculation. A gene list is a report when it is published as such? Relates to question of whether report is a whole, or whether it can be a part of some other narrative object. gene list molecular feature identification objective Molecular_feature_identification_objective is a biological_feature_identification_objective role describing a study designed to examine or characterize molecular features of a biological system, e.g. expression profiling, copy number of molecular components, epigenetic modifications. Jennifer Fostel molecular feature identification objective cDNA library PMID:6110205. collection of cDNA derived from mouse splenocytes. Mixed population of cDNAs (complementary DNA) made from mRNA from a defined source, usually a specific cell type. This term should be associated only with nucleic acid interactors, not with their protein products.
For instance, 2h screening uses living cells (MI:0349) as the sample process. ALT DEF (PRS): a cDNA library is a collection of host cells, typically E. coli cells but not exclusively, modified by transfer of a plasmid DNA molecule used as a vector containing a fragment or the totality of a cDNA molecule (the insert). A cDNA library may have an array of roles and applications. PERSON: Luisa Montecchi PERSON: Philippe Rocca-Serra GROUP: PSI PRS: 22022008. class moved under population, modification of definition and replacement of biomaterials in previous definition with 'material' addition of has_role restriction cDNA library p-value PMID:19696660 in contrast to the in-vivo data AT-III increased significantly from 113.5% at baseline to 117% after 4 days (n = 10, P-value= 0.02; Table 2). A quantitative confidence value that represents the probability of obtaining a result at least as extreme as that actually obtained, assuming that the actual value was the result of chance alone. Addition of restriction 'output of null hypothesis testing' by AGB and PRS while working on STATO May be outside the scope of OBI long term, is needed so is retained Alejandra Gonzalez-Beltran PERSON:Chris Stoeckert Philippe Rocca-Serra WEB: http://en.wikipedia.org/wiki/P-value p p-value population PMID12564891. Environ Sci Technol. 2003 Jan 15;37(2):223-8. Effects of historic PCB exposures on the reproductive success of the Hudson River striped bass population. a population is a collection of individuals from the same taxonomic class living, counted or sampled at a particular site or in a particular area 1/28/2013, BP, on the call it was raised that we may want to switch to an external ontology for all population terms: http://code.google.com/p/popcomm-ontology/ PERSON: Philippe Rocca-Serra adapted from Oxford English Dictionary rem1: collection somehow always involves a selection process population imaging assay An imaging assay is an assay to produce a picture of an entity. definition_source: OBI.
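The p-value definition above (the probability of obtaining a result at least as extreme as that actually obtained, under chance alone) can be illustrated numerically. This is an editorial sketch, not part of the ontology: it assumes a standard normal null distribution and the two-sided convention, and the function name is mine.

```python
import math

def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) under a standard normal null distribution.

    Uses the standard normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2,
    so no external statistics library is needed.
    """
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# A z-statistic of about 2.33 yields p close to 0.02, the order of
# magnitude reported in the AT-III usage example quoted above.
```

The "at least as extreme" clause is what makes this a tail probability rather than a point probability; the two-sided form doubles the upper tail because deviations in either direction count as extreme.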
PlanAndPlannedProcess Branch OBI branch derived imaging assay organization PMID: 16353909.AAPS J. 2005 Sep 22;7(2):E274-80. Review. The joint food and agriculture organization of the United Nations/World Health Organization Expert Committee on Food Additives and its role in the evaluation of the safety of veterinary drug residues in foods. An entity that can bear roles, has members, and has a set of organization rules. Members of organizations are either organizations themselves or individual people. Members can bear specific organization member roles that are determined in the organization rules. The organization rules also determine how decisions are made on behalf of the organization by the organization members. BP: The definition summarizes long email discussions on the OBI developer, roles, biomaterial and denrie branches. It leaves open if an organization is a material entity or a dependent continuant, as no consensus was reached on that. The current placement as material is therefore temporary, in order to move forward with development. Here is the entire email summary, on which the definition is based: 1) there are organization_member_roles (president, treasurer, branch editor), with individual persons as bearers 2) there are organization_roles (employer, owner, vendor, patent holder) 3) an organization has a charter / rules / bylaws, which specify what roles there are, how they should be realized, and how to modify the charter/rules/bylaws themselves. It is debatable what the organization itself is (some kind of dependent continuant or an aggregate of people). This also determines who/what the bearer of organization_roles' are. My personal favorite is still to define organization as a kind of 'legal entity', but thinking it through leads to all kinds of questions that are clearly outside the scope of OBI. 
Interestingly enough, it does not seem to matter much where we place organization itself, as long as we can subclass it (University, Corporation, Government Agency, Hospital), instantiate it (Affymetrix, NCBI, NIH, ISO, W3C, University of Oklahoma), and have it play roles. This leads to my proposal: We define organization through the statements 1 - 3 above, but without an 'is a' statement for now. We can leave it in its current place in the is_a hierarchy (material entity) or move it up to 'continuant'. We leave further clarifications to BFO, and close this issue for now. PERSON: Alan Ruttenberg PERSON: Bjoern Peters PERSON: Philippe Rocca-Serra PERSON: Susanna Sansone GROUP: OBI organization dye role A molecular label role which inheres in a material entity and which is realized in the process of detecting a molecular dye that imparts color to some material of interest. Jennifer Fostel dye A substance used to color materials www.answers.com/topic/dye 19feb09 dye role protocol PCR protocol, has objective specification, amplify DNA fragment of interest, and has action specification describes the amounts of experimental reagents used (e.g. buffers, dNTPs, enzyme), and the temperature and cycle time settings for running the PCR. A plan specification which has sufficient level of detail and quantitative information to communicate it between investigation agents, so that different investigation agents will reliably be able to independently reproduce the process. PlanAndPlannedProcess Branch OBI branch derived + wikipedia (http://en.wikipedia.org/wiki/Protocol_%28natural_sciences%29) study protocol protocol adding a material entity into a target Injecting a drug into a mouse. Adding IL-2 to a cell culture. Adding NaCl into water. is a process with the objective to place a material entity bearing the 'material to be added role' into a material bearing the 'target of material addition role'.
Class was renamed from 'administering substance', as this is commonly used only for additions into organisms. BP branch derived adding a material entity into a target analyte role Glucose in blood (measured in an assay to determine the concentration of glucose). A measurand role borne by a molecular entity or an atom and realized in an analyte assay which achieves the objective to measure the magnitude/concentration/amount of the analyte in the entity bearing evaluant role. Interestingly, an analyte is still an analyte even if it is not detected. For this reason it does not bear a specified input role. pH (technically the inverse log of [H+]) may be considered a quality; this remains to be tested. qualities such as weight, color are not assayed but measured, so they do not fall into this category. GROUP: Role Branch OBI Feb 10, 2009. changes after discussion at OBI Consortium Workshop Feb 2-6, 2009. accepted as core term. analyte role material to be added role drug added to a buffer contained in a tube; substance injected into an animal; material to be added role is a protocol participant role realized by a material which is added into a material bearing the target of material addition role in a material addition process Role Branch OBI 9 March 09 from discussion with PA branch material to be added role interpreting data Concluding that a gene is upregulated in a tissue sample based on the band intensity in a western blot. Concluding that a patient has an infection based on measurement of an elevated body temperature and reported headache. Concluding that there were problems in an investigation because data from PCR and microarray are conflicting. Concluding that 'defects in gene XYZ cause cancer due to improper DNA repair' based on data from experiments in that study that gene XYZ is involved in DNA repair, and the conclusion of a previous study that cancer patients have an increased number of mutations in this gene.
A planned process in which data gathered in an investigation is evaluated in the context of existing knowledge with the objective to generate more general conclusions or to conclude that the data does not allow one to draw general conclusions PERSON: Bjoern Peters PERSON: Jennifer Fostel Bjoern Peters drawing a conclusion based on data planning The process of a scientist thinking about and deciding what reagents to use as part of a protocol for an experiment. Note that the scientist could be human or a "robot scientist" executing software. a process of creating or modifying a plan specification 7/18/2011 BP: planning used to itself be a planned process. Barry Smith pointed out that this would lead to an infinite regression, as there would have to be a plan to conduct a planning process, which in itself would be the result of planning etc. Therefore, the restrictions on 'planning' were loosened to allow for informal processes that result in an 'ad hoc plan'. This required changing from 'has_specified_output some plan specification' to 'has_participant some plan specification'. Bjoern Peters Bjoern Peters Plans and Planned Processes Branch planning light emission function A light emission function is an excitation function to excite a material to a specific excitation state such that it emits light. Bill Bug Daniel Schober Frank Gibson Melanie Courtot light emission function contain function A syringe, a beaker A contain function is a function to constrain a material entity's location in space Bill Bug Daniel Schober Frank Gibson Melanie Courtot contain function heat function A heat function is a function that increases the internal kinetic energy of a material Bill Bug Daniel Schober Frank Gibson Melanie Courtot heat function material separation function A material separation function is a function that increases the resolution between two or more material entities. The distinction between the entities is usually based on some associated physical quality.
Bill Bug Daniel Schober Frank Gibson Melanie Courtot material separation function excitation function An excitation function is a function to inject energy by bombarding a material with energetic particles (e.g., photons) thereby imbuing internal material components such as electrons with additional energy. These internal, 'excited' particles may lead to the rupturing of covalent chemical bonds or may quickly relax back to their unexcited state with an exponential time course thereby locally emitting energy in the form of photons. Bill Bug Daniel Schober Frank Gibson Melanie Courtot excitation function filter function A filter function is a function to prevent the flow of certain entities based on a quality or qualities of the entity while allowing entities which have different qualities to pass through Frank Gibson filter function cool function A cool function is a function to decrease the internal kinetic energy of a material below the initial kinetic energy of that type of material. Daniel Schober Frank Gibson Melanie Courtot cool function solid support function Taped, glued, pinned, dried or molecularly bonded to a solid support A solid support function is a function of a device on which an entity is kept in a defined position and prevented from moving Daniel Schober Frank Gibson Melanie Courtot solid support function environment control function An environmental control function is a function that regulates a contained environment within specified parameter ranges, for example the control of light exposure, humidity and temperature. Bill Bug Daniel Schober Frank Gibson Melanie Courtot environment control function sort function A sort function is a function to distinguish material components based on some associated physical quality or entity and to partition the separate components into distinct fractions according to a defined order.
Daniel Schober Frank Gibson Melanie Courtot sort function cloning vector role pBluescript plays the role of a cloning vector A material to be added role played by a small, self-replicating DNA or RNA molecule - usually a plasmid or chromosome - and realized in a process whereby foreign DNA or RNA is inserted into the vector during the process of cloning. JZ: related tracker: https://sourceforge.net/p/obi/obi-terms/102/ PERSON: Helen Parkinson cloning vector role cloning insert role cloning insert role is a role which inheres in DNA or RNA and is realized by the process of being inserted into a cloning vector in a cloning process. Feb 20, 2009. from Wikipedia: cloning of any DNA fragment essentially involves four steps: DNA fragmentation with restriction endonucleases, ligation of DNA fragments to a vector, transfection, and screening/selection. There are multiple processes involved, it is not just "cloning process" GROUP: Role branch OBI and Wikipedia cloning insert role extract Up-regulation of inflammatory signalings by areca nut extract and role of cyclooxygenase-2 -1195G>a polymorphism reveal risk of oral cancer. Cancer Res. 2008 Oct 15;68(20):8489-98. PMID: 18922923 an extract is a material entity which results from an extraction process PERSON: Philippe Rocca-Serra extracted material GROUP: OBI Biomaterial Branch extract transcription profiling assay Whole genome transcription profiling of Anaplasma phagocytophilum in human and tick host cells by tiling array analysis. BMC Genomics. 2008 Jul 31;9:364.
PMID: 18671858 An assay which aims to provide information about gene expression and transcription activity using ribonucleic acids collected from a material entity using a range of techniques and instruments such as DNA sequencers, DNA microarrays, and Northern blots Philippe Rocca-Serra gene expression profiling OBI transcription profiling transcription profiling assay averaging objective A mean calculation which has averaging objective is a descriptive statistics calculation in which the mean is calculated by taking the sum of all of the observations in a data set divided by the total number of observations. It gives a measure of the 'center of gravity' for the data set. It is also known as the first moment. An averaging objective is a data transformation objective where the aim is to perform mean calculations on the input of the data transformation. Elisabetta Manduchi James Malone PERSON: Elisabetta Manduchi averaging objective enzyme (protein or rna) or has_part (protein or rna) and has_function some GO:0003824 (catalytic activity) MC: known issue: enzyme doesn't classify under material entity for now as it isn't stated that anything that has_part some material entity is a material entity. If we add as equivalent classes to material entity has_part some material entity and part_of some material entity (each one in its own necessary and sufficient block) Pellet in P3 doesn't classify any more. person: Melanie Courtot GROUP:OBI enzyme adding material objective creating a mouse infected with LCM virus is the specification of an objective to add a material into a target material. The adding is asymmetric in the sense that the target material largely retains its identity BP adding material objective genotyping assay High-throughput genotyping of oncogenic human papilloma viruses with MALDI-TOF mass spectrometry. Clin Chem. 2008 Jan;54(1):86-92. Epub 2007 Nov 2. PMID: 17981923 an assay which generates data about a genotype from a specimen of genomic DNA.
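The mean calculation described for 'averaging objective' (the sum of all observations divided by the total number of observations) is simple enough to state directly in code. A minimal editorial sketch, not part of the ontology; the function name is mine:

```python
def mean(observations):
    """Arithmetic mean: sum of all observations divided by their count.

    This is the 'first moment', the 'center of gravity' of the data set,
    as described in the averaging objective definition.
    """
    if not observations:
        raise ValueError("mean is undefined for an empty data set")
    return sum(observations) / len(observations)

# mean([2, 4, 9]) returns 5.0
```

Guarding against the empty data set matters because the defining ratio sum/count is undefined when count is zero.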
A variety of techniques and instruments can be used to produce information about sequence variation at particular genomic positions. Philippe Rocca-Serra genotype profiling, SNP genotyping OBI Biomaterial SNP analysis genotyping assay analyte measurement objective The objective to measure the concentration of glucose in a blood sample an assay objective to determine the presence or concentration of an analyte in the evaluant PERSON: Bjoern Peters PPPB branch analyte measurement objective assay objective the objective to determine the weight of a mouse. an objective specification to determine a specified type of information about an evaluated entity (the material entity bearing evaluant role) PPPB branch PPPB branch assay objective analyte assay example of usage: In a lab test for blood glucose, the test is the assay, the blood bears evaluant_role and glucose bears the analyte role. The evaluant is considered an input to the assay and the information entity that records the measurement of glucose concentration is the output An assay with the objective to capture information about the presence, concentration, or amount of an analyte in an evaluant. 2013-09-23: simplify equivalent axiom Note: is_realization of some analyte role isn't always true, for example when there is none of the analyte in the evaluant. For the moment we are writing it this way, but when the information ontology is further worked out this will be replaced with a condition discussing the measurement.
logical def modified to remove expression below, as some analyte assays report below the level of detection, and therefore not a scalar measurement datum, replaced by measurement datum and ('has measurement unit label' some 'measurement unit label') and ('is quality measurement of' some 'molecular concentration')) PERSON:Bjoern Peters, Helen Parkinson, Philippe Rocca-Serra, Alan Ruttenberg PERSON:Bjoern Peters PERSON:Helen Parkinson PERSON:Philippe Rocca-Serra PERSON:Alan Ruttenberg GROUP:OBI Planned process branch analyte assay target of material addition role peritoneum of an animal receiving an intraperitoneal injection; solution in a tube receiving additional material; location of absorbed material following a dermal application. target of material addition role is a role realized by an entity into which a material is added in a material addition process From Branch discussion with BP, AR, MC -- there is a need for the recipient to interact with the administered material. For example, a tooth receiving a filling was not considered to be a target role. GROUP: Role Branch OBI target of material addition role normalized data set A data set that is produced as the output of a normalization data transformation. PERSON: James Malone PERSON: Melanie Courtot normalized data set measure function A glucometer measures blood glucose concentration, the glucometer has a measure function. Measure function is a function that is borne by a processed material and realized in a process in which information about some entity is expressed relative to some reference. PERSON: Daniel Schober PERSON: Helen Parkinson PERSON: Melanie Courtot PERSON:Frank Gibson measure function material transformation objective The objective to create a mouse infected with LCM virus. The objective to create a defined solution of PBS. an objective specification to create a specific output object from input materials.
PERSON: Bjoern Peters PERSON: Frank Gibson PERSON: Jennifer Fostel PERSON: Melanie Courtot PERSON: Philippe Rocca-Serra artifact creation objective GROUP: OBI PlanAndPlannedProcess Branch material transformation objective study design execution injecting a mouse with PBS solution, weighing it, and recording the weight according to a study design. a planned process that carries out a study design removed axiom has_part some (assay or 'data transformation') per discussion on protocol application mailing list to improve reasoner performance. The axiom is still desired. branch derived 6/11/9: edited at workshop. Used to be: study design execution is a process with the objective to generate data according to a concretized study design. The execution of a study design is part of an investigation, and minimally consists of an assay or data transformation. study design execution DNA sequencing Genomic deletions of OFD1 account for 23% of oral-facial-digital type 1 syndrome after negative DNA sequencing. Thauvin-Robinet C, Franco B, Saugier-Veber P, Aral B, Gigot N, Donzel A, Van Maldergem L, Bieth E, Layet V, Mathieu M, Teebi A, Lespinasse J, Callier P, Mugneret F, Masurel-Paulet A, Gautier E, Huet F, Teyssier JR, Tosi M, Frébourg T, Faivre L. Hum Mutat. 2008 Nov 19. PMID: 19023858 DNA sequencing is a sequencing process which uses deoxyribonucleic acid as input and results in the creation of a DNA sequence information artifact using a DNA sequencer instrument. Philippe Rocca-Serra OBI Branch derived nucleotide sequencing DNA sequencing material separation objective The objective to obtain multiple aliquots of an enzyme preparation. The objective to obtain cells contained in a sample of blood. is an objective to transform a material entity into spatially separated components.
PPPB branch PPPB branch material separation objective clustered data set A clustered data set is the output of a K means clustering data transformation A data set that is produced as the output of a class discovery data transformation and consists of a data set with assigned discovered class labels. PERSON: James Malone PERSON: Monnie McGee data set with assigned discovered class labels AR thinks could be a data item instead clustered data set data set of features A data set that is produced as the output of a descriptive statistical calculation data transformation and consists of producing a data set that represents one or more features of interest about the input data set. PERSON: James Malone PERSON: Monnie McGee data set of features differential expression analysis data transformation A differential expression analysis data transformation is a data transformation that has objective differential expression analysis and that consists of James Malone Melanie Courtot Monnie McGee WEB: differential expression analysis data transformation material combination Mixing two fluids. Adding salt into water. Injecting a mouse with PBS. is a material processing with the objective to combine two or more material entities as input into a single material entity as output. created at workshop as parent class for 'adding material into target', which is asymmetric, while combination encompasses all addition processes. bp bp material combination specimen collection process drawing blood from a patient for analysis, collecting a piece of a plant for depositing in a herbarium, buying meat from a butcher in order to measure its protein content in an investigation A planned process with the objective of collecting a specimen. Note: definition is in specimen creation objective which is defined as an objective to obtain and store a material entity for potential use as an input during an investigation. 
Philly2013: A specimen collection can have as part a material entity acquisition, such as ordering from a bank. The distinction is that specimen collection necessarily involves the creation of a specimen role. However ordering cell lines from ATCC for use in an investigation is NOT a specimen collection, because the cell lines already have a specimen role. Philly2013: The specimen_role for the specimen is created during the specimen collection process. label changed to 'specimen collection process' on 10/27/2014, details see tracker: http://sourceforge.net/p/obi/obi-terms/716/ Bjoern Peters specimen collection 5/31/2012: This process is not necessarily an acquisition, as specimens may be collected from materials already in possession 6/9/09: used at workshop specimen collection process error corrected data set A data set that is produced as the output of an error correction data transformation and consists of producing a data set which has had erroneous contributions from the input to the data transformation removed (corrected for). PERSON: James Malone PERSON: Monnie McGee error corrected data set error correction data transformation An error correction data transformation is a data transformation that has the objective of error correction, where the aim is to remove (correct for) erroneous contributions from the input to the data transformation. James Malone Monnie McGee EDITORS error correction data transformation sample from organism a material obtained from an organism in order to be a representative of the whole 5/29: This is a helper class for now; we need to work on this: Is taking a urine sample a material separation process? If not, we will need to specify what 'taking a sample from organism' entails. We can argue that the objective to obtain a urine sample from a patient is enough to call it a material separation process, but it could dilute what material separation was supposed to be about. 
sample from organism statistical hypothesis test "A statistical test provides a mechanism for making quantitative decisions about a process or processes". A statistical hypothesis test data transformation is a data transformation that has objective statistical hypothesis test. Alejandra Gonzalez-Beltran James Malone Philippe Rocca-Serra PERSON: James Malone http://www.itl.nist.gov/div898/handbook/prc/section1/prc13.htm NHST Null Hypothesis Statistical Testing statistical hypothesis testing statistical hypothesis test center value A data item that is produced as the output of a center calculation data transformation and represents the center value of the input data. PERSON: James Malone PERSON: Monnie McGee median center value statistical hypothesis test objective is a data transformation objective where the aim is to estimate statistical significance in order to prove or disprove a hypothesis by means of some data transformation James Malone Person:Helen Parkinson hypothesis test objective WEB: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing statistical hypothesis test objective portioning objective The objective to obtain multiple aliquots of an enzyme preparation. A material separation objective aiming to separate material into multiple portions, each of which contains a similar composition of the input material. portioning objective average value A data item that is produced as the output of an averaging data transformation and represents the average value of the input data. PERSON: James Malone PERSON: Monnie McGee arithmetic mean average value separation into different composition objective The objective to obtain cells contained in a sample of blood. A material separation objective aiming to separate a material entity that has parts of different types, and end with at least one output that is a material with parts of fewer types (modulo impurities). 
We should be using 'has grain' relations or concentrations to distinguish the portioning and other sub-objectives separation into different composition objective specimen collection objective The objective to collect bits of excrement in the rainforest. The objective to obtain a blood sample from a patient. An objective specification to obtain a material entity for potential use as an input during an investigation. Bjoern Peters Bjoern Peters specimen collection objective material combination objective is an objective to obtain an output material that contains several input materials. PPPB branch bp material combination objective paired-end library PMID: 19339662. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009 Apr;19(4):521-32. Fullwood MJ, Wei CL, Liu ET, Ruan Y. is a collection of short paired tags from the two ends of DNA fragments that are extracted and covalently linked as ditag constructs Philippe Rocca-Serra mate-paired library paired-end tag (PET) library adapted from information provided by Solid web site paired-end library k-nearest neighbors A k-nearest neighbors is a data transformation which achieves a class discovery or partitioning objective, in which an input data object with vector y is assigned the class label most common among the k closest training data set points to y. James Malone k-NN PERSON: James Malone k-nearest neighbors recombinant vector A recombinant vector is created by a recombinant vector cloning process, and contains nucleic acids that can be amplified. It retains functions of the original cloning vector. 
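The k-nearest neighbors entry above can be illustrated with a minimal pure-Python sketch of the majority-vote form of the algorithm; all function and variable names here are illustrative, not ontology terms:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Assign `query` the class label most common among the k training
    points closest to it (Euclidean distance, majority vote)."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated classes.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.9, 5.2), "b")]
print(knn_classify(train, (0.3, 0.1)))  # → a
print(knn_classify(train, (5.0, 4.8)))  # → b
```

Ties among the k votes are broken arbitrarily here; practical implementations often weight votes by inverse distance.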
recombinant vector single fragment library is a collection of short tags from DNA fragments that are extracted and covalently linked as single tag constructs Philippe Rocca-Serra fragment library single fragment library cloning vector A cloning vector is an engineered material that is used as an input material for a recombinant vector cloning process to carry inserted nucleic acids. It contains an origin of replication for a specific destination host organism, encodes a selectable gene product and contains a cloning site. cloning vector Student's t-test Student's t-test is a data transformation with the objective of a statistical hypothesis test in which the test statistic has a Student's t distribution if the null hypothesis is true. It is applied when the population is assumed to be normally distributed but the sample sizes are small enough that the statistic on which inference is based is not normally distributed because it relies on an uncertain estimate of standard deviation rather than on a precisely known value. Alejandra Gonzalez-Beltran James Malone Philippe Rocca-Serra t-test WEB: http://en.wikipedia.org/wiki/T-test t.test(dependent_variable ~ independent_variable, data = dataset, var.equal = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html Student's t-test material sample role a role borne by a portion of blood taken to represent all the blood in an organism; the role borne by a population of humans with HIV enrolled in a study taken to represent patients with HIV in general. A material sample role is a specimen role borne by a material entity that is the output of a material sampling process. 7/13/09: Note that this is a relational role: between the sample taken and the 'sampled' material of which the sample is thought to be representative of. 
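Since the Student's t-test entry above cites the R call t.test(..., var.equal = FALSE), a pure-Python sketch of the corresponding Welch (unequal-variance) t statistic and its Welch-Satterthwaite degrees of freedom may help; names are illustrative, and converting the statistic into a p-value still requires a t-distribution table or a statistics library:

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of
    freedom (the unequal-variance form, as in R's var.equal = FALSE)."""
    se2x = variance(x) / len(x)   # squared standard error of each mean
    se2y = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(se2x + se2y)
    df = (se2x + se2y) ** 2 / (
        se2x ** 2 / (len(x) - 1) + se2y ** 2 / (len(y) - 1)
    )
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, df)  # → -1.0 8.0
```

The null hypothesis is rejected when |t| exceeds the critical value of a t distribution with df degrees of freedom at the chosen significance level.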
material sample role material sampling process A specimen gathering process with the objective to obtain a specimen that is representative of the input material entity material sampling process material sample blood drawn from a patient to measure his systemic glucose level. A population of humans with HIV enrolled in a study taken to represent patients with HIV in general. A material entity that has the material sample role OBI: workshop sample population sample material sample independent variable specification In a study in which gene expression is measured in patients between 8 months and 4 years old that have mild or severe malaria and in which the hypothesis is that gene expression in that age group is a function of disease status, disease status is the independent variable. a directive information entity that is part of a study design. Independent variables are entities whose values are selected to determine their relationship to an observed phenomenon (the dependent variable). In such an experiment, an attempt is made to find evidence that the values of the independent variable determine the values of the dependent variable (that which is being measured). The independent variable can be changed as required, and its values do not represent a problem requiring explanation in an analysis, but are taken simply as given. The dependent variable on the other hand, usually cannot be directly controlled 2/2/2009 Original definition - In the design of experiments, independent variables are those whose values are controlled or selected by the person experimenting (experimenter) to determine its relationship to an observed phenomenon (the dependent variable). In such an experiment, an attempt is made to find evidence that the values of the independent variable determine the values of the dependent variable (that which is being measured). 
The independent variable can be changed as required, and its values do not represent a problem requiring explanation in an analysis, but are taken simply as given. The dependent variable on the other hand, usually cannot be directly controlled. In the Philly 2013 workshop the label was chosen to distinguish it from "dependent variable" as used in statistical modelling. See: http://en.wikipedia.org/wiki/Statistical_modeling an independent variable is a variable which assumes only values set by the operator according to a plan and which are expected to (or are being tested for) influence the ranges of values assumed by one or more dependent variables (also known as 'response variables'). PERSON: Alan Ruttenberg PERSON: Bjoern Peters PERSON: Chris Stoeckert experimental factor independent variable Web: http://en.wikipedia.org/wiki/Dependent_and_independent_variables 2009-03-16: work has been done on this term during the OBI workshop winter 2009 and the current definition was considered acceptable for use in OBI. If there is a need to modify this definition please notify OBI. study factor explanatory variable factor study design independent variable dependent variable specification In a study in which gene expression is measured in patients between 8 months and 4 years old that have mild or severe malaria and in which the hypothesis is that gene expression in that age group is a function of disease status, the gene expression is the dependent variable. dependent variable specification is part of a study design. The dependent variable is the event studied and expected to change when the independent variable varies. 2/2/2009 In the design of experiments, independent variables are those whose values are controlled or selected by the person experimenting (experimenter) to determine its relationship to an observed phenomenon (the dependent variable). 
In such an experiment, an attempt is made to find evidence that the values of the independent variable determine the values of the dependent variable (that which is being measured). The independent variable can be changed as required, and its values do not represent a problem requiring explanation in an analysis, but are taken simply as given. The dependent variable on the other hand, usually cannot be directly controlled. In the Philly 2013 workshop the label was chosen to distinguish it from "dependent variable" as used in statistical modelling. See: http://en.wikipedia.org/wiki/Statistical_modeling PERSON: Alan Ruttenberg PERSON: Bjoern Peters PERSON: Chris Stoeckert dependent variable WEB: http://en.wikipedia.org/wiki/Dependent_and_independent_variables 2009-03-16: work has been done on this term during the OBI workshop winter 2009 and the current definition was considered acceptable for use in OBI. If there is a need to modify this definition please notify OBI. response variable study design dependent variable survival rate A measurement datum that represents the percentage of people or animals in a study or treatment group who are alive for a given period of time after diagnosis or initiation of monitoring. Oliver He adapted from wikipedia http://en.wikipedia.org/wiki/Survival_rate survival rate multiple testing correction objective Application of the Bonferroni correction A multiple testing correction objective is a data transformation objective where the aim is to correct for a set of statistical inferences considered simultaneously multiple comparison correction objective http://en.wikipedia.org/wiki/Multiple_Testing_Correction multiple testing correction objective material maintenance objective An objective specification to maintain some or all of the qualities of a material over time. 
PERSON: Bjoern Peters PERSON: Bjoern Peters material maintenance objective primary structure of DNA macromolecule a quality of a DNA molecule that inheres in its bearer due to the order of its DNA nucleotide residues. placeholder for SO BP et al primary structure of DNA macromolecule measurement device A ruler, a microarray scanner, a Geiger counter. A device in which a measure function inheres. GROUP:OBI Philly workshop OBI measurement device material maintenance a process that achieves the objective to maintain some or all of the characteristics of an input material over time material maintenance polyA RNA extraction An RNA extraction process typically involving the use of poly dT oligomers in which the desired output material is polyA RNA. Person: Chris Stoeckert Person: Jie Zheng UPenn Group polyA RNA extraction Likelihood-ratio test Likelihood-ratio test is a data transformation which tests whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one); tests of the goodness-of-fit between two models. date: March 2013 AGB and PRS provided a formal definition expressing the test in terms of output and input, specifying the nature of the variables, the purpose of the test and the distribution used. Alejandra Gonzalez-Beltran Philippe Rocca-Serra Tina Boussard lrtest() http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/lmtest/html/lrtest.html Likelihood-ratio test survival curve A survival curve is a report graph which is a graphical representation of data where the percentage of survival is plotted as a function of time. Alejandra Gonzalez-Beltran PERSON:Chris Stoeckert PERSON:James Malone PERSON:Melanie Courtot Philippe Rocca-Serra WEB: http://www.graphpad.com/www/book/survive.htm survival curve flow cytometry assay Using a flow cytometer to quantitate the percent of CD3 positive cells in a population by labeling them with a FITC tagged anti-CD3 antibody. 
A cytometry assay in which an input cell population is put in solution, is passed by a laser, and optical sensors are used to detect scattering of the laser light and/or fluorescence of specific markers to count and characterize the particles in solution. IEDB IEDB flow cytometry assay labeled specimen A specimen that has been modified so that it can be detected in future experiments added during call 3/1/2010 OBI group labeled specimen study intervention the part of the execution of an intervention design study which is varied between two or more subjects in the study PERSON: Bjoern Peters GROUP: OBI study intervention material separation device flow cytometer A device with a separation function realized in a planned process material separation device categorical measurement datum A measurement datum that is reported on a categorical scale Bjoern Peters nominal measurement datum Bjoern Peters categorical measurement datum processed specimen A tissue sample that has been sliced and stained for a histology study. A blood specimen that has been centrifuged to obtain the white blood cells. A specimen that has been intentionally physically modified. Bjoern Peters Bjoern Peters processed specimen categorical label The labels 'positive' vs. 'negative', or 'left handed', 'right handed', 'ambidextrous', or 'strongly binding', 'weakly binding' , 'not binding', or '+++', '++', '+', '-' etc. form scales of categorical labels. A label that is part of a categorical datum and that indicates the value of the data item on the categorical scale. Bjoern Peters Bjoern Peters categorical label in live cell assay An assay in which a measurement is made by observing entities located in a live cell. 
in live cell assay container A device that can be used to restrict the location of material entities over time 03/21/2010: Added to allow classification of children (similar to what we want to do for 'measurement device'. Looking at what classifies here, we may want to reconsider whether a contain function assigned to a part of an entity is necessarily also a function of the whole (e.g. is a centrifuge a container because it has test tubes as parts?) PERSON: Bjoern Peters container device A voltmeter is a measurement device which is intended to perform some measure function. An autoclave is a device that sterilizes instruments or contaminated waste by applying high temperature and pressure. A material entity that is designed to perform a function in a scientific investigation, but is not a reagent. 2012-12-17 JAO: In common lab usage, there is a distinction made between devices and reagents that is difficult to model. Therefore we have chosen to specifically exclude reagents from the definition of "device", and are enumerating the types of roles that a reagent can perform. 2013-6-5 MHB: The following clarifications are outcomes of the May 2013 Philly Workshop. Reagents are distinguished from devices that also participate in scientific techniques by the fact that reagents are chemical or biological in nature and necessarily participate in some chemical interaction or reaction during the realization of their experimental role. By contrast, devices do not participate in such chemical reactions/interactions. Note that there are cases where devices use reagent components during their operation, where the reagent-device distinction is less clear. For example: (1) An HPLC machine is considered a device, but has a column that holds a stationary phase resin as an operational component. 
This resin qualifies as a device if it participates purely in size exclusion, but bears a reagent role that is realized in the running of a column if it interacts electrostatically or chemically with the evaluant. The container the resin is in (“the column”) considered alone is a device. So the entire column as well as the entire HPLC machine are devices that have a reagent as an operating part. (2) A pH meter is a device, but its electrode component bears a reagent role in virtue of its interacting directly with the evaluant in execution of an assay. (3) A gel running box is a device that has a metallic lead as a component that participates in a chemical reaction with the running buffer when a charge is passed through it. This metallic lead is considered to have a reagent role as a component of this device realized in the running of a gel. In the examples above, a reagent is an operational component of a device, but the device itself does not realize a reagent role (as bearing a reagent role is not transitive across the part_of relation). In this way, the asserted disjointness between a reagent and device holds, as both roles are never realized in the same bearer during execution of an assay. PERSON: Helen Parkinson instrument OBI development call 2012-12-17. device sequence data example of usage: the representation of a nucleotide sequence in FASTA format used for a sequence similarity search. A measurement datum representing the primary structure of a macromolecule (its sequence), sometimes associated with an indicator of confidence of that measurement. Person:Chris Stoeckert GROUP: OBI sequence data dose An organism has been injected with 1 ml of vaccine A measurement datum that measures the quantity of something that may be administered to an organism or that an organism may be exposed to. Quantities of nutrients, drugs, vaccines and toxins are referred to as doses. 
dose nucleic acid extract An extract that is the output of an extraction process in which nucleic acid molecules are isolated from a specimen. PERSON: Jie Zheng UPenn Group nucleic acid extract light emission device A light source is an optical subsystem that provides light for use in a distant area using a delivery system (e.g., fiber optics) a device which has a function to emit light. Person:Helen Parkinson OBI light emission device environmental control device A growth chamber is an environmental control device. An environmental control device is a device which has the function to control some aspect of the environment such as temperature, or humidity. Helen Parkinson OBI environmental control device labeled nucleic acid extract a labeled specimen that is the output of a labeling process and has grain labeled nucleic acid for detection of the nucleic acid in future experiments. Person: Jie Zheng labeled extract MO_221 labeledExtract labeled extract labeled nucleic acid extract dose response curve A data item of paired values, one indicating the dose of a material, the other quantitating a measured effect at that dose. The dosing intervals are chosen so that effect values can be interpolated by plotting a curve. 
Bjoern Peters; Randi Vita Philippe Rocca-Serra, Alejandra Gonzalez-Beltran dose response curve genetic population background information genotype information 'C57BL/6J Hnf1a+/-' in this case, C57BL/6J is the genetic population background information a genetic characteristics information which is a part of genotype information that identifies the population of organisms proposed and discussed on San Diego OBI workshop, March 2011 Group: OBI group Group: OBI group genetic population background information FWER adjusted p-value http://ugrad.stat.ubc.ca/R/library/LPE/html/mt.rawp2adjp.html A quantitative confidence value resulting from a multiple testing error correction method which adjusts the p-value used as input to control for Type I error in the context of multiple pairwise tests Addition of restriction 'output of null hypothesis testing' and specified output by AGB and PRS while working on STATO PERS:Philippe Rocca-Serra adapted from wikipedia (http://en.wikipedia.org/wiki/Familywise_error_rate) Family-wise type I error rate FWER adjusted p-value RNA-seq assay An assay in which sequencing technology (e.g. Solexa/454) is used to generate RNA sequence, analyse the transcribed regions of the genome, and/or to quantitate transcript abundance PERSON: James Malone transcription profiling by high throughput sequencing EFO_0002770 transcription profiling by high throughput sequencing JZ: should be inferred as 'DNA sequencing'. Will check in the future. an assay that uses high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content. RNA-Seq provides researchers with efficient ways to measure transcriptome data experimentally, allowing them to get information such as how different alleles of a gene are expressed, detect post-transcriptional mutations or identify gene fusions. 
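The FWER adjusted p-value entry above describes controlling family-wise Type I error across multiple tests; the Bonferroni correction cited as an example of the multiple testing correction objective can be sketched in a few lines of Python (function name illustrative):

```python
def bonferroni_adjust(pvalues):
    """Bonferroni FWER adjustment: multiply each raw p-value by the
    number of tests performed, capping the result at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

adjusted = bonferroni_adjust([0.01, 0.04, 0.30])
print([round(q, 4) for q in adjusted])  # → [0.03, 0.12, 0.9]
```

Rejecting whenever an adjusted p-value falls below the significance level alpha controls the family-wise error rate at alpha, at the cost of reduced power when the number of tests is large.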
WEB: http://en.wikipedia.org/wiki/RNA-Seq RNA-seq assay genotype information Genotype information can be: Mus musculus wild type (in this case the genetic population background information is Mus musculus), C57BL/6J Hnf1a+/- (in this case, C57BL/6J is the genetic population background information and Hnf1a+/- is the allele information a genetic characteristics information that is about the genetic material of an organism and minimally includes information about the genetic background and can in addition contain information about specific alleles, genetic modifications, etc. discussed on San Diego OBI workshop, March 2011 Group: OBI group Group: OBI group genotype information transcription profiling identification objective A molecular feature identification objective that aims to characterize the abundance of transcripts Person: Chris Stoeckert, Jie Zheng Group: Penn Group transcription profiling identification objective allele information genotype information 'C57BL/6J Hnf1a+/-' in this case, Hnf1a+/- is the allele information a genetic alteration information that is about one of two or more alternative forms of a gene or marker sequence and differing from other alleles at one or more mutational sites based on sequence. Polymorphisms are included in this definition. discussed on San Diego OBI workshop, March 2011 Person: Chris Stoeckert, Jie Zheng MO_58 Allele allele information genetic alteration information a genetic characteristics information that is about known changes or the lack thereof from the genetic background, including allele information, duplication, insertion, deletion, etc. proposed and discussed on San Diego OBI workshop, March 2011 Group: OBI group Group: OBI group genetic alteration information genetic characteristics information a data item that is about genetic material including polymorphisms, disease alleles, and haplotypes. 
Person: Chris Stoeckert, Jie Zheng MO_66 IndividualGeneticCharacteristics MO definition: The genotype of the individual organism from which the biomaterial was derived. Individual genetic characteristics include polymorphisms, disease alleles, and haplotypes. examples in ArrayExpress wild_type MutaMouse (CD2F1 mice with lambda-gt10LacZ integration) AlfpCre; SNF5 flox/knockout p53 knock out C57Bl/6 gp130lox/lox MLC2vCRE/+ fer-15; fem-1 df/df pat1-114/pat1-114 ade6-M210/ade6-M216 h+/h+ (cells are diploid) genetic characteristics information q-value PMID: 20483222. Comp Biochem Physiol Part D Genomics Proteomics. 2008 Sep;3(3):234-42. Analysis of Sus scrofa liver proteome and identification of proteins differentially expressed between genders, and conventional and genetically enhanced lines. "After controlling the false discovery rate (FDR</=0.1) using the Storey q value only four proteins (EPHX1, CAT, PAH, ST13) were shown to be differentially expressed between genders (Males/Females) and two proteins (SELENBP2, TAGLN) were differentially expressed between two lines (Transgenic/Conventional pigs)" A quantitative confidence value that measures the minimum false discovery rate that is incurred when calling that test significant. To compute q-values, it is necessary to know the p-value produced by a test and possibly set a false discovery rate level. Addition of restriction 'output of null hypothesis testing' by AGB and PRS while working on STATO PERS:Philippe Rocca-Serra FDR adjusted p-value Adapted from several sources, including http://en.wikipedia.org/wiki/False_discovery_rate http://svitsrv25.epfl.ch/R-doc/library/qvalue.html q q-value genotyping design A study design that classifies an individual or group of individuals on the basis of alleles, haplotypes, SNPs. Person: Chris Stoeckert, Jie Zheng MO_560 genotyping_design genotyping design specimen from organism A specimen that derives from an anatomical part or substance arising from an organism. 
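The q-value entry above equates the term with FDR adjusted p-values; a minimal pure-Python sketch of the Benjamini-Hochberg step-up adjustment that underlies the simplest FDR-adjusted p-values may help (Storey's q-value method additionally estimates the proportion of true null hypotheses, which is omitted here; names are illustrative):

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure):
    scale the rank-i smallest p-value by m/i, enforce monotonicity from
    the largest rank down, and return values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):   # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print([round(q, 4) for q in bh_adjust([0.01, 0.04, 0.03, 0.005])])
# → [0.02, 0.04, 0.04, 0.02]
```

Calling all tests with adjusted value at or below a threshold q significant controls the expected false discovery rate at q, under independence or positive dependence of the tests.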
Examples of tissue specimen include tissue, organ, physiological system, blood, or body location (arm). PERSON: Chris Stoeckert, Jie Zheng tissue specimen MO_954 organism_part specimen from organism fluorescence detection assay Using a laser to stimulate a cell culture that was previously labeled with fluorescent antibodies to detect light emission at a different wavelength in order to determine the presence of surface markers the antibodies are specific for. An assay in which a material's fluorescence is determined. IEDB IEDB fluorescence detection assay rate measurement datum The rate of disassociation of a peptide from a complex with an MHC molecule measured by the ratio of bound and unbound peptide per unit of time. A scalar measurement datum that represents the number of events occurring over a time interval PERSON: Bjoern Peters, Randi Vita IEDB rate measurement datum DNA sequence data The part of a FASTA file that contains the letters ACTGGGAA A sequence data item that is about the primary structure of DNA OBI call; Bjoern Peters OBI call; Melanie Courtout 8/29/11 call: This is added after a request from Melanie and Yu. They should review it further. This should be a child of 'sequence data', and as of the current definition will infer there. DNA sequence data selection criterion rats should be aged between 6 and 8 weeks and weight between 180-250grams A directive information entity which defines and states a principle of standard by which selection process may take place. Person: Philippe Rocca-Serra selection rule OBI discussion summarized under the following tracker item : http://sourceforge.net/p/obi/obi-terms/678/ selection criterion drawing a conclusion Concluding that the length of the hypotenuse is equal to the square root of the sum of squares of the other two sides in a right-triangle. Concluding that a gene is upregulated in a tissue sample based on the band intensity in a western blot. 
Concluding that a patient has an infection based on measurement of an elevated body temperature and reported headache. Concluding that there were problems in an investigation because data from PCR and microarray are conflicting. A planned process in which new information is inferred from existing information. drawing a conclusion assay array A device made to be used in an analyte assay for immobilization of substances that bind the analyte at regular spatial positions on a surface. PERSON: Chris Stoeckert, Jie Zheng, Alan Ruttenberg Penn Group assay array conclusion based on data The conclusion that a gene is upregulated in a tissue sample based on the band intensity in a western blot. The conclusion that a patient has an infection based on measurement of an elevated body temperature and reported headache. The conclusion that there were problems in an investigation because data from PCR and microarray are conflicting. The following are NOT conclusions based on data: data themselves; results from pure mathematics, e.g. "13 is prime". An information content entity that is inferred from data. In the Philly 2013 workshop, we recognized the limitations of "conclusion textual entity", and we introduced this as more general. The need for the 'textual entity' term going forward is up for future debate. Group:2013 Philly Workshop group Group:2013 Philly Workshop group conclusion based on data cell freezing medium A processed material that serves as a liquid vehicle for freezing cells for long term quiescent storage, which contains chemicals needed to sustain cell viability across freeze-thaw cycles. 
PERSON: Matthew Brush cell freezing medium categorical value specification A value specification that specifies one category out of a fixed number of nominal categories PERSON:Bjoern Peters categorical value specification scalar value specification A value specification that consists of two parts: a numeral and a unit label PERSON:Bjoern Peters scalar value specification value specification The value of 'positive' in a classification scheme of "positive or negative"; the value of '20g' on the quantitative scale of mass. An information content entity that specifies a value within a classification scheme or on a quantitative scale. This term is currently a descendant of 'information content entity', which requires that it 'is about' something. A value specification of '20g' for a measurement data item of the mass of a particular mouse 'is about' the mass of that mouse. However, there are cases where a value specification is not clearly about any particular. In the future we may change 'value specification' to remove the 'is about' requirement. PERSON:Bjoern Peters value specification molecular-labeled material a material entity that is the specified output of an addition of molecular label process that aims to label some molecular target to allow for its detection in a detection of molecular label assay PERSON:Matthew Brush OBI developer call, 3-12-12 molecular-labeled material cytometry assay An intracellular material detection by flow cytometry assay measuring perforin inside a culture of T cells. An assay that measures properties of cells. IEDB IEDB cytometry assay physical store a freezer. a humidity controlled box. A container with an environmental control function.
For details see tracker item: http://sourceforge.net/p/obi/obi-terms/793/ Chris Stoeckert Duke Biobank, OBIB Biobank physical store measurand role A role borne by a material entity and realized in an assay which achieves the objective to measure the magnitude/concentration/amount of the measurand in the entity bearing the evaluant role. Person: Alan Ruttenberg, Jie Zheng https://en.wiktionary.org/wiki/measurand https://github.com/obi-ontology/obi/issues/778 measurand role organism animal fungus plant virus A material entity that is an individual living system, such as an animal, plant, bacterium or virus, that is capable of replicating or reproducing, and of growth and maintenance in the right environment. An organism may be unicellular or made up, like humans, of many billions of cells divided into specialized tissues and organs. 10/21/09: This is a placeholder term, that should ideally be imported from the NCBI taxonomy, but the high level hierarchy there does not suit our needs (includes plasmids and 'other organisms') 13-02-2009: OBI doesn't take a position as to when an organism starts or ends being an organism - e.g. sperm, foetus. This issue is outside the scope of OBI. GROUP: OBI Biomaterial Branch WEB: http://en.wikipedia.org/wiki/Organism organism specimen Biobanking of blood taken and stored in a freezer for potential future investigations stores a specimen. A material entity that has the specimen role. Note: the definition is in specimen creation objective, which is defined as an objective to obtain and store a material entity for potential use as an input during an investigation. PERSON: James Malone PERSON: Philippe Rocca-Serra GROUP: OBI Biomaterial Branch specimen cultured cell population A cultured cell population applied in an experiment: "293 cells expressing TrkA were serum-starved for 18 hours and then neurotrophins were added for 10 min before cell harvest." (Lee, Ramee, et al. "Regulation of cell survival by secreted proneurotrophins."
Science 294.5548 (2001): 1945-1948). A cultured cell population maintained in vitro: "Rat cortical neurons from 15 day embryos are grown in dissociated cell culture and maintained in vitro for 8–12 weeks" (Dichter, Marc A. "Rat cortical neurons in cell culture: culture methods, cell morphology, electrophysiology, and synapse formation." Brain Research 149.2 (1978): 279-293). A processed material comprised of a collection of cultured cells that has been continuously maintained together in culture and shares a common propagation history. 2013-6-5 MHB: This OBI class was formerly called 'cell culture', but label changed and definition updated following CLO alignment efforts in spring 2013, during which the intent of this class was clarified to refer to portions of a culture or line rather than a complete cell culture or line. PERSON:Matthew Brush cell culture sample PERSON:Matthew Brush The extent of a 'cultured cell population' is restricted only in that all cell members must share a propagation history (ie be derived through a common lineage of passages from an initial culture). In being defined in this way, this class can be used to refer to the populations that researchers actually use in the practice of science - ie are the inputs to culturing, experimentation, and sharing. The cells in such populations will be a relatively uniform population as they have experienced similar selective pressures due to their continuous co-propagation. And this population will also have a single passage number, again owing to their common passaging history. Cultured cell populations represent only a collection of cells (ie do not include media, culture dishes, etc), and include populations of cultured unicellular organisms or cultured multicellular organism cells. They can exist under active culture, stored in a quiescent state for future use, or applied experimentally. cultured cell population screening library PMID: 15615535.J Med Chem. 
2004 Dec 30;47(27):6864-74.A screening library for peptide activated G-protein coupled receptors. 1. The test set. [cdna_library, phage display library] a screening library is a collection of materials engineered to identify qualities of a subset of its members during a screening process? PRS: 22-02-2008: while working on definition of cDNA library and looking at current example of usage, a screening library should be a defined class -> any material library which has input_role in a screening protocol application change biomaterial to material in definition PERSON: Bjoern Peters GROUP: IEDB 7/13/09: Need to clarify if this meets reagent role definition screening library data transformation The application of a clustering protocol to microarray data or the application of a statistical testing method on a primary data set to determine a p-value. A planned process that produces output data from input data. Elisabetta Manduchi Helen Parkinson James Malone Melanie Courtot Philippe Rocca-Serra Richard Scheuermann Ryan Brinkman Tina Hernandez-Boussard data analysis data processing Branch editors data transformation differential expression analysis objective Analyses implemented by the SAM (http://www-stat.stanford.edu/~tibs/SAM), PaGE (www.cbil.upenn.edu/PaGE) or GSEA (www.broad.mit.edu/gsea/) algorithms and software A differential expression analysis objective is a data transformation objective whose input consists of expression levels of entities (such as transcripts or proteins), or of sets of such expression levels, under two or more conditions and whose output reflects which of these are likely to have different expression across such conditions. 
Elisabetta Manduchi PERSON: Elisabetta Manduchi differential expression analysis objective Benjamini and Hochberg false discovery rate correction method Statistical significance of the 8 most represented biological processes (GO level 4) among E7 6 month upregulated genes following analysis with DAVID software; Benjamini-Hochberg FDR (false discovery rate) A data transformation process in which the Benjamini and Hochberg sequential p-value procedure is applied with the aim of controlling the false discovery rate 2011-03-31: [PRS]. specified input and output of dt which were missing Helen Parkinson Philippe Rocca-Serra Helen Parkinson Benjamini and Hochberg false discovery rate correction method k-means clustering A k-means clustering is a data transformation which achieves a class discovery or partitioning objective, which takes as input a collection of objects (represented as points in multidimensional space) and which partitions them into a specified number k of clusters. The algorithm attempts to find the centers of natural clusters in the data. The most common form of the algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternating applications of these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively the centroids no longer change). Elisabetta Manduchi James Malone Philippe Rocca-Serra WEB: http://en.wikipedia.org/wiki/K-means k-means clustering hierarchical clustering A hierarchical clustering is a data transformation which achieves a class discovery objective, which takes a data item as input and builds a hierarchy of clusters.
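The two alternating k-means steps described above (associate each point with the closest centroid, then recalculate centroids, until no point switches clusters) can be sketched in pure Python. The sample points and the choice of the first k points as initial centroids are illustrative assumptions, not part of the ontology definition.

```python
# Minimal k-means sketch: alternate assignment and centroid-update steps
# until convergence (no point switches clusters).

def kmeans(points, k, max_iter=100):
    """Partition `points` (tuples of floats) into k clusters; returns labels."""
    # Deterministic initialisation for illustration: the first k points.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid (squared distance).
        new_labels = []
        for pt in points:
            dists = [sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:   # convergence: no point switched clusters
            break
        labels = new_labels
        # Step 2: recalculate the mean point (centroid) of each cluster.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(kmeans(pts, 2))  # two well-separated groups -> [0, 0, 1, 1]
```

Random or heuristic initialisation, as the definition notes, is more common in practice; the deterministic choice here just keeps the example reproducible.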
The traditional representation of this hierarchy is a tree (visualized by a dendrogram), with the individual input objects at one end (leaves) and a single cluster containing every object at the other (root). James Malone WEB: http://en.wikipedia.org/wiki/Data_clustering#Hierarchical_clustering hierarchical clustering average linkage hierarchical clustering An average linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the average distance between objects from the first cluster and objects from the second cluster. Elisabetta Manduchi PERSON: Elisabetta Manduchi average linkage hierarchical clustering complete linkage hierarchical clustering A complete linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the maximum distance between objects from the first cluster and objects from the second cluster. Elisabetta Manduchi PERSON: Elisabetta Manduchi complete linkage hierarchical clustering single linkage hierarchical clustering A single linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the minimum distance between objects from the first cluster and objects from the second cluster. Elisabetta Manduchi PERSON: Elisabetta Manduchi single linkage hierarchical clustering Benjamini and Yekutieli false discovery rate correction method The expression set was compared univariately between the stroke patients and controls, and a gene list was generated using False Discovery Rate correction (Benjamini and Yekutieli) A data transformation in which the Benjamini and Yekutieli method is applied with the aim of controlling the false discovery rate 2011-03-31: [PRS].
specified input and output of dt which were missing Helen Parkinson Philippe Rocca-Serra Helen Parkinson Benjamini and Yekutieli false discovery rate correction method dimensionality reduction A dimensionality reduction is data partitioning which transforms each input m-dimensional vector (x_1, x_2, ..., x_m) into an output n-dimensional vector (y_1, y_2, ..., y_n), where n is smaller than m. Elisabetta Manduchi James Malone Melanie Courtot Philippe Rocca-Serra data projection PERSON: Elisabetta Manduchi PERSON: James Malone PERSON: Melanie Courtot dimensionality reduction principal components analysis dimensionality reduction A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones. Elisabetta Manduchi James Malone Melanie Courtot Philippe Rocca-Serra pca data reduction PERSON: Elisabetta Manduchi PERSON: James Malone PERSON: Melanie Courtot principal components analysis dimensionality reduction Holm-Bonferroni family-wise error rate correction method t-tests were used with the type I error adjusted for multiple comparisons, Holm's correction (HOLM 1979), and false discovery rate, http://www.genetics.org/cgi/content/full/172/2/1179 a data transformation that performs more than one hypothesis test simultaneously, a closed-test procedure, that controls the familywise error rate for all the k hypotheses at level α in the strong sense. Objective: multiple testing correction 2011-03-14: [PRS]. 
Class Label has been changed to address the conflict with the definition Also added restriction to specify the output to be a FWER adjusted p-value The 'editor preferred term' should be removed Person:Helen Parkinson Philippe Rocca-Serra WEB: http://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method Bonferroni adjustment method Holm-Bonferroni family-wise error rate correction method family wise error rate correction method A family wise error rate correction method is a multiple testing procedure that controls the probability of at least one false positive. 2011-03-31: [PRS]. creating a defined class by specifying the necessary output of dt allows correct classification of FWER dt Monnie McGee Philippe Rocca-Serra FWER correction Dudoit, Sandrine and van der Laan, Mark J. (2008) Multiple Testing Procedures with Applications to Genomics. New York: Springer, p. 19 family wise error rate correction method descriptive statistical calculation objective A descriptive statistical calculation objective is a data transformation objective which concerns any calculation intended to describe a feature of a data set, for example, its center or its variability. Elisabetta Manduchi James Malone Melanie Courtot Monnie McGee PERSON: Elisabetta Manduchi PERSON: James Malone PERSON: Melanie Courtot PERSON: Monnie McGee descriptive statistical calculation objective survival analysis objective Kaplan-Meier data transformation A data transformation objective that aims to model time-to-event data (where events are e.g.
death and/or disease recurrence); the purpose of survival analysis is to model the underlying distribution of event times and to assess the dependence of the event time on other explanatory variables PERSON: James Malone PERSON: Tina Boussard survival analysis http://en.wikipedia.org/wiki/Survival_analysis survival analysis objective multiple testing correction method A multiple testing correction method is a hypothesis test performed simultaneously on M > 1 hypotheses. Multiple testing procedures produce a set of rejected hypotheses that is an estimate for the set of false null hypotheses while controlling for a suitably defined Type I error rate Monnie McGee multiple testing procedure PAPER: Dudoit, Sandrine and van der Laan, Mark J. (2008) Multiple Testing Procedures with Applications to Genomics. New York: Springer, pp. 9-10. multiple testing correction method logarithmic transformation A logarithmic transformation is a data transformation consisting of the application of the logarithm function with a given base a (where a>0 and a is not equal to 1) to a (one dimensional) positive real number input. The logarithm function with base a can be defined as the inverse of the exponential function with the same base. See e.g. http://en.wikipedia.org/wiki/Logarithm. Elisabetta Manduchi WEB: http://en.wikipedia.org/wiki/Logarithm logarithmic transformation regression analysis method Regression analysis is a descriptive statistics technique that examines the relation of a dependent variable (response variable) to specified independent variables (explanatory variables). Regression analysis can be used as a descriptive method of data analysis (such as curve fitting) without relying on any assumptions about underlying processes generating the data.
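As a minimal illustration of the regression setting just described (one response variable examined against one explanatory variable), the closed-form least-squares estimates can be computed directly. The data below are invented for the example; this is a sketch, not the ontology's formal characterisation of regression analysis.

```python
# Simple ordinary-least-squares sketch: fit y ~ a + b*x, where y is the
# dependent (response) variable and x the independent (explanatory) one.

def simple_ols(x, y):
    """Return (intercept a, slope b) minimising the sum of squared residuals."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # b = cov(x, y) / var(x);  a = mean(y) - b * mean(x)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
print(simple_ols(x, y))    # -> (1.0, 2.0)
```

Used descriptively, as the text notes, the fitted line is just a curve-fitting summary and carries no assumption about the process that generated the data.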
Date:2013-11-15 Person: AGB,PRS Adding restrictions, specifying model + parameter estimation process change of label from 'regression analysis method' to 'regression analysis' Alejandra Gonzalez-Beltran Philippe Rocca-Serra Tina Hernandez-Boussard BOOK: Richard A. Berk, Regression Analysis: A Constructive Critique, Sage Publications (2004) 978-0761929048 regression analysis regression analysis method principal component regression The Principal Component Regression method is a regression analysis method that combines the Principal Component Analysis (PCA) spectral decomposition with an Inverse Least Squares (ILS) regression method to create a quantitative model for complex samples. Unlike quantitation methods based directly on Beer's Law which attempt to calculate the absorptivity coefficients for the constituents of interest from a direct regression of the constituent concentrations onto the spectroscopic responses, the PCR method regresses the concentrations on the PCA scores. Tina Hernandez-Boussard WEB: http://www.thermo.com/com/cda/resources/resources_detail/1,2166,13414,00.html principal component regression data visualization Generation of a heatmap from a microarray dataset A planned process that creates images, diagrams or animations from the input data. Elisabetta Manduchi James Malone Melanie Courtot Tina Boussard data encoding as image visualization PERSON: Elisabetta Manduchi PERSON: James Malone PERSON: Melanie Courtot PERSON: Tina Boussard Possible future hierarchy might include this: information_encoding >data_encoding >>image_encoding data visualization mode calculation A mode calculation is a descriptive statistics calculation in which the mode, the most common value in a data set, is calculated. It is most often used as a measure of center for discrete data.
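The mode calculation defined above, together with the closely related median calculation (sort the observations; take the middle value for odd n, or the average of the two middle values for even n), can be sketched in a few lines. The example data are illustrative.

```python
# Descriptive-statistics sketches: mode (most common value in a data set)
# and median (the 0.5 quantile of the sorted observations).

from collections import Counter

def mode(data):
    """Return the most common value (first-seen wins on ties)."""
    return Counter(data).most_common(1)[0][0]

def median(data):
    """Middle value for odd n; average of the two middle values for even n."""
    s = sorted(data)            # first, sort observations in increasing order
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

counts = [2, 3, 3, 5, 7, 8]
print(mode(counts))    # -> 3 (appears twice)
print(median(counts))  # even n: average of 3 and 5 -> 4.0
```

Python's standard `statistics` module provides equivalent `mode` and `median` functions; the hand-rolled versions here just make the calculations named by these ontology terms explicit.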
James Malone Monnie McGee PERSON: James Malone PERSON: Monnie McGee From Monnie's file comments - need to add center_calculation role but it doesn't exist yet - (editor note added by James Jan 2008) mode calculation median calculation A median calculation is a descriptive statistics calculation in which the midpoint of the data set (the 0.5 quantile) is calculated. First, the observations are sorted in increasing order. For an odd number of observations, the median is the middle value of the sorted data. For an even number of observations, the median is the average of the two middle values. James Malone Monnie McGee PERSON: James Malone PERSON: Monnie McGee From Monnie's file comments - need to add center_calculation role but it doesn't exist yet - (editor note added by James Jan 2008) median calculation agglomerative hierarchical clustering An agglomerative hierarchical clustering is a hierarchical clustering which starts with separate clusters and then successively combines these clusters until there is only one cluster remaining. Elisabetta Manduchi James Malone bottom-up hierarchical clustering PERSON: Elisabetta Manduchi agglomerative hierarchical clustering divisive hierarchical clustering A divisive hierarchical clustering is a hierarchical clustering which starts with a single cluster and then successively splits resulting clusters until only clusters of individual objects remain. Elisabetta Manduchi James Malone top-down hierarchical clustering PERSON: Elisabetta Manduchi divisive hierarchical clustering false discovery rate correction method The false discovery rate is a data transformation used in multiple hypothesis testing to correct for multiple comparisons. It controls the expected proportion of incorrectly rejected null hypotheses (type I errors) in a list of rejected hypotheses. 
It is a less conservative comparison procedure with greater power than familywise error rate (FWER) control, at a cost of increasing the likelihood of obtaining type I errors. 2011-03-31: [PRS]. creating a defined class by specifying the necessary output of dt allows correct classification of FDR dt Monnie McGee Philippe Rocca-Serra FDR correction method Dudoit, Sandrine and van der Laan, Mark J. (2008) Multiple Testing Procedures with Applications to Genomics. New York: Springer, p. 21 and http://www.wikidoc.org/index.php/False_discovery_rate false discovery rate correction method data transformation objective normalize objective An objective specification to transform input data into output data Modified definition in 2013 Philly OBI workshop James Malone PERSON: James Malone data transformation objective data normalization objective Quantile transformation which has normalization objective can be used for expression microarray assay normalization and it is referred to as "quantile normalization", according to the procedure described e.g. in PMID 12538238. A normalization objective is a data transformation objective where the aim is to remove systematic sources of variation to put the data on equal footing in order to create a common base for comparisons. Elisabetta Manduchi Helen Parkinson James Malone PERSON: Elisabetta Manduchi PERSON: Helen Parkinson PERSON: James Malone data normalization objective correction objective Type I error correction A correction objective is a data transformation objective where the aim is to correct for error, noise or other impairments to the input of the data transformation or derived from the data transformation itself James Malone PERSON: James Malone PERSON: Melanie Courtot correction objective normalization data transformation A normalization data transformation is a data transformation that has objective normalization.
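The Benjamini-Hochberg sequential p-value procedure referenced in this file can be sketched as follows: sort the m p-values ascending and reject the hypotheses with the i smallest p-values, where i is the largest rank with p_(i) <= (i/m)*q. The FDR level q = 0.05 and the p-values below are illustrative choices, not part of the ontology.

```python
# Sketch of the Benjamini-Hochberg step-up procedure for controlling the
# false discovery rate at level q.

def benjamini_hochberg(pvalues, q=0.05):
    """Return booleans (in input order): True where the null is rejected."""
    m = len(pvalues)
    # Sort p-values ascending, remembering original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest 1-based rank i with p_(i) <= (i/m) * q.
    max_i = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            max_i = rank
    # Reject the hypotheses carrying the max_i smallest p-values.
    rejected = [False] * m
    for idx in order[:max_i]:
        rejected[idx] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(pvals, q=0.05))
```

The Benjamini-Yekutieli variant mentioned earlier differs only in replacing q by q divided by the harmonic sum 1 + 1/2 + ... + 1/m, which makes it valid under arbitrary dependence among the tests.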
James Malone PERSON: James Malone normalization data transformation averaging data transformation An averaging data transformation is a data transformation that has objective averaging. James Malone PERSON: James Malone averaging data transformation partitioning data transformation A partitioning data transformation is a data transformation that has objective partitioning. James Malone PERSON: James Malone partitioning data transformation partitioning objective A k-means clustering which has partitioning objective is a data transformation in which the input data is partitioned into k output sets. A partitioning objective is a data transformation objective where the aim is to generate a collection of disjoint non-empty subsets whose union equals a non-empty input set. Elisabetta Manduchi James Malone PERSON: Elisabetta Manduchi partitioning objective class discovery data transformation A class discovery data transformation (sometimes called unsupervised classification) is a data transformation that has objective class discovery. James Malone clustering data transformation unsupervised classification data transformation PERSON: James Malone class discovery data transformation center calculation objective A mean calculation which has center calculation objective is a data transformation in which the center of the input data is discovered through the calculation of a mean average. A center calculation objective is a data transformation objective where the aim is to calculate the center of an input data set. James Malone PERSON: James Malone center calculation objective class discovery objective A class discovery objective (sometimes called unsupervised classification) is a data transformation objective where the aim is to organize input data (typically vectors of attributes) into classes, where the number of classes and their specifications are not known a priori. Depending on usage, the class assignment can be definite or probabilistic. 
James Malone clustering objective discriminant analysis objective unsupervised classification objective PERSON: Elisabetta Manduchi PERSON: James Malone class discovery objective center calculation data transformation A center calculation data transformation is a data transformation that has the objective of center calculation. James Malone PERSON: James Malone center calculation data transformation descriptive statistical calculation data transformation A descriptive statistical calculation data transformation is a data transformation that has objective descriptive statistical calculation and which concerns any calculation intended to describe a feature of a data set, for example, its center or its variability. James Malone PERSON: James Malone descriptive statistical calculation data transformation error correction objective Application of a multiple testing correction method An error correction objective is a data transformation objective where the aim is to remove (correct for) erroneous contributions arising from the input data, or the transformation itself. James Malone, Helen Parkinson PERSON: James Malone error correction objective gene list visualization A data visualization which has input of a gene list and produces an output of a report graph which is capable of rendering data of this type. James Malone gene list visualization survival analysis data transformation A data transformation which has the objective of performing survival analysis.
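One concrete survival analysis data transformation is the Kaplan-Meier product-limit estimator named earlier in this file: S(t) is the product, over event times t_i <= t, of (1 - d_i/n_i), where d_i events occur among the n_i subjects still at risk, and censored observations reduce the risk set without contributing events. The toy event/censoring times below are invented for illustration.

```python
# Sketch of the Kaplan-Meier product-limit estimator for right-censored
# time-to-event data.

def kaplan_meier(times, events):
    """times: observation times; events: 1 = event (e.g. death), 0 = censored.
    Returns [(event_time, survival_probability), ...] in time order."""
    data = sorted(zip(times, events))
    n = len(data)
    at_risk = n
    s = 1.0
    curve = []
    i = 0
    while i < n:
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        total_at_t = sum(1 for tt, _ in data if tt == t)
        if deaths:
            s *= 1 - deaths / at_risk       # product-limit update
            curve.append((t, s))
        at_risk -= total_at_t               # events and censorings leave the risk set
        i += total_at_t
    return curve

# Five subjects: events at t=2 and t=5, censoring at t=3 and t=7.
print(kaplan_meier([2, 3, 5, 7, 7], [1, 0, 1, 0, 0]))
```

Assessing how event times depend on explanatory variables, the other aim the survival analysis objective mentions, would need a regression model such as Cox proportional hazards, which is beyond this sketch.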
James Malone PERSON: James Malone survival analysis data transformation chi square test The chi-square test is a data transformation with the objective of statistical hypothesis testing, in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough. Under negotiation with OBI, hence definition and definition source are missing from this class PERSON: James Malone PERSON: Tina Boussard chi square test ANOVA ANOVA, or analysis of variance, is a data transformation in which a statistical test is performed of whether the means of several groups are all equal. AGB and PRS augmented the class with formal definitions as part of the STATO extension Alejandra Gonzalez-Beltran James Malone Philippe Rocca-Serra Analysis of Variance stat.anova() ANOVA observation design PMID: 12387964.Lancet. 2002 Oct 12;360(9340):1144-9.Deficiency of antibacterial peptides in patients with morbus Kostmann: an observation study. An observation design is a study design in which subjects are monitored in the absence of any active intervention by experimentalists. Philippe Rocca-Serra OBI branch derived observation design extraction nucleic acid extraction using phenol chloroform A material separation in which a desired component of an input material is separated from the remainder Currently the output of material processing is defined as the molecular entity that is the main component of the output material entity, rather than the material entity that has grain molecular entity. 'nucleic acid extract' is the output of 'nucleic acid extraction' and has grain 'nucleic acid'. However, the output of 'nucleic acid extraction' is 'nucleic acid' rather than 'nucleic acid extract'. We are aware of this issue and will work it out in the future.
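As a sketch of the chi-square test described above, the Pearson X² statistic for a contingency table can be computed directly; under the null hypothesis of independence it asymptotically follows a chi-square distribution with (rows-1)*(cols-1) degrees of freedom. The observed counts below are illustrative.

```python
# Pearson chi-square statistic for a contingency table of observed counts.

def chi_square_statistic(table):
    """table: list of rows of observed counts. Returns the X^2 statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / grand
            x2 += (observed - expected) ** 2 / expected
    return x2

observed = [[10, 20],
            [20, 10]]
print(chi_square_statistic(observed))
```

Completing the test means comparing the statistic against the chi-square distribution's tail (e.g. via a statistics library) to obtain a p-value; the asymptotic approximation improves, as the definition says, with larger sample sizes.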
Person:Bjoern Peters Philippe Rocca-Serra extraction group randomization PMID: 18349405. Randomization reveals unexpected acute leukemias in Southwest Oncology Group prostate cancer trial. J Clin Oncol. 2008 Mar 20;26(9):1532-6. A group assignment which relies on chance to assign materials to a group of materials in order to avoid bias in the experimental set-up. Philippe Rocca-Serra adapted from wikipedia [http://en.wikipedia.org/wiki/Randomization] group randomization nucleic acid hybridization PMID: 18555787.Quantitative analysis of DNA hybridization in a flowthrough microarray for molecular testing. Anal Biochem. 2008 May 27. a planned process by which totally or partially complementary, single-stranded nucleic acids are combined into a single molecule called a heteroduplex or homoduplex to an extent depending on the amount of complementarity. Philippe Rocca-Serra adapted from wikipedia [http://en.wikipedia.org/wiki/Nucleic_acid_hybridization] hybridization assay nucleic acid hybridization flow cell Biofilm Flow Cell Apparatus in the fluidic subsystem where the sheath and sample meet. Can be one of several types: jet-in-air, quartz cuvette, or a hybrid of the two. The sample flows through the center of a fluid column of sheath fluid in the flow cell. Person:John Quinn flow_cell http://www.flocyte.com/FRTP/Resources/flow_cytometry_glossary.htm flow cell flow cytometer FACS Calibur A flow_cytometer is an instrument for counting, examining and sorting microscopic particles in suspension. It allows simultaneous multiparametric analysis of the physical and/or chemical characteristics of single cells flowing through an optical and/or electronic detection apparatus. A flow cytometer is an instrument that can be used to quantitatively measure the properties of individual cells in a flowing medium.
John Quinn http://en.wikipedia.org/wiki/Flow_cytometer flow cytometer light source A light source is an optical subsystem that provides light for use in a distant area using a delivery system (e.g., fiber optics). Light sources may include one of a variety of lamps (e.g., xenon, halogen, mercury). Most light sources are operated from line power, but some may be powered from batteries. They are mostly used in endoscopic, microscopic, and other examination and/or in surgical procedures. The light source is part of the optical subsystem. In a flow cytometer the light source directs high intensity light at particles at the interrogation point. The light source in a flow cytometer is usually a laser. Elizabeth M. Goralczyk John Quinn Olga Tchuvatkina Practical Flow Cytometry 4th Edition, Howard Shapiro, ISBN-10: 0471411256, ISBN-13: 978-0471411253 light source obscuration bar obscuration bar in a flow cytometer An obscuration bar is an optical subsystem which is a strip of metal or other material that serves to block out direct light from the illuminating beam. The obscuration bar prevents the bright light scattered in the forward directions from burning out the collection device. Daniel Schober Flow Cytometry: First Principles, by Alice Longobardi Givan, ISBN-10: 0471382248, ISBN-13: 978-0471382249 John Quinn obscuration bar optical filter 720 LP filter, 580/30 BP filter An optical filter is an optical subsystem that selectively transmits light having certain properties (often, a particular range of wavelengths, that is, range of colours of light), while blocking the remainder. They are commonly used in photography, in many optical instruments, and to colour stage lighting. Optical filters can be arranged to segregate and collect light by wavelength.
John Quinn http://en.wikipedia.org/wiki/Optical_filter optical filter photodetector A photomultiplier tube, a photo diode A photodetector is a device used to detect and measure the intensity of radiant energy through photoelectric action. In a cytometer, photodetectors measure either the number of photons of laser light scattered on impact with a cell (for example), or the fluorescence emitted by excitation of a fluorescent dye. John Quinn http://einstein.stanford.edu/content/glossary/glossary.html photodetector DNA sequencer ABI 377 DNA Sequencer, ABI 310 DNA Sequencer A DNA sequencer is an instrument that determines the order of deoxynucleotides in deoxyribonucleic acid sequences. Trish Whetzel MO DNA sequencer hybridization chamber Glass Array Hybridization Cassette A device which is used to maintain constant contact of a liquid on an array. This can be either a glass vial or slide. Trish Whetzel MO_563 hybridization_chamber hybridization chamber cytometer A cytometer is an instrument for counting and measuring cells. Melanie Courtot http://medical.merriam-webster.com/medical/cytometer cytometer microarray An Affymetrix U133 array is a microarray. Microarrays include 1- and 2-color arrays, custom and commercial arrays (e.g., Affymetrix, Agilent, Nimblegen, Illumina, etc.) for expression profiling, DNA variant detection, protein binding, and other genomic and functional genomic assays. A processed material that is made to be used in an analyte assay. It consists of a physical immobilisation matrix in which substances that bind the analyte are placed in regular spatial positions. Daniel Schober PERSON: Chris Stoeckert microarray DNA microarray Moran G, Stokes C, Thewes S, Hube B, Coleman DC, Sullivan D (2004). "Comparative genomics using Candida albicans DNA microarrays reveals absence and divergence of virulence-associated genes in Candida dubliniensis". Microbiology 150: 3363-3382. doi:10.1099/mic.0.27221-0.
PMID 15470115 A DNA-microarray is a microarray that is used as a physical 2D immobilisation matrix for DNA sequences. DNA microarray-bound DNA fragments are used as targets for a hybridising probed sample. PERSON: Daniel Schober PERSON: Frank Gibson DNA Chip DNA-array Web:<http://en.wikipedia.org/wiki/DNA_microarray>@2008/03/03 DNA microarray droplet sorter A droplet sorter is part_of a flow cytometer sorter; it converts the carrier fluid stream into individual droplets, and these droplets are directed into separate locations for recovery (enriching the original sample for particles of interest based on qualities determined by gating) or disposal. OBI Instrument branch OBI Instrument branch droplet sorter study design a matched pairs study design describes criteria by which subjects are identified as pairs; the pairs then undergo the same protocols, and the data generated are analyzed by comparing the differences between the paired subjects, which constitute the results of the executed study design. A plan specification comprised of protocols (which may specify how and what kinds of data will be gathered) that are executed as part of an investigation and is realized during a study design execution. Editor note: there is at least an implicit restriction on the kind of data transformations that can be done based on the measured data available. PERSON: Chris Stoeckert experimental design rediscussed at length (MC/JF/BP, 12/9/08). The definition was clarified to differentiate it from protocol. study design This statement can actually be inferred from 'plan specification', because 'independent variable specification' is a subclass of 'is part of' some 'plan specification' repeated measure design PMID: 10959922. J Biopharm Stat. 2000 Aug;10(3):433-45. Equivalence in test assay method comparisons for the repeated-measure, matched-pair design in medical device studies: statistical considerations. 
a study design which uses the same individuals and exposes them to a set of conditions. The effects of order and practice can be confounding factors in such designs. PlanAndPlannedProcess Branch http://www.holah.karoo.net/experimentaldesigns.htm repeated measure design cross over design PMID: 17601993-Objective: HIV-infected patients with lipodystrophy (HIV-lipodystrophy) are insulin resistant and have elevated plasma free fatty acid (FFA) concentrations. We aimed to explore the mechanisms underlying FFA-induced insulin resistance in patients with HIV-lipodystrophy. Research Design and Methods: Using a randomized placebo-controlled cross-over design, we studied the effects of an overnight acipimox-induced suppression of FFA on glucose and FFA metabolism by using stable isotope labelled tracer techniques during basal conditions and a two-stage euglycemic, hyperinsulinemic clamp (20 mU insulin/m(2)/min; 50 mU insulin/m(2)/min) in nine patients with nondiabetic HIV-lipodystrophy. All patients received antiretroviral therapy. Biopsies from the vastus lateralis muscle were obtained during each stage of the clamp. Results: Acipimox treatment reduced basal FFA rate of appearance by 68.9% (52.6%-79.5%) and decreased plasma FFA concentration by 51.6% (42.0%-58.9%), (both, P < 0.0001). Endogenous glucose production was not influenced by acipimox. During the clamp the increase in glucose-uptake was significantly greater after acipimox treatment compared to placebo (acipimox: 26.85 (18.09-39.86) vs placebo: 20.30 (13.67-30.13) mumol/kg/min; P < 0.01). Insulin increased phosphorylation of Akt (Thr(308)) and GSK-3beta (Ser(9)), decreased phosphorylation of glycogen synthase (GS) site 3a+b and increased GS-activity (I-form) in skeletal muscle (P < 0.01). Acipimox decreased phosphorylation of GS (site 3a+b) (P < 0.02) and increased GS-activity (P < 0.01) in muscle. 
Conclusion: The present study provides direct evidence that suppression of lipolysis in patients with HIV-lipodystrophy improves insulin-stimulated peripheral glucose-uptake. The increased glucose-uptake may in part be explained by increased dephosphorylation of GS (site 3a+b) resulting in increased GS activity. a repeated measure design which ensures that experimental units receive, in sequence, the treatment (or the control), and then, after a specified time interval (aka *wash-out period*), switch to the control (or treatment). In this design, subjects (patients in human context) serve as their own controls, and randomization may be used to determine the order in which a subject receives the treatment and control. Philippe Rocca-Serra (source: http://www.sbu.se/Filer/Content0/publikationer/1/literaturesearching_1993/glossary.html) cross over design matched pairs design PMID: 17288613-ABSTRACT: BACKGROUND: Physicians in Canadian emergency departments (EDs) annually treat 185,000 alert and stable trauma victims who are at risk for cervical spine (C-spine) injury. However, only 0.9% of these patients have suffered a cervical spine fracture. Current use of radiography is not efficient. The Canadian C-Spine Rule is designed to allow physicians to be more selective and accurate in ordering C-spine radiography, and to rapidly clear the C-spine without the need for radiography in many patients. The goal of this phase III study is to evaluate the effectiveness of an active strategy to implement the Canadian C-Spine Rule into physician practice. Specific objectives are to: 1) determine clinical impact, 2) determine sustainability, 3) evaluate performance, and 4) conduct an economic evaluation. METHODS: We propose a matched-pair cluster design study that compares outcomes during three consecutive 12-month before, after, and decay periods at six pairs of intervention and control sites. 
These 12 hospital ED sites will be stratified as teaching or community hospitals, matched according to baseline C-spine radiography ordering rates, and then allocated within each pair to either intervention or control groups. During the after period at the intervention sites, simple and inexpensive strategies will be employed to actively implement the Canadian C-Spine Rule. The following outcomes will be assessed: 1) measures of clinical impact, 2) performance of the Canadian C-Spine Rule, and 3) economic measures. During the 12-month decay period, implementation strategies will continue, allowing us to evaluate the sustainability of the effect. We estimate a sample size of 4,800 patients in each period in order to have adequate power to evaluate the main outcomes. DISCUSSION: Phase I successfully derived the Canadian C-Spine Rule and phase II confirmed the accuracy and safety of the rule, hence, the potential for physicians to improve care. What remains unknown is the actual change in clinical behaviors that can be affected by implementation of the Canadian C-Spine Rule, and whether implementation can be achieved with simple and inexpensive measures. We believe that the Canadian C-Spine Rule has the potential to significantly reduce health care costs and improve the efficiency of patient flow in busy Canadian EDs. A matched pairs design is a study design which uses groups of individuals matched to each other based on a set of criteria; one member of each pair receives one treatment while the other member receives the other treatment. Philippe Rocca-Serra http://www.holah.karoo.net/experimentaldesigns.htm matched pairs design parallel group design PMID: 17408389-Purpose: Proliferative vitreoretinopathy (PVR) is the most important reason for blindness following retinal detachment. Presently, vitreous tamponades such as gas or silicone oil cannot contact the lower part of the retina. 
A heavier-than-water tamponade displaces the inflammatory and PVR-stimulating environment from the inferior area of the retina. The Heavy Silicone Oil versus Standard Silicone Oil Study (HSO Study) is designed to answer the question of whether a heavier-than-water tamponade improves the prognosis of eyes with PVR of the lower retina. Methods: The HSO Study is a multicentre, randomized, prospective controlled clinical trial comparing two endotamponades within a two-arm parallel group design. Patients with inferiorly and posteriorly located PVR are randomized to either heavy silicone oil or standard silicone oil as a tamponading agent. Three hundred and fifty consecutive patients are recruited per group. After intraoperative re-attachment, patients are randomized to either standard silicone oil (1000 cSt or 5000 cSt) or Densiron((R)) as a tamponading agent. The main endpoint criteria are complete retinal attachment at 12 months and change of visual acuity (VA) 12 months postoperatively compared with the preoperative VA. Secondary endpoints include complete retinal attachment before endotamponade removal, quality of life analysis and the number of retina affecting re-operations within 1 year of follow-up. Results: The design and early recruitment phase of the study are described. Conclusions: The results of this study will uncover whether or not heavy silicone oil improves the prognosis of eyes with PVR. A parallel group design (also known as an independent measure design) is a study design which uses a unique set of experimental units in each experimental group; in other words, no two individuals are shared between experimental groups. 
Subjects of a treatment group receive a unique combination of independent variable values making up a treatment. Philippe Rocca-Serra independent measure design http://www.holah.karoo.net/experimentaldesigns.htm parallel group design randomized complete block design http://www.stats.gla.ac.uk/steps/glossary/anova.html (A researcher is carrying out a study of the effectiveness of four different skin creams for the treatment of a certain skin disease. He has eighty subjects and plans to divide them into 4 treatment groups of twenty subjects each. Using a randomised blocks design, the subjects are assessed and put in blocks of four according to how severe their skin condition is; the four most severe cases are the first block, the next four most severe cases are the second block, and so on to the twentieth block. The four members of each block are then randomly assigned, one to each of the four treatment groups. http://www.stats.gla.ac.uk/steps/glossary/anova.html#rbd) A randomized complete block design is_a study design which randomly assigns treatments to blocks. The number of units per block equals the number of treatments, so each block receives each treatment exactly once (hence the qualifier 'complete'). The design was originally devised for field trials in agronomy and agriculture. The analysis assumes that there is no interaction between block and treatment. The method was later adopted in other settings. Thus, the randomised complete block design is a design in which the subjects are matched according to a variable which the experimenter wishes to control. The subjects are put into groups (blocks) of the same size as the number of treatments. The members of each block are then randomly assigned to different treatment groups. 
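The block-then-randomize procedure described above can be sketched in Python. This is a minimal illustration, not part of the ontology; the block and treatment names are invented for the example.

```python
import random

def randomized_complete_block_design(blocks, treatments, seed=42):
    """For each block, randomly order the treatments so that every block
    receives each treatment exactly once (the 'complete' property)."""
    rng = random.Random(seed)  # fixed seed only to make the layout reproducible
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # randomization is restricted to within the block
        layout[block] = order
    return layout

# Illustrative use, mirroring the skin-cream example: 4 treatments,
# 20 blocks of 4 subjects grouped by severity of the condition.
design = randomized_complete_block_design(
    [f"block{i}" for i in range(1, 21)], ["A", "B", "C", "D"])
```

Each block's list gives the treatment assignment for its four subjects; because every block contains every treatment, block effects cancel out of treatment comparisons.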
Philippe Rocca-Serra http://www.tufts.edu/~gdallal/ranblock.htm randomized complete block design 2 latin square design PMID: 17582121-Our objective was to examine the effects of dietary cation-anion difference (DCAD) with different concentrations of dietary crude protein (CP) on performance and acid-base status in early lactation cows. Six lactating Holstein cows averaging 44 d in milk were used in a 6 x 6 Latin square design with a 2 x 3 factorial arrangement of treatments: DCAD of -3, 22, or 47 milliequivalents (Na + K - Cl - S)/100 g of dry matter (DM), and 16 or 19% CP on a DM basis. Linear increases with DCAD occurred in DM intake, milk fat percentage, 4% fat-corrected milk production, milk true protein, milk lactose, and milk solids-not-fat. Milk production itself was unaffected by DCAD. Jugular venous blood pH, base excess and HCO3(-) concentration, and urine pH increased, but jugular venous blood Cl- concentration, urine titratable acidity, and net acid excretion decreased linearly with increasing DCAD. An elevated ratio of coccygeal venous plasma essential AA to nonessential AA with increasing DCAD indicated that N metabolism in the rumen was affected, probably resulting in more microbial protein flowing to the small intestine. Cows fed 16% CP had lower urea N in milk than cows fed 19% CP; the same was true for urea N in coccygeal venous plasma and urine. Dry matter intake, milk production, milk composition, and acid-base status did not differ between the 16 and 19% CP treatments. It was concluded that DCAD affected DM intake and performance of dairy cows in early lactation. Feeding 16% dietary CP to cows in early lactation, compared with 19% CP, maintained lactation performance while reducing urea N excretion in milk and urine. 
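A Latin square arrangement like the 6 x 6 design used in the dairy cow study above can be sketched as follows. The cyclic construction shown is one standard way to build a basic square; in a real study the rows, columns and symbols would additionally be randomized, and the treatment labels here are placeholders.

```python
def latin_square(treatments):
    """Cyclic construction: row i is the treatment list rotated by i,
    so each treatment occurs exactly once per row and once per column."""
    n = len(treatments)
    return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]

# A 6 x 6 square, as in the cited study (rows = cows, columns = periods)
square = latin_square(["T1", "T2", "T3", "T4", "T5", "T6"])
```

The once-per-row, once-per-column property is what lets the two blocking variables (e.g. cow and period) be controlled simultaneously.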
Latin square design is_a study design which, in its simplest form, allows controlling for 2 nuisance variables (also known as blocking variables). The 2 nuisance factors are divided into a tabular grid with the property that each row and each column receives each treatment exactly once. Philippe Rocca-Serra Adapted from: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3321.htm latin square design 3 graeco latin square design PMID: 6846242-Beaton et al (Am J Clin Nutr 1979;32:2546-59) reported on the partitioning of variance in 1-day dietary data for the intake of energy, protein, total carbohydrate, total fat, classes of fatty acids, cholesterol, and alcohol. Using the same food intake data and the expanded National Heart, Lung and Blood Institute food composition data base, these analyses of sources of variance have been expanded to include classes of carbohydrate, vitamin A, vitamin C, thiamin, riboflavin, niacin, calcium, iron, total ash, caffeine, and crude fiber. The analyses relate to observed intakes (replicated six times) of 30 adult males and 30 adult females obtained under a paired Graeco-Latin square design with sequence of interview, interviewer, and day of the week as determinants. Neither sequence nor interviewer made consistent contribution to variance. In females, day of the week had a significant effect for several nutrients. The major partitioning of variance was between interindividual variation (between subjects) and intraindividual variation (within subjects) which included both true day-to-day variation in intake and methodological variation. For all except caffeine, the intraindividual variability of 1-day data was larger than the interindividual variability. For vitamin A, almost all of the variance was associated with day-to-day variability. One-day data provide a very inadequate estimate of usual intake of individuals. 
In the design of nutrition studies it is critical that the intended use of dietary data be a major consideration in deciding on methodology. There is no ideal dietary method. There may be preferred methods for particular purposes. A Graeco-Latin square design is a study design which extends the Latin square design by superimposing a second, orthogonal Latin square, allowing an additional nuisance variable to be controlled. Philippe Rocca-Serra Adapted from: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3321.htm Editor note: only 2 articles in PubMed -> probably irrelevant. Euler square design orthogonal latin squares design graeco latin square design 4 hyper graeco latin square design PRS to do Philippe Rocca-Serra Adapted from: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3321.htm Editor note: no example found in PubMed -> not in use in the community. hyper graeco latin square design 1 2 factorial design PMID: 17582121-Our objective was to examine the effects of dietary cation-anion difference (DCAD) with different concentrations of dietary crude protein (CP) on performance and acid-base status in early lactation cows. Six lactating Holstein cows averaging 44 d in milk were used in a 6 x 6 Latin square design with a 2 x 3 factorial arrangement of treatments: DCAD of -3, 22, or 47 milliequivalents (Na + K - Cl - S)/100 g of dry matter (DM), and 16 or 19% CP on a DM basis. Linear increases with DCAD occurred in DM intake, milk fat percentage, 4% fat-corrected milk production, milk true protein, milk lactose, and milk solids-not-fat. Milk production itself was unaffected by DCAD. Jugular venous blood pH, base excess and HCO3(-) concentration, and urine pH increased, but jugular venous blood Cl- concentration, urine titratable acidity, and net acid excretion decreased linearly with increasing DCAD. An elevated ratio of coccygeal venous plasma essential AA to nonessential AA with increasing DCAD indicated that N metabolism in the rumen was affected, probably resulting in more microbial protein flowing to the small intestine. 
Cows fed 16% CP had lower urea N in milk than cows fed 19% CP; the same was true for urea N in coccygeal venous plasma and urine. Dry matter intake, milk production, milk composition, and acid-base status did not differ between the 16 and 19% CP treatments. It was concluded that DCAD affected DM intake and performance of dairy cows in early lactation. Feeding 16% dietary CP to cows in early lactation, compared with 19% CP, maintained lactation performance while reducing urea N excretion in milk and urine. factorial design is_a study design which is used to evaluate two or more factors simultaneously. The treatments are combinations of levels of the factors. The advantages of factorial designs over one-factor-at-a-time experiments are that they are more efficient and they allow interactions to be detected. In statistics, a factorial design experiment is an experiment whose design consists of two or more factors, each with discrete possible values or levels, and whose experimental units take on all possible combinations of these levels across all such factors. Such an experiment allows studying the effect of each factor on the response variable, as well as the effects of interactions between factors on the response variable. Philippe Rocca-Serra http://www.stats.gla.ac.uk/steps/glossary/anova.html#facdes and from Wikipedia (01/03/2007): http://en.wikipedia.org/wiki/Factorial_experiment factorial design 2 2x2 factorial design PMID: 17561240-The present experiment evaluates the effects of intermittent exposure to a social stimulus on ethanol and water drinking in rats. Four groups of rats were arranged in a 2x2 factorial design with 2 levels of Social procedure (Intermittent Social vs Continuous Social) and 2 levels of sipper Liquid (Ethanol vs Water). Intermittent Social groups received 35 trials per session. Each trial consisted of the insertion of the sipper tube for 10 s followed by lifting of the guillotine door for 15 s. 
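The defining property of a factorial design — treatments are all possible combinations of factor levels — can be sketched with the 2 x 3 arrangement from the dairy cow study above. The factor names are illustrative; only the levels come from the cited abstract.

```python
from itertools import product

# Two factors with 3 and 2 levels; the levels are taken from the cited study
# (DCAD of -3, 22, or 47 meq/100 g DM crossed with 16 or 19% CP).
factors = {
    "DCAD_meq_per_100g_DM": [-3, 22, 47],
    "crude_protein_pct_DM": [16, 19],
}

# A full factorial design crosses every level of every factor
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]
# 3 x 2 = 6 treatment combinations
```

Because every combination is present, the main effect of each factor and the DCAD-by-CP interaction can both be estimated from the same runs, which is the efficiency advantage over one-factor-at-a-time experiments noted above.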
The guillotine door separated the experimental rat from the conspecific rat in the wire mesh cage during the 60 s inter-trial interval. The Continuous Social groups received similar procedures except that the guillotine door was raised during the entire duration of the session. For the Ethanol groups, the concentrations of ethanol in the sipper [3, 4, 6, 8, 10, 12, 14, and 16% (vol/vol)] increased across sessions, while the Water groups received 0% ethanol (water) in the sipper throughout the experiment. Both Social procedures induced more intake of ethanol than water. The Intermittent Social procedure induced more ethanol intake at the two highest ethanol concentration blocks (10-12% and 14-16%) than the Continuous Social procedure, but this effect was not observed with water. Effects of social stimulation on ethanol drinking are discussed. a factorial design which has 2 experimental factors (aka independent variables) and 2 factor levels per experimental factor Philippe Rocca-Serra PMID: 17561240 2x2 factorial design fractional factorial design A fractional factorial design is_a study design in which only an adequately chosen fraction of the treatment combinations required for the complete factorial experiment is selected to be run. Philippe Rocca-Serra http://www.itl.nist.gov/div898/handbook/pri/section3/pri334.htm From ASQC (1983) Glossary & Tables for Statistical Quality Control fractional factorial design dye swap design PMID: 17411393-Dye-specific bias effects, commonly observed in the two-color microarray platform, are normally corrected using the dye swap design. This design, however, is relatively expensive and labor-intensive. We propose a self-self hybridization design as an alternative to the dye swap design. In this design, the treated and control samples are labeled with Cy5 and Cy3 (or Cy3 and Cy5), respectively, without dye swap, along with a set of self-self hybridizations on the control sample. 
We compare this design with the dye swap design through investigation of mouse primary hepatocytes treated with three peroxisome proliferator-activated receptor-alpha (PPARalpha) agonists at three dose levels. Using Agilent's Whole Mouse Genome microarray, differentially expressed genes (DEG) were determined for both the self-self hybridization and dye swap designs. The DEG concordance between the two designs was over 80% across each dose treatment and chemical. Furthermore, 90% of DEG-associated biological pathways were in common between the designs, indicating that biological interpretations would be consistent. The reduced labor and expense for the self-self hybridization design make it an efficient substitute for the dye swap design. For example, in larger toxicogenomic studies, only about half the chips are required for the self-self hybridization design compared to that needed in the dye swap design. An experiment design type where the label orientations are reversed. exact synonym: flip dye, dye flip Philippe Rocca-Serra on behalf of MO MO_858 dye swap design time series design PMID: 14744830-Microarrays are powerful tools for surveying the expression levels of many thousands of genes simultaneously. They belong to the new genomics technologies which have important applications in the biological, agricultural and pharmaceutical sciences. There are myriad sources of uncertainty in microarray experiments, and rigorous experimental design is essential for fully realizing the potential of these valuable resources. Two questions frequently asked by biologists on the brink of conducting cDNA or two-colour, spotted microarray experiments are 'Which mRNA samples should be competitively hybridized together on the same slide?' and 'How many times should each slide be replicated?' 
Early experience has shown that whilst the field of classical experimental design has much to offer this emerging multi-disciplinary area, new approaches which accommodate features specific to the microarray context are needed. In this paper, we propose optimal designs for factorial and time course experiments, which are special designs arising quite frequently in microarray experimentation. Our criterion for optimality is statistical efficiency based on a new notion of admissible designs; our approach enables efficient designs to be selected subject to the information available on the effects of most interest to biologists, the number of arrays available for the experiment, and other resource or practical constraints, including limitations on the amount of mRNA probe. We show that our designs are superior to both the popular reference designs, which are highly inefficient, and to designs incorporating all possible direct pairwise comparisons. Moreover, our proposed designs represent a substantial practical improvement over classical experimental designs which work in terms of standard interactions and main effects. The latter do not provide a basis for meaningful inference on the effects of most interest to biologists, nor make the most efficient use of valuable and limited resources. Groups of assays that are related as part of a time series. PRS-AGB adding formal restriction on independent variable specification about time (march 2013) and making time series design class a defined class. 
Philippe Rocca-Serra on behalf of MO MO_887 time series design collecting specimen from organism taking a sputum sample from a cancer patient, taking the spleen from a killed mouse, collecting a urine sample from a patient a process with the objective to obtain a material entity that was part of an organism for potential future use in an investigation PERSON:Bjoern Peters IEDB collecting specimen from organism material component separation Using a cell sorter to separate a mixture of T cells into two fractions; one with surface receptor CD8 and the other lacking the receptor, or purification a material processing in which components of an input material become segregated in space Bjoern Peters IEDB material component separation group assignment Assigning 'to be treated with active ingredient' role to an organism during group assignment. The group is those organisms that have the same role in the context of an investigation group assignment is a process which has an organism as specified input and during which a role is assigned Philippe Rocca-Serra cohort assignment study assignment OBI Plan group assignment maintaining cell culture When harvesting blood from a human, isolating T cells, and then limited dilution cloning of the cells, the maintaining_cell_culture step comprises all steps after the initial dilution and plating of the cells into culture, e.g. placing the culture into an incubator, changing or adding media, and splitting a cell culture a protocol application in which cells are kept alive in a defined environment outside of an organism. 
part of cell_culturing PlanAndPlannedProcess Branch OBI branch derived maintaining cell culture 'establishing cell culture' a process through which a new type of cell culture or cell line is created, either through the isolation and culture of one or more cells from a fresh source, or the deliberate experimental modification of an existing cell culture (e.g. passaging a primary culture to become a secondary culture or line, or the immortalization or stable genetic modification of an existing culture or line). PERSON:Matthew Brush PERSON:Matthew Brush A 'cell culture' as used here refers to a new lineage of cells in culture deriving from a single biological source. New cultures are established through the initial isolation and culturing of cells from an organismal source, or through changes in an existing cell culture or line that result in a new culture with unique characteristics. This can occur through the passaging/selection of a primary culture into a secondary culture or line, or experimental modifications of an existing cell culture or line such as an immortalization process or other stable genetic modification. This class covers establishment of cultures of either multicellular organism cells or unicellular organisms. establishing cell culture addition of molecular label The addition of phycoerythrin label to an anti-CD8 antibody, to label all antibodies. The addition of anti-CD8-PE to a population of cells, to label the subpopulation of cells that are CD8+. 
a material processing technique intended to add a molecular label to some input material entity, to allow detection of the molecular target of this label in a detection of molecular label assay PERSON:Matthew Brush labeling OBI developer call, 3-12-12 addition of molecular label sequencing assay The use of the Sanger method of DNA sequencing to determine the order of the nucleotides in a DNA template the use of a chemical or biochemical means to infer the sequence of a biomaterial has_output should be sequence of input; we don't have sequence well defined yet PlanAndPlannedProcess Branch OBI branch derived sequencing assay recombinant vector cloning a planned process with the objective to insert genetic material into a cloning vector for future replication of the inserted material pa_branch (Alan, Randi, Kevin, Jay, Bjoern) molecular cloning OBI branch derived recombinant vector cloning RNA extraction An RNA extraction is a nucleic acid extraction where the desired output material is RNA PlanAndPlannedProcess Branch OBI branch derived requested by Helen Parkinson for MO RNA extraction nucleic acid extraction Phenol/chloroform extraction: dissolution of protein content followed by ethanol precipitation of the nucleic acid fraction overnight in the fridge, followed by centrifugation to obtain a nucleic acid pellet. a material separation to recover the nucleic acid fraction of an input material PlanAndPlannedProcess Branch OBI branch derived requested by Helen Parkinson for MO. Could be defined class nucleic acid extraction phage display library PMID: 15905471. Nucleic Acids Res. 2005 May 19;33(9):e81. Oligonucleotide-assisted cleavage and ligation: a novel directional DNA cloning technology to capture cDNAs. Application in the construction of a human immune antibody phage-display library. [Phage display library encoding fragments of human antibodies. 
mRNA library encoding 9-mer peptides] a phage display library is a collection of materials in which a mixture of genes or gene fragments is expressed and can be individually selected and amplified. PERSON: Bjoern Peters PERSON: Philippe Rocca-Serra display library WEB: http://www.immuneepitope.org/home.do PRS: 22022008. class moved under population, modification of definition and replacement of biomaterials in previous definition with 'material' addition of has_role restriction phage display library material to be added A mixture of peptides that is being added into a cell culture. a material that is added to another one in a material combination process 10/26/09: This defined class is used as a 'macro expression' to reduce the size of the IEDB export 2010/02/24 Alan Ruttenberg: I think this might generate confusion as the common use of the term would consider something to be a specimen during the realization of the role, not only if it bears it. However having this class as a probe, or for display, or as a macro might be useful. Ideally we would mark or segregate such classes IEDB material to be added target of material addition A cell culture into which a mixture of peptides is being added. A material entity into which another is being added in a material combination process 10/26/09: This defined class is used as a 'macro' to reduce the size of the IEDB export. IEDB target of material addition phenotype A (combination of) quality(ies) of an organism determined by the interaction of its genetic make-up and environment that differentiates specific instances of a species from other instances of the same species. phenotype fluorescence A luminous flux quality inhering in a bearer by virtue of the bearer's emitting longer wavelength light following the absorption of shorter wavelength radiation; fluorescence is common with aromatic compounds with several rings joined together. 
fluorescence mass A physical quality that inheres in a bearer by virtue of the proportion of the bearer's amount of matter. mass protein antithrombin III is a protein An amino acid chain that is produced de novo by ribosome-mediated translation of a genetically-encoded mRNA. protein molecular label role a reagent role inhering in a molecular entity intended to associate with some molecular target to serve as a proxy for the presence, abundance, or location of this target in a detection of molecular label assay. MHB (9-29-13): 'molecular label role' imported from the Reagent Ontology and replaced OBI:OBI_0000140 (label role) molecular tracer role OBI developer call, 3-12-12 molecular label role molecular label a molecular reagent intended to associate with some molecular target to serve as a proxy for the presence, abundance, or location of this target in a detection of molecular label assay molecular tracer OBI developer call, 3-12-12 molecular label region A sequence_feature with an extent greater than zero. A nucleotide region is composed of bases and a polypeptide region is composed of amino acids. 
primary structure of sequence macromolecule sequence region digital images may be stored as electronic files in TIFF format on mass memory storage devices an electronic file is an information content entity which conforms to a specification or format and which is meant to hold data and information in digital form, accessible to software agents Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO digital file a balanced design is an experimental design where all experimental groups have an equal number of subject observations Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO balanced design 1 a single factor design is a study design which declares exactly 1 independent variable Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO single factor design x-axis is a cartesian coordinate axis which is orthogonal to the y-axis and the z-axis Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO x-axis an axis is a line graph used as a reference line for the measurement of coordinates. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.oxforddictionaries.com/definition/english/axis axis y-axis is a cartesian coordinate axis which is orthogonal to the x-axis and the z-axis Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO y-axis A Cartesian coordinate system is a coordinate system that specifies each point uniquely in a plane by a pair of numerical coordinates, which are the signed distances from the point to two fixed perpendicular directed lines, measured in the same unit of length. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Cartesian_coordinate_system cartesian coordinate system In geometry, a coordinate system is a system which uses one or more numbers, or coordinates, to uniquely determine the position of a point or other geometric element on a manifold such as Euclidean space. 
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Coordinate_system coordinate system a cartesian axis is one of the 3 axes in a cartesian coordinate system defining a referential in 3 dimensions. Each of the axes is orthogonal to the other 2 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra rectangular coordinate axis adapted from Wolfram Alpha: https://www.wolframalpha.com/input/?i=cartesian+coordinates&lk=4&num=6&lk=4&num=6 cartesian coordinate axis z-axis is a cartesian coordinate axis which is orthogonal to the x-axis and the y-axis Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO z-axis a 2 dimensional cartesian coordinate system is a cartesian coordinate system which defines 2 orthogonal one dimensional axes and which may be used to describe a 2 dimensional spatial region. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra two dimensional cartesian coordinate system In mathematics, a spherical coordinate system is a coordinate system for three-dimensional space where the position of a point is specified by three numbers: the radial distance of that point from a fixed origin, its polar angle measured from a fixed zenith direction, and the azimuth angle of its orthogonal projection on a reference plane that passes through the origin and is orthogonal to the zenith, measured from a fixed reference direction on that plane. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra https://en.wikipedia.org/wiki/Spherical_coordinate_system spherical coordinate system A cylindrical coordinate system is a three-dimensional coordinate system that specifies point positions by the distance from a chosen reference axis, the direction from the axis relative to a chosen reference direction, and the distance from a chosen reference plane perpendicular to the axis. The latter distance is given as a positive or negative number depending on which side of the reference plane faces the point.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra https://en.wikipedia.org/wiki/Cylindrical_coordinate_system cylindrical coordinate system In mathematics, the polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a fixed point and an angle from a fixed direction. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Polar_coordinate_system polar coordinate system Wilks' lambda distribution (named for Samuel S. Wilks) is a probability distribution used in multivariate hypothesis testing, especially with regard to the likelihood-ratio test and multivariate analysis of variance. It is a multivariate generalization of the univariate F-distribution, and generalizes the F-distribution in the same way that the Hotelling's T-squared distribution generalizes Student's t-distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra wikipedia: last accessed: 2013-09-11 http://en.wikipedia.org/wiki/Wilks%27_lambda_distribution Wilks' lambda distribution A cartesian spatial coordinate datum chosen as a fixed point of reference in a three dimensional spatial region. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra three dimensional cartesian spatial coordinate origin normal distribution hypothesis is a goodness of fit hypothesis stating that the distribution computed from the sample population fits a normal distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO normal distribution hypothesis A cartesian spatial coordinate datum chosen as a fixed point of reference in a two dimensional spatial region.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra two dimensional cartesian spatial coordinate origin 90 a confidence interval which covers 90% of the sampling distribution, meaning that there is a 10% risk of false positive (type I error) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO 90% confidence interval A one dimensional cartesian coordinate system is a cartesian coordinate system which defines a one dimensional axis and which may be used to describe a one dimensional spatial region, i.e. a straight line. It is defined by a point O, the origin, a unit of length and the orientation for the one dimensional space. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra one dimensional cartesian coordinate system http://www.stat.duke.edu/courses/Spring98/sta110c/qtable.html The studentized range (q) distribution is a probability distribution used by the Tukey Honestly Significant Difference test. It is the distribution of the statistic [x̄(k) - x̄(1)]/(s/√n) where random samples of size n have been taken from k independent and identically distributed normal populations, with x̄(1) and x̄(k) being, respectively, the smallest and largest of the k sample means, and s² being the pooled estimate of the common variance. This statistic is particularly used in multiple comparison tests. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra q distribution A Dictionary of Statistics (2 rev ed.), OUP. ISBN-13: 9780199541454 http://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1588 http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Tukey.html studentized range distribution a three dimensional cartesian coordinate system is a cartesian coordinate system which defines 3 orthogonal one dimensional axes and which may be used to describe a 3 dimensional spatial region.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra three dimensional cartesian coordinate system A cartesian spatial coordinate datum chosen as a fixed point of reference in a one dimensional spatial region. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra one dimensional cartesian spatial coordinate origin A cartesian spatial coordinate datum chosen as a fixed point of reference in a spatial region. placeholder, more work needed Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra cartesian spatial coordinate origin a linkage between 2 categorical variables test is a statistical test which evaluates whether there is an association between a predictor variable assuming discrete values and a response variable also assuming discrete values Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra test of association STATO test of independence test of independence between variables test of association between categorical variables measure of variation or statistical dispersion is a data item which describes how much a theoretical distribution or dataset is spread. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO measure of dispersion measure of variation measure of variation a measure of central tendency is a data item which attempts to describe a set of data by identifying the value of its centre. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra measure of central tendency measure of central tendency Chi-squared statistic is a statistic computed from observations and used to produce a p-value in a statistical test when compared to a Chi-Squared distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO Chi-Squared statistic binary classification (or binomial classification) is a data transformation which aims to cast members of a set into 2 disjoint groups depending on whether the elements have a given property/feature or not.
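The binary classification transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of STATO: the threshold property, the `binary_classify` name, and the data are all assumptions chosen for the example.

```python
# Minimal sketch of binary classification as a data transformation:
# members of a set are cast into 2 disjoint groups according to whether
# they have a given property (here, illustratively: value > threshold).

def binary_classify(values, threshold=10):
    """Split values into two disjoint groups by a property test."""
    positives = [v for v in values if v > threshold]
    negatives = [v for v in values if v <= threshold]
    return positives, negatives

pos, neg = binary_classify([3, 12, 7, 25, 10])
print(pos)  # elements with the property
print(neg)  # elements without it
```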
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/Binary_classifier last accessed: 2013-11-21 binomial classification binary classification The mode is a data item which corresponds to the most frequently occurring number in a set of numbers. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.sagepub.com/upm-data/47775_ch_3.pdf mode scipy.stats.mode(a, axis=0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html#scipy.stats.mode source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L586 mode a model parameter is a data item which is part of a model and which is meant to characterize a theoretical or unknown population. a model parameter may be estimated by considering the properties of samples presumably taken from the theoretical population Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO model parameter the range is a measure of variation which describes the difference between the lowest score and the highest score in a set of numbers (a data set) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.sagepub.com/upm-data/47775_ch_3.pdf range(..., na.rm = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/base/html/range.html range Outliers are deviant scores that have been legitimately gathered and are not due to equipment failures. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.sagepub.com/upm-data/47775_ch_3.pdf outlier http://stats.stackexchange.com/questions/50623/r-calculating-mean-and-standard-error-of-mean-for-factors-with-lm-vs-direct The standard error of the mean (SEM) is a data item denoting the standard deviation of the sample-mean's estimate of a population mean.
It is calculated by dividing the sample standard deviation (i.e., the sample-based estimate of the standard deviation of the population) by the square root of n, the size (number of observations) of the sample. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra SEM adapted from wikipedia (https://en.wikipedia.org/wiki/Standard_error) scipy.stats.sem(a, axis=0, ddof=1) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html#scipy.stats.sem source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L1928 standard error of the mean a set of 2 subjects which result from a pairing process which assigns subjects to a set based on a pairing rule/criteria possibly submit to 'Population and Community Ontology' Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO matched pair of subjects a statistic is a measurement datum which describes a dataset or a variable. It is generated by a calculation on a set of observed data. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Statistic). statistic statistic an MA plot is a scatter plot of the log intensity ratios M = log_2(T/R) versus the average log intensities A = log_2(T*R)/2, where T and R represent the signal intensities in the test and reference channels respectively. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra M vs A plot http://www.stat.berkeley.edu/users/terry/zarray/Software/SMAcode/html/plot.mva.html MA plot plot.mva() MA plot 1 The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
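The SEM calculation defined above (sample standard deviation divided by √n) can be checked against the `scipy.stats.sem` implementation cited in the entry. A minimal sketch with illustrative data:

```python
import math
from scipy import stats

# Illustrative sample (assumption for the example, not from STATO).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Manual SEM: sample standard deviation (ddof=1) divided by sqrt(n).
n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
sem_manual = sd / math.sqrt(n)

# scipy's implementation uses ddof=1 by default and should agree.
sem_scipy = stats.sem(sample)
print(sem_manual, sem_scipy)
```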
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Anderson_Darling_test ad.test(x) function, where x is a numeric vector scipy.stats.anderson(x, dist='norm') http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html#scipy.stats.anderson source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1017 Anderson-Darling test true true 1 1 one-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of only one independent variable. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and homogeneity of variance of the data. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra one factor ANOVA STATO http://statland.org/R/R/R1way.htm http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html#scipy.stats.f_oneway one-way ANOVA true true 1 2 two-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of exactly 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and homogeneity of variance of the data.
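The one-way ANOVA entry above cites `scipy.stats.f_oneway`; a minimal sketch of its use follows. The three groups and their values are illustrative assumptions, not data from the ontology.

```python
from scipy import stats

# One independent variable (group membership) with three factor levels;
# the null hypothesis is equality of the group means.
group_a = [4.1, 5.2, 4.8, 5.0]
group_b = [6.3, 6.9, 6.1, 6.6]
group_c = [4.9, 5.1, 5.3, 4.7]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # reject the null at alpha = 0.05 if p_value < 0.05
```

Remember that, as the entry notes, the test assumes normality and homogeneity of variance within the groups.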
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra two factor ANOVA STATO http://courses.statistics.com/software/R/Rtwoway.htm two-way ANOVA a block design is a kind of study design which declares a blocking variable (also known as a nuisance variable) in order to account for a known source of variation and reduce its impact on the acquisition of the signal Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from several sources including Wikipedia block design 1 a count of 4 resulting from counting limbs in humans a count is a data item denoted by an integer and representing the number of instances or occurrences of an entity Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO count true true 1 3 Multi-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of more than 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and homogeneity of variance of the data. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO multiway ANOVA http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581961/ Hardy-Weinberg equilibrium hypothesis is a goodness of fit hypothesis which states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences (non-random mating, mutation, selection, genetic drift, gene flow and meiotic drive). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Hardy–Weinberg_principle) Hardy-Weinberg equilibrium hypothesis signal to noise ratio is a measurement datum comparing the amount of meaningful, useful or interesting data (the signal) to the amount of irrelevant or false data (the noise).
Depending on the field and domain of application, different variables will be used to determine a 'signal to noise ratio'. In statistics, the definition of signal to noise ratio is the ratio of the mean of a measurement to its standard deviation. It thus corresponds to the inverse of the coefficient of variation Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia: http://en.wikipedia.org/wiki/Signal-to-noise_ratio#Alternative_definition last accessed: 2013-10-18 S/N SNR http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.signaltonoise.html#scipy.stats.signaltonoise signal to noise ratio Poisson distribution is a probability distribution used to model the number of events occurring within a given time interval. It is defined by a real number (λ) and evaluated at an integer k representing the number of events. The expected value of a Poisson-distributed random variable is equal to λ and so is its variance. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra dpois(x, lambda, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Poisson.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html#scipy.stats.poisson NIST: http://www.itl.nist.gov/div898/handbook/eda/section3/eda366j.htm Poisson distribution true Z-test is a statistical test which evaluates the null hypothesis that the means of 2 populations are equal and returns a p-value. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://reference.wolfram.com/mathematica/ref/ZTest.html simple.z.test(x, sigma, conf.level=0.95) http://www.inside-r.org/packages/cran/UsingR/docs/simple.z.test Z-test a false positive rate is a data item which accounts for the proportion of incorrect rejections of a true null hypothesis.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra PRS,AGB adapted from wikipedia and wolfram alpha significance level type I error rate α false positive rate homoskedasticity states that all variances under consideration are homogeneous. definition edited according to the discussion documented in: https://github.com/ISA-tools/stato/issues/39 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra equality of variance STATO homoskedasticity hypothesis http://www.ncbi.nlm.nih.gov/assembly/model/ chrX:35,000,000-36,000,000. chromosome coordinate system is a genomic coordinate system which uses chromosomes of a particular assembly build to define start and end positions. This coordinate system is unstable and will change with each new genome sequence assembly build. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra chromosome coordinate system a null hypothesis which states that no linkage exists between 2 categorical variables Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO no relationship between the variables variables are independent absence of association hypothesis A null hypothesis is a statistical hypothesis that is tested for possible rejection under the assumption that it is true (usually that observations are the result of chance). The concept was introduced by R. A. Fisher.
The hypothesis contrary to the null hypothesis, usually that the observations are the result of a real effect, is known as the alternative hypothesis.[wolfram alpha] Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://mathworld.wolfram.com/NullHypothesis.html null hypothesis goodness of fit hypothesis is a null hypothesis stating that the distribution computed from the sample population fits a theoretical distribution or that a dataset can be correctly explained by a model Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO goodness of fit hypothesis 0 the Student's t distribution is a continuous probability distribution which arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Student's_t-distribution) t distribution dt(x, df, ncp, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/TDist.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html#scipy.stats.t Student's t distribution hypergeometric distribution is a probability distribution that describes the probability of k successes in n draws from a finite population of size N containing K successes without replacement Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Hypergeometric_distribution dhyper(x, m, n, k, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Hypergeometric.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html#scipy.stats.hypergeom hypergeometric distribution It is a null hypothesis stating that there are no differences observed between groups of subjects.
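The hypergeometric entry above cites `scipy.stats.hypergeom`; note that scipy's parameter order differs from the letters used in the definition (scipy's M is the population size N above, scipy's n is the number of successes K, and scipy's N is the number of draws n). A minimal sketch with illustrative values:

```python
from scipy import stats

# Illustrative values: population of 20 containing 7 successes,
# from which 12 items are drawn without replacement.
M, n, N = 20, 7, 12          # scipy order: population, successes, draws
dist = stats.hypergeom(M, n, N)

print(dist.pmf(4))           # probability of exactly 4 successes in the draw
print(dist.mean())           # equals N * n / M
```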
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO absence of between group difference hypothesis is a null hypothesis stating that there are no differences observed across a series of measurements made on the same subject. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO absence of within subject difference hypothesis genomic coordinate datum is a data item which denotes a genomic position expressed using a genomic coordinate system Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO genomic coordinate datum http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/ sequence read count is a data item determining how many sequence reads generated by a DNA sequencing assay for a given stretch of DNA can be counted Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB-PRS, STATO sequence read count In statistics, a statement that can be tested.[wolfram alpha] Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO http://mathworld.wolfram.com/Hypothesis.html hypothesis Cleveland dot plot is a dot plot which plots points that each belong to one of several categories. They are an alternative to bar charts or pie charts, and look somewhat like a horizontal bar chart where the bars are replaced by dots at the values associated with each category. Compared to (vertical) bar charts and pie charts, Cleveland argues that dot plots allow more accurate interpretation of the graph by readers by making the labels easier to read, reducing non-data ink (or graph clutter) and supporting table look-up. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia: http://en.wikipedia.org/wiki/Dot_plot_(statistics) and Cleveland, William S. (1993). Visualizing Data. Hobart Press. ISBN 0-9634884-0-6. hdl:2027/mdp.39015026891187.
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/dotchart.html dotchart(x, labels = NULL, groups = NULL, gdata = NULL, cex = par("cex"), pch = 21, gpch = 21, bg = par("bg"), color = par("fg"), gcolor = par("fg"), lcolor = "gray", xlim = range(x[is.finite(x)]), main = NULL, xlab = NULL, ylab = NULL, ...) Cleveland dot plot a continuous probability distribution is a probability distribution which is defined by a probability density function Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia http://en.wikipedia.org/wiki/Probability_distribution#Continuous_probability_distribution last accessed: 14/01/2014 continuous probability distribution Skewness is a data item indicating the degree of asymmetry of a distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://mathworld.wolfram.com/Skewness.html skewness(x, na.rm = FALSE, type = 3) http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/e1071/html/skewness.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html#scipy.stats.skew skewness The number of degrees of freedom is a count evaluating the number of values in a calculation that can vary. In statistics, the number of degrees of freedom ν is equal to N-1 in the case of the direct measurement of a quantity estimated by the arithmetic mean of N independent observations. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://stats.stackexchange.com/questions/16921/how-to-understand-degrees-of-freedom http://www.optique-ingenieur.org/en/courses/OPI_ang_M07_C01/co/Contenu_07.html the rank of the quadratic form (mathematical definition) number of degrees of freedom 2 Yates's corrected Chi-Squared test is a statistical test which is used to test the association/linkage/independence of 2 dichotomous variables while introducing a correction for using the continuous Chi-squared distribution for the test.
To reduce the error in approximation, Frank Yates, an English statistician, suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its p-value. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Yates's_correction_for_continuity) polled in June 2013 Yates's correction for continuity chisq.test(x, y = NULL, correct = TRUE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html Yates's corrected Chi-Squared test reaction rate is a measurement datum which represents the speed of a chemical reaction turning reactive species into product species (i.e. the number of such conversions occurring over a time interval) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO reaction rate substrate concentration is a scalar measurement datum which denotes the amount of molecular entity involved in an enzymatic reaction (or catalytic chemical reaction) and whose role in that reaction is as substrate. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO substrate concentration 1 2 2 5 Fisher's exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables. duplicate with OBI_0200176.
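The effect of Yates's continuity correction described above (a smaller chi-squared statistic, hence a larger p-value) can be observed with `scipy.stats.chi2_contingency`, whose `correction` flag mirrors R's `chisq.test(x, correct = TRUE)`. The 2 × 2 table below is an illustrative assumption.

```python
from scipy import stats

# Illustrative 2 x 2 contingency table of observed counts.
table = [[12, 5],
         [7, 16]]

# With and without Yates's continuity correction.
chi2_corr, p_corr, dof, expected = stats.chi2_contingency(table, correction=True)
chi2_plain, p_plain, _, _ = stats.chi2_contingency(table, correction=False)

print(dof)                      # 1 degree of freedom for a 2 x 2 table
print(chi2_corr, chi2_plain)    # the corrected statistic is smaller
print(p_corr, p_plain)          # and its p-value correspondingly larger
```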
so either MIREOT and add metadata and axioms or move from OBI Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://mathworld.wolfram.com/FishersExactTest.html fisher.test(x) function, where x is a matrix scipy.stats.fisher_exact(table, alternative='two-sided') http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2485 Fisher's exact test true 2 1 1 2 2 Cochran-Mantel-Haenszel test for repeated tests of independence is a statistical test which allows the comparison of two groups on a dichotomous/categorical response. It is used when the effect of the explanatory variable on the response variable is influenced by covariates that can be controlled. It is often used in observational studies where random assignment of subjects to different treatments cannot be controlled, but influencing covariates can. The null hypothesis is that the two nominal variables that are tested within each repetition are independent of each other. So there are 3 variables to consider: two categorical variables to be tested for independence of each other, and the third variable identifies the repeats.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from wikipedia (http://en.wikipedia.org/wiki/Cochran–Mantel–Haenszel_statistics) and from the Handbook of Biological Statistics (http://udel.edu/~mcdonald/statcmh.html) CMH test Mantel–Haenszel test cmh.test(x,y,z) Cochran-Mantel-Haenszel test for repeated tests of independence a rarefaction curve is a graph used for estimating species richness in ecology studies Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO >library(vegan) >rarefaction(x, subsample=5, plot=TRUE, color=TRUE, error=FALSE, legend=TRUE, symbol) http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/vegan/html/vegan-package.html rarefaction curve 1 1 2 1 The Mann-Whitney U-test is a null hypothesis statistical testing procedure which allows two groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Mann-Whitney test is the non-parametric equivalent of the t-test for independent samples Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra U test Wilcoxon rank-sum test rank-sum test for the comparison of two samples adapted from http://udel.edu/~mcdonald/statkruskalwallis.html and from http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U last accessed [2014-03-04] Wilcoxon Rank-Sum test wilcox.test(dependent variable ~ independent variable, data = dataset) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html scipy.stats.mannwhitneyu(x, y, use_continuity=True) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4049 scipy.stats.ranksums(x, y) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ranksums.html#scipy.stats.ranksums source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4103 Mann-Whitney U-test Shapiro-Wilk test is a goodness of fit test which
evaluates the null hypothesis that the sample is drawn from a population following a normal distribution Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra S-W test STATO, adapted from wikipedia (https://en.wikipedia.org/wiki/Shapiro–Wilk_test) shapiro.test(x) function, where x is a numeric vector https://stat.ethz.ch/R-manual/R-devel/library/stats/html/shapiro.test.html scipy.stats.shapiro(x, a=None, reta=False) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html#scipy.stats.shapiro source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L944 Shapiro-Wilk test Levene's test is a null hypothesis statistical test which evaluates the null hypothesis of equality of variance in several populations. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Levene_test levene.test(x) function, where x is a numeric vector scipy.stats.levene(*args, **kwds) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html#scipy.stats.levene source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1496 Levene's test Bartlett's test (see Snedecor and Cochran, 1989) is used to test if k samples are from populations with equal variances. Equal variances across samples is called homoscedasticity or homogeneity of variances. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption. Bartlett's test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality. 
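The two homogeneity-of-variance tests described above can be run side by side via the scipy functions the entries cite: Bartlett's test (sensitive to departures from normality) and Levene's test (more robust). The groups below are illustrative assumptions.

```python
from scipy import stats

# Three illustrative groups; the null hypothesis for both tests is
# that the groups have equal variances (homoscedasticity).
g1 = [8.9, 9.1, 8.8, 9.0, 9.2]
g2 = [8.1, 9.9, 7.6, 10.3, 8.5]
g3 = [9.0, 8.7, 9.3, 8.9, 9.1]

bart_stat, bart_p = stats.bartlett(g1, g2, g3)
lev_stat, lev_p = stats.levene(g1, g2, g3)
print(bart_p, lev_p)  # small p-values suggest unequal variances
```

Running both, as suggested by the Bartlett entry, helps distinguish genuine heteroscedasticity from mere non-normality.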
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Bartlett_test bartlett.test(x) function, where x is a numeric vector scipy.stats.bartlett(*args) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html#scipy.stats.bartlett source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1450 Bartlett's test the Brown-Forsythe test is a statistical test which evaluates if the variances of different groups are equal. It relies on computing the median rather than the mean, as used in Levene's test for homoscedasticity. This test may be used, for instance, to ensure that the conditions of application of ANOVA are met. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia and Brown, M. B., and A. B. Forsythe. 1974a. The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129-132. http://www.statmethods.net/stats/anovaAssumptions.html The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on Brown-Forsythe. In the following example, y is numeric and G is a grouping factor. Note that G must be of type factor. # Homogeneity of Variance Plot library(HH) hov(y~G, data=mydata) hovPlot(y~G,data=mydata) Brown-Forsythe test 2 Pearson's Chi-Squared test is a statistical null hypothesis test which is used to either evaluate goodness of fit of a dataset to a Chi-Squared distribution or to test independence of 2 categorical variables (i.e. absence of association between those variables).
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Chi2 test for independence adapted from: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html and http://en.wikipedia.org/wiki/Pearson's_chi-squared_test http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000) http://www.inside-r.org/packages/cran/nortest/docs/pearson.test pearson.test(x) function, where x is a numeric vector http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency Pearson's Chi square test of independence between categorical variables 2 1 1 a fixed effect model is a statistical model which represents the observed quantities in terms of explanatory variables that are treated as if the quantities were non-random. PRS: this is a stub and more work is needed to reconcile conflicting definitions Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/Fixed_effects_model fixed effect model Kolmogorov-Smirnov test is a goodness of fit test which evaluates the null hypothesis that a sample is drawn from a population that follows a specific continuous probability distribution. 
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra K-S test STATO, adapted from wikipedia (https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test) http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm ks.test(dataset, distribution) scipy.stats.kstwobign http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstwobign.html#scipy.stats.kstwobign source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/_continuous_distns.py scipy.stats.mstats.ks_twosamp(data1, data2, alternative='two-sided') http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.ks_twosamp.html source code: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/mstats_basic.py#L821 Kolmogorov-Smirnov test multinomial probit regression model is a model which attempts to explain the data distribution associated with a *polychotomous* response/dependent variable in terms of values assumed by the independent variable, using a function of the predictor/independent variable(s): the function used in this instance of regression modeling is the probit function. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Multinomial_probit) polled in June 2013 http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf multinomial probit regression for analysis of polychotomous dependent variable http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/ effect size estimate is a data item about the direction and strength of the consequences of a causative agent as explored by statistical methods. Those methods produce estimates of the effect size, e.g. a confidence interval Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB,PRS effect size effect size estimate an F-test is a statistical test which evaluates whether the computed test statistic follows an F-distribution under the null hypothesis.
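The one-sample and two-sample Kolmogorov-Smirnov usages mentioned above can be sketched in SciPy as follows (the samples here are illustrative, made-up values):

```python
from scipy import stats

# One-sample K-S test of a fixed sample against the standard normal
sample = [-1.2, -0.5, -0.1, 0.0, 0.3, 0.7, 1.1, 1.8]
d_stat, p_value = stats.kstest(sample, 'norm')

# Two-sample K-S test comparing two fixed samples
other = [0.1, 0.2, 0.4, 0.9, 1.5, 2.0, 2.2, 3.0]
d2, p2 = stats.ks_2samp(sample, other)
```

The statistic D is the maximum distance between the empirical and reference cumulative distribution functions, so it always lies between 0 and 1.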
The F-test is sensitive to departure from normality. F-tests arise when decomposing the variability in a data set in terms of sums of squares. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO F-test 2 a polychotomous variable is a categorical variable which is defined to have at least 2 categories or possible values Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO http://udel.edu/~mcdonald/statvartypes.html polychotomous variable statistical sample size is a count evaluating the number of individual experimental units Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB-PRS statistical sample size study group population size 2 1 a case-control study design is an observational study design which assesses the risk of a particular outcome (a trait or a disease) associated with an event (either an exposure or an endogenous factor). A case-control study design therefore declares an exposure variable which is dichotomous in nature (exposed/non-exposed) and an outcome variable, which is also dichotomous (case or control), thus giving the name to the design. During the execution of the design, a case-control study defines a population and counts the events to determine their frequency. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from: http://www.drcath.net/toolkit/casecontrol.html case-control study design 2 a dichotomous variable is a categorical variable which is defined to have only 2 categories or possible values Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB-PRS http://udel.edu/~mcdonald/statvartypes.html 'has part' exactly 1 ('categorical measurement datum' and ('has category label' exactly 2 'categorical label')) dichotomous variable Genome wide association study is a kind of study whose objective is to detect associations between genetic markers (SNP or otherwise) across the genome and a trait which may be a disease or another phenotype (e.g.
trait of agronomic relevance in animal or plant studies). Genome wide association studies compare the allele frequencies in 2 populations, one free of the trait, used as control, the other one showing the trait, used as 'case'. GWAS studies implement a case-control design Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB, PRS GWAS study whole genome association study genome-wide association study 1 2 The Wilcoxon signed rank test is a statistical test which tests the null hypothesis that the median difference between pairs of observations is zero. This is the non-parametric analogue to the paired t-test, and should be used if the distribution of differences between pairs may be non-normally distributed. The procedure involves a ranking, hence the name. The absolute values of the differences between observations are ranked from smallest to largest, with the smallest difference getting a rank of 1, the next larger difference getting a rank of 2, etc. Ties are given average ranks. The ranks of all differences in one direction are summed, and the ranks of all differences in the other direction are summed. The smaller of these two sums is the test statistic, W (sometimes symbolized Ts). Unlike most test statistics, smaller values of W are less likely under the null hypothesis. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://udel.edu/~mcdonald/statsignedrank.html signrank() scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html#scipy.stats.wilcoxon source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4103 Wilcoxon signed rank test Information about a calendar date or timestamp indicating day, month, year and time of an event.
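The scipy.stats.wilcoxon signature quoted above can be applied to paired data directly; a minimal sketch with made-up before/after measurements:

```python
from scipy.stats import wilcoxon

# Illustrative paired before/after measurements on the same subjects
before = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
after  = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]

# Default zero_method='wilcox' discards pairs whose difference is zero
# (here the pair 140/140), as in the classic procedure
w_stat, p_value = wilcoxon(before, after)
```

A small p-value rejects the null hypothesis that the median paired difference is zero.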
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO date 1 2 1 1 The Kruskal–Wallis test is a null hypothesis statistical testing objective which allows multiple (n>=2) groups (or conditions or treatments) to be compared, without making the assumption that values are normally distributed. The Kruskal–Wallis test is the non-parametric equivalent of the independent samples ANOVA. The Kruskal–Wallis test is most commonly used when there is one nominal variable and one measurement variable, and the measurement variable does not meet the normality assumption of an ANOVA. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra H test rank-sum test for the comparison of multiple (more than 2) samples. http://udel.edu/~mcdonald/statkruskalwallis.html kruskal.test() http://stat.ethz.ch/R-manual/R-patched/library/stats/html/kruskal.test.html scipy.stats.mstats.kruskalwallis(*args) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.kruskalwallis.html source code: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/mstats_basic.py#L800 Kruskal Wallis test true 1 true 1 paired t-test is a statistical test which is specifically designed to analyse differences between paired observations in the case of studies realizing a repeated measures design with only 2 repeated measurements per subject (before and after treatment, for example) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://udel.edu/~mcdonald/statpaired.html http://udel.edu/~mcdonald/statsignedrank.html t-test for dependent means t-test for repeated measures http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html t.test(dependent variable ~ independent variable, data = dataset, var.equal = FALSE, paired = TRUE) scipy.stats.ttest_rel(a, b, axis=0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html#scipy.stats.ttest_rel source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3389 paired t-test 2 1
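Both tests cited above are available in scipy.stats; a minimal sketch on made-up data, using kruskal (the non-masked-array counterpart of mstats.kruskalwallis) and ttest_rel:

```python
from scipy import stats

# Kruskal-Wallis: compare three illustrative groups without assuming normality
g1 = [2.9, 3.0, 2.5, 2.6, 3.2]
g2 = [3.8, 2.7, 4.0, 2.4]
g3 = [2.8, 3.4, 3.7, 2.2, 2.0]
h_stat, kw_p = stats.kruskal(g1, g2, g3)

# Paired t-test: two repeated measurements per subject (before/after)
before = [5.1, 4.9, 6.0, 6.3, 5.8]
after  = [5.6, 5.2, 6.4, 6.2, 6.3]
t_stat, t_p = stats.ttest_rel(before, after)
```

Here t_stat is negative because the mean of the before-minus-after differences is negative.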
stratification is a planned process which executes a stratification rule using as input a population and assigns its members to mutually exclusive subpopulations based on the values defined by the stratification rule Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra PRS+AGB adapted from wikipedia: http://en.wikipedia.org/wiki/Stratified_sampling polled on June 7th,2013 stratifying population population stratification prior to sampling A statistical test power analysis is a data transformation which aims to determine the size of a statistical sample required to reach a desired significance level given a particular statistical test Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.statmethods.net/stats/power.html statistical test power analysis 2 2 http://arxiv.org/pdf/1007.1094.pdf Hotelling's T2 test is a statistical test which is a generalization of Student's t-test to assess if the means of a set of variables remain unchanged when studying 2 populations. It is a type of multivariate analysis Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/T2.test.html two sample Hotelling T2 test 1 1 a random effect(s) model, also called a variance components model, is a kind of hierarchical linear model. It assumes that the dataset being analysed consists of a hierarchy of different populations whose differences relate to that hierarchy. PRS: this is a stub and more work is needed to reconcile conflicting definitions Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra variance components model adapted from wikipedia: http://en.wikipedia.org/wiki/Random_effects_model#Qualitative_description random effect model 2 standardized mean difference is a data item computed by forming the difference between two means, divided by an estimate of the within-group standard deviation.
It is used to provide an estimation of the effect size between two treatments when the predictor (independent variable) is categorical and the response (dependent) variable is continuous Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra SMD adapted from "Effect size, confidence interval and statistical significance: a practical guide for biologists" Nakagawa and Cuthill DOI: 10.1111/j.1469-185X.2007.00027.x adapted from http://htaglossary.net/standardised+mean+difference+(SMD) Cohen's d statistic standardized mean difference the multinomial distribution is a probability distribution which gives the probability of any particular combination of numbers of successes for various categories defined in the context of n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from http://mathworld.wolfram.com/MultinomialDistribution.html and http://en.wikipedia.org/wiki/Multinomial_distribution dmultinom(x, size = NULL, prob, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Multinom.html multinomial distribution A z-score (also known as z-value, standard score, or normal score) is a measure of the divergence of an individual experimental result from the most probable result, the mean. Z is expressed in terms of the number of standard deviations from the mean value.
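The standardized mean difference (Cohen's d) described above is straightforward to compute; a minimal NumPy sketch using a pooled within-group standard deviation (the data values are made-up illustrations):

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled within-group SD."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    # Pool the unbiased group variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

d = cohens_d([10, 12, 11, 13, 12], [8, 9, 10, 9, 8])  # d = 2.8
```

Because the difference in means is divided by a standard deviation, d is unitless and comparable across studies, which is why it serves as an effect size estimate in meta-analysis.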
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra https://controls.engin.umich.edu/wiki/index.php/Basic_statistics:_mean,_median,_average,_standard_deviation,_z-scores,_and_p-value#Z-Scores normal score standard score scipy.stats.zscore(a, axis=0, ddof=0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html#scipy.stats.zscore source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L1977 z-score log signal intensity ratio is a data item which corresponds to the logarithm base 2 of the ratio between 2 signal intensities, each corresponding to a condition. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/MA_plot last accessed: 2014-03-13 M-value log signal intensity ratio probit regression model is a model which attempts to explain the data distribution associated with a *dichotomous* response/dependent variable in terms of values assumed by the independent variable, using a function of the predictor/independent variable(s): the function used in this instance of regression modeling is the probit function, aka the quantile function, i.e., the inverse cumulative distribution function (CDF), associated with the standard normal distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Probit_model) polled in June 2013 probit regression for analysis of dichotomous dependent variable a statistical model is an information content entity which is a formalization of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related.
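The scipy.stats.zscore call quoted above standardizes each value by the mean and standard deviation of the sample; a minimal sketch on made-up data:

```python
import numpy as np
from scipy import stats

# Illustrative sample: mean is 5, population SD (ddof=0) is 2
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# zscore uses ddof=0 by default, i.e. the population standard deviation
z = stats.zscore(data)
# Each z-value counts standard deviations away from the mean:
# 2.0 is 1.5 SDs below, 9.0 is 2 SDs above
```

Passing ddof=1 would instead use the sample standard deviation, matching R's scale().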
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia: http://en.wikipedia.org/wiki/Statistical_model last accessed: 14/01/2014 statistical model linear regression model is a model which attempts to explain the data distribution associated with the response/dependent variable in terms of values assumed by the independent variable, using a linear function or linear combination of the regression parameters and the predictor/independent variable(s). Linear regression modeling makes a number of assumptions, which include homoskedasticity (constancy of variance) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Linear_regression) polled in June 2013 linear regression for analysis of continuous dependent variable multinomial logistic regression model is a model which attempts to explain the data distribution associated with a *polychotomous* response/dependent variable in terms of values assumed by the independent variable, using a function of the predictor/independent variable(s): the function used in this instance of regression modeling is the logistic function. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Multinomial_logistic_regression) polled in June 2013 http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf multinomial logistic regression for analysis of polychotomous dependent variable a sequence read is DNA sequence data which is generated by a DNA sequencer Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra sequence read a Funnel plot is a scatter plot of treatment effect versus a measure of study size and aims to provide a visual aid to detecting bias or systematic heterogeneity. A symmetric inverted funnel shape arises from a ‘well-behaved’ data set, in which publication bias is unlikely.
An asymmetric funnel indicates a relationship between treatment effect and study size. Known caveats: If high precision studies really are different from low precision studies with respect to effect size (e.g., due to different populations examined) a funnel plot may give a wrong impression of publication bias. The appearance of the funnel plot can change quite dramatically depending on the scale of the y-axis, whether it is the inverse square error or the trial size. The funnel plot was introduced by Light and Pillemer in 1984. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia: http://en.wikipedia.org/wiki/Funnel_plot Funnel plot variance is a data item about a random variable or probability distribution. It is equivalent to the square of the standard deviation. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value). The variance is the second moment of a distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra σ2 var(x, y = NULL, na.rm = FALSE, use) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html variance the process of using statistical analysis for interpreting and communicating "what the data say". Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra From "The strength of statistical evidence" by Richard Royall.
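The variance definition above hides one practical detail worth making explicit: NumPy's default divides by n (the second moment), while R's var() divides by n - 1. A minimal sketch with made-up data:

```python
import numpy as np

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# ddof=0 (default): population variance, sum of squared deviations / n
pop_var = np.var(x)

# ddof=1: unbiased sample variance, / (n - 1), matching R's var()
sample_var = np.var(x, ddof=1)
```

Both are the square of the corresponding standard deviation (np.std with the same ddof).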
https://www.stat.fi/isi99/proceedings/arkisto/varasto/roya0578.pdf assess statistical evidence a discrete probability distribution is a probability distribution which is defined by a probability mass function where the random variable can only assume a finite number of values or an infinitely countable number of values Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia http://en.wikipedia.org/wiki/Probability_distribution#Discrete_probability_distribution last accessed: 14/01/2014 discrete probability distribution ranking is a data transformation which turns a non-ordinal variable into an ordinal variable by sorting the values of the input variable and replacing their values by their positions in the sorting result Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO ranking model parameter estimation is a data transformation that finds parameter values (the model parameter estimates) most compatible with the data as judged by the model. textual definition modified following contribution by Thomas Nichols: https://github.com/ISA-tools/stato/issues/18 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra model parameter estimation http://www.r-bloggers.com/boxplots-beyond-iv-beanplots/ beanplot is a plot in which (one or) multiple batches ("beans") are shown. Each bean consists of a density trace, which is mirrored to form a polygon shape. Next to that, a one-dimensional scatter plot shows all the individual measurements, like in a stripchart. The name beanplot stems from green beans. The density shape can be seen as the pod of a green bean, while the scatter plot shows the seeds inside the pod. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.jstatsoft.org/v28/c01/paper http://cran.r-project.org/web/packages/beanplot/index.html bean plot the objective of a data transformation which is to evaluate a null hypothesis of absence of association between variables.
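The ranking transformation described above, including the averaging of tied ranks used by the rank-based tests in this ontology, is available as scipy.stats.rankdata; a minimal sketch on made-up values:

```python
from scipy.stats import rankdata

values = [3.2, 1.5, 3.2, 0.9, 2.4]

# Default method='average': tied values share the mean of their ranks,
# so the two 3.2 entries (ranks 4 and 5) both receive 4.5
ranks = rankdata(values)
```

This is the same tie-handling convention used by the Wilcoxon and Kruskal-Wallis procedures.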
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO association between categorical variables testing objective a pedigree chart is a graph which plots parent-child relations Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from wikipedia (https://en.wikipedia.org/wiki/Pedigree_chart) family tree plot.pedigree {kinship} http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/kinship/html/plot.pedigree.html pedigree chart 2 r2 is a correlation coefficient which is computed over the frequencies of 2 dichotomous variables and is used as a measure of Linkage Disequilibrium and as an input data item to the creation of an LD plot Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra R squared measure of LD http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2580747/ r2 measure of LD r2 measure of linkage disequilibrium a stratification rule/criterion is a criterion used to determine population strata so that a stratification process implementing the rule can result in any member of the total population being assigned to one and only one stratum Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from wikipedia: http://en.wikipedia.org/wiki/Stratified_sampling polled on June 7th,2013 stratification rule The dot plot as a representation of a distribution consists of a group of data points plotted on a simple scale. Dot plots are used for continuous, quantitative, univariate data. Data points may be labelled if there are few of them. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia: Wilkinson, Leland (1999). "Dot plots". The American Statistician (American Statistical Association) 53 (3): 276–281. doi:10.2307/2686111 Wilkinson dot plot volcano plot is a kind of scatter plot which graphs the negative log of the p-value (significance) on the y-axis versus the log2 fold-change between 2 conditions on the x-axis.
It is a popular method for visualizing differential occurrence of variables between 2 conditions. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Volcano_plot_(statistics) volcanoplot(fit, coef=1, highlight=0, names=fit$genes$ID, ...) http://rss.acs.unt.edu/Rdoc/library/limma/html/volcanoplot.html volcano plot 99 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/ a confidence interval which covers 99% of the sampling distribution, meaning that there is a 1% risk of a false positive (type I error) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra confidence interval at 1% of type I error rate STATO 99% confidence interval Altman Box and Whisker plot is a variation of the Tukey Box and Whisker plot which uses the criteria of Altman to create the 'whiskers' of the plot. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Altman, D.G. Practical Statistics for Medical Research (Chapman and Hall, 1991). Altman box and whisker plot 2 2 http://www.biomedcentral.com/1471-2288/11/58#B9 the Breslow-Day test is a statistical test which evaluates if the odds ratios are homogeneous across N 2x2 contingency tables, for instance several 2x2 contingency tables associated with different strata of a stratified population when evaluating the relationship between exposure and outcome, or associated with the different samples coming from several centres in a multicentric study in a clinical trial context.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/Odds_ratio#Statistical_inference polled on June 8th,2013 Breslow-Day test http://www.math.montana.edu/~jimrc/classes/stat524/Rcode/breslowday.test.r Breslow-Day test for homogeneity of odds ratio a sphericity test is a null hypothesis statistical testing procedure which posits a null hypothesis of equality of the variances of the differences between levels of the repeated measures factor Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Sphericity#Sphericity_in_statistics) test of data sphericity sphericity test Hotelling T squared distribution is a probability distribution used in multivariate hypothesis testing, which is a univariate distribution proportional to the F-distribution and arises importantly as the distribution of a set of statistics which are natural generalizations of the statistics underlying Student's t-distribution. In particular, the distribution arises in multivariate statistics in undertaking tests of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a t-test. The distribution is named for Harold Hotelling, who developed it as a generalization of Student's t-distribution. This distribution is commonly used to describe the sample Mahalanobis distance between two populations. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia "http://en.wikipedia.org/wiki/Hotelling's_T-squared_distribution" last polled: 2013-11-09 Hotelling T2 distribution A post-hoc analysis is a statistical test carried out following an analysis of variance which ruled out the null hypothesis of absence of difference between groups, and which allows identifying which groups differ.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra a posteriori test adapted from wikipedia: http://en.wikipedia.org/wiki/Post-hoc_analysis last accessed: 2013-11-15 post-hoc analysis specificity is a measurement datum qualifying a binary classification test and is computed by subtracting the false positive rate from 1 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra specificity true negative rate http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789971/ strictly standardized mean difference (SSMD) is a standardized mean difference which corresponds to the ratio of the mean to the standard deviation of the difference between two groups. SSMD directly measures the magnitude of difference between two groups. SSMD is widely used in High Content Screening for hit selection and quality control. When the data is preprocessed using log-transformation as normally done in HTS experiments, SSMD is the mean of log fold change divided by the standard deviation of log fold change with respect to a negative reference. In other words, SSMD is the average fold change (on the log scale) penalized by the variability of fold change (on the log scale). For quality control, one index for the quality of an HTS assay is the magnitude of difference between a positive control and a negative reference in an assay plate. For hit selection, the size of effects of a compound (i.e., a small molecule or an siRNA) is represented by the magnitude of difference between the compound and a negative reference. Therefore, SSMD can be used for both quality control and hit selection in HTS experiments.
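The specificity definition above reduces to a one-line computation from the confusion-matrix counts; a minimal sketch with made-up counts:

```python
def specificity(tn, fp):
    """True negative rate = TN / (TN + FP) = 1 - false positive rate."""
    return tn / (tn + fp)

# Illustrative counts: 90 true negatives, 10 false positives
fpr = 10 / (90 + 10)            # false positive rate = 0.1
spec = specificity(tn=90, fp=10)  # 0.9, i.e. 1 - fpr
```

The two formulations agree because the false positive rate is FP / (TN + FP).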
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/SSMD http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789971/ strictly standardized mean difference 2 Tarone's test for homogeneity of odds ratio is a statistical test which evaluates the null hypothesis that odds ratios are homogeneous Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Tarone, R. E. ‘On heterogeneity tests based on efficient scores’, Biometrika, 72, 91-95 (1985). > library("metafor") > calcTaronesTest <- function(mylist,referencerow=2) http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/en/latest/src/biomedicalstats.html#calculating-the-mantel-haenszel-odds-ratio-when-there-is-a-stratifying-variable Tarone's test for homogeneity of odds ratio 2 a homoskedasticity test is a statistical test aiming to evaluate whether the variances from several random samples are similar Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO equivariance test homoskedasticity test 1 1 2 a 2x2 contingency table is a contingency table built for 2 dichotomous variables (i.e. 2 categorical variables, each with only 2 possible outcomes). It is the simplest of contingency tables Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO 2x2 contingency table xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE, na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html flat contingency tables: ftable(x, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ftable.html 2 by 2 contingency table pairing patients by age, pairing animals by body weight range a subject pairing is a planned process which executes a pairing rule and results in the creation of sets of 2 subjects meeting the pairing criteria Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO subject pairing 2 a contingency table is a data item which displays the (multivariate) frequency distribution of the possible values of categorical variables. The first row of the table corresponds to categories of one categorical variable, the first column of the table corresponds to categories of the other categorical variable, and the cells corresponding to each combination of categories are filled with the observed occurrences in the sample being considered. The table also contains marginal totals (marginal sums) and the grand total of the occurrences The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Contingency_table) xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE, na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html flat contingency tables: ftable(x, ...)
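The cross-tabulation described above (the Python analogue of R's xtabs) can be built from two categorical variables with only the standard library; a minimal sketch on made-up exposure/outcome labels:

```python
from collections import Counter

# Illustrative paired observations of two categorical variables
exposure = ['exposed', 'exposed', 'unexposed', 'exposed', 'unexposed']
outcome  = ['case',    'control', 'case',      'case',    'control']

# Count each (row category, column category) combination
counts = Counter(zip(exposure, outcome))
rows, cols = ['exposed', 'unexposed'], ['case', 'control']
table = [[counts[(r, c)] for c in cols] for r in rows]

# Marginal totals and the grand total
row_totals = [sum(r) for r in table]
grand_total = sum(row_totals)
```

The resulting nested list can be passed straight to scipy.stats.chi2_contingency.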
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ftable.html contingency table acute toxicity study is an investigation which uses interventions organized according to a factorial design and a parallel group design to observe the effect of high-dose xenobiotics in animal models or cellular models Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra acute toxicity study 2 2 -1 1 The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO r r statistics correlation coefficient 2 A Bayesian model selection is a data transformation which is based on Bayesian statistics to compute a Bayes factor in order to evaluate which model best explains the data. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia Bayesian model selection for example, model parameter estimates could be produced using regression analysis (used in a model estimation process), which attempts to express the response variable in terms of a function of predictor variables and model parameters. a model parameter estimate is a data item which results from a model parameter estimation process and which provides a numerical value about a model parameter. textual definition modified following contribution by Thomas Nichols: https://github.com/ISA-tools/stato/issues/18 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO model parameter estimate the geometric distribution is a negative binomial distribution where r is 1. It is useful for modeling the runs of consecutive successes (or failures) in repeated independent trials of a system.
The geometric distribution models the number of failures before the first success in an independent succession of trials where each trial results in success or failure. The geometric distribution with prob = p has density p(x) = p (1-p)^x for x = 0, 1, 2, …, 0 < p ≤ 1. If an element of x is not integer, the result of dgeom is zero, with a warning. The quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Geometric.html http://www.mathworks.co.uk/help/stats/geometric-distribution.html dgeom(x, prob, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Geometric.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html#scipy.stats.geom geometric distribution a null hypothesis stating that there are differences observed between groups of subjects Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO presence of between group difference hypothesis Linkage Disequilibrium plot is a graph which represents pairwise linkage disequilibrium measures between SNPs as a heatmap Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from R documentation (http://cran.r-project.org/web/packages/LDheatmap/index.html) LD plot Linkage Disequilibrium plot LD plot 1 1 1 The Cochran-Armitage test is a statistical test used in categorical data analysis when the aim is to assess the presence of an association between a dichotomous variable (variable with two categories) and a polychotomous variable (a variable with k categories). The two-level variable represents the response, and the other represents an explanatory variable with ordered levels.
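One pitfall when moving between the R and SciPy references cited above: scipy.stats.geom is parameterized over the trial on which the first success occurs (support k = 1, 2, …), while R's dgeom counts the failures before that success (support x = 0, 1, …), matching the density p(1-p)^x given above. A minimal sketch showing the one-unit shift:

```python
from scipy.stats import geom

p = 0.25

# SciPy: P(first success occurs on trial k), pmf(k) = (1-p)^(k-1) * p
pmf_k1 = geom.pmf(1, p)          # P(success on the very first trial) = p

# To reproduce R's dgeom(x, p), shift the argument by one
x = 0                             # zero failures before the first success
r_dgeom_x0 = geom.pmf(x + 1, p)   # equals dgeom(0, 0.25) in R
```

Both calls above evaluate to p = 0.25, since zero failures before the first success is the same event as success on trial one.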
The null hypothesis is the hypothesis of no trend, which means that the binomial proportion is the same for all levels of the explanatory variable. For example, doses of a treatment can be ordered as 'low', 'medium', and 'high', and we may suspect that the treatment benefit cannot become smaller as the dose increases. The trend test is often used as a genotype-based test for case-control genetic association studies. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra CATT http://en.wikipedia.org/wiki/Cochran%E2%80%93Armitage_test_for_trend Cochran-Armitage test for trend binomial logistic regression model is a model which attempts to explain the data distribution associated with a *dichotomous* response/dependent variable in terms of values assumed by the independent variable, using a function of the predictor/independent variable(s): the function used in this instance of regression modeling is the logistic function. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Logistic_regression) polled in June 2013 binomial logistic regression for analysis of dichotomous dependent variable a minimum value is a data item which denotes the smallest value found in a dataset or resulting from a calculation. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO minimum value maximum value is a data item which denotes the largest value found in a dataset or resulting from a calculation.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO maximum value a quartile is a quantile which splits ordered data into four sections of 25% of the data each, so the first quartile delineates 25% of the data, the second quartile delineates 50% of the data and the third quartile, 75% of the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Quartile) quartile 1 2 http://arxiv.org/pdf/1007.1094.pdf The one-sample Hotelling’s T2 is the multivariate extension of the common one-sample or paired Student’s t-test. In a one-sample t-test, the mean response is compared against a specific value. Hotelling’s one-sample T2 is used when the number of response variables is two or more, although it can be used when there is only one response variable. T2 makes the usual assumption that the data are approximately multivariate normal. Randomization tests are provided that do not rely on this assumption. These randomization tests should be used whenever you want exact results that do not rely on several assumptions. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Hotellings_One-Sample_T2.pdf http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/T2.test.html one sample Hotelling T2 test a violin plot is a plot combining the features of a box plot and a kernel density plot. The violin plot is therefore similar to a box plot but it incorporates in the display the probability density of the data at different values. Typically violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Violin_plot and Hintze, J. L. and R. D. Nelson (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181-4.
http://www.inside-r.org/packages/cran/vioplot/docs/vioplot vioplot( x, ..., range=1.5, h, ylim, names, horizontal=FALSE, col="magenta", border="black", lty=1, lwd=1, rectCol="black", colMed="white", pchMed=19, at, add=FALSE, wex=1, drawRect=TRUE) violin plot 2 meta-analysis is a data transformation which uses the effect size estimates from several independent quantitative scientific studies addressing the same question in order to assess finding consistency. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/Metaanalysis last accessed: 2013-11-15 meta analysis the Scheffe test is a data transformation which evaluates all possible contrasts, adjusting the significance levels to account for multiple comparisons. The test is therefore conservative. Confidence intervals can be constructed for the corresponding contrasts. It was developed by the American statistician Henry Scheffé in 1959. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Scheffé's_method) http://www.inside-r.org/packages/cran/agricolae/docs/scheffe.test Scheffe test the LSD test is a statistical test for multiple comparisons of treatments by means of least significant difference following an ANOVA analysis Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra R LSD test http://rss.acs.unt.edu/Rdoc/library/agricolae/html/LSD.test.html Least significant difference test a null hypothesis which states that an association exists between 2 categorical variables Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO presence of association hypothesis Stacked bar chart is a bar chart which is used to compare overall quantities across items while showing the contribution of each category to the total amount.
Stacked bar charts can be used to highlight the total, as they visually aggregate all of the categories in a group while indicating a part-to-whole relationship. The downside is that it becomes harder to compare the sizes of the individual categories. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from http://www2.le.ac.uk/offices/ld/resources/numeracy/bar-charts and http://blog.visual.ly/how-groups-stack-up-when-to-use-grouped-vs-stacked-column-charts/ [last accessed: 2014-03-04] barplot(height, ..., beside = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/barplot.html stacked bar chart 2 The exponential distribution (a.k.a. negative exponential distribution) is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Exponential.html dexp(x, rate = 1, log = FALSE) pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE) qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE) rexp(n, rate = 1) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html#scipy.stats.expon exponential distribution variable distribution is a data item which denotes how the data points making up a variable are distributed. A variable distribution may be compared to a known probability distribution using a goodness of fit test or by plotting a quantile-quantile plot for visual assessment of the fit.
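The quantile-quantile comparison just described can be sketched in Python with only the standard library (an illustrative sketch, not STATO code; the simulated sample and the choice of deciles are assumptions):

```python
# Hedged sketch: the computation underlying a quantile-quantile plot —
# pairing a sample's quantiles with the quantiles of a theoretical
# (here standard normal) distribution.
import random
import statistics

random.seed(0)
sample = [random.gauss(0, 1) for _ in range(1000)]  # simulated data

# Empirical deciles of the sample (9 cut points)
empirical = statistics.quantiles(sample, n=10)

# Theoretical deciles of the standard normal distribution
normal = statistics.NormalDist(0, 1)
theoretical = [normal.inv_cdf(i / 10) for i in range(1, 10)]

# For a good fit, the paired quantiles lie close to the line y = x
for e, t in zip(empirical, theoretical):
    print(round(e, 2), round(t, 2))
```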
TODO: Probably need to drop it Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO data distribution data distribution the role played by an entity that is part of a study group as defined by an experimental design and realized in a data analysis and data interpretation Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra experimental unit role trimmed mean or truncated mean is a measure of central tendency which involves the calculation of the mean after discarding given parts of a probability distribution or sample at the high and low end, typically discarding an equal amount at both ends Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia [last accessed 2014-03-04] http://en.wikipedia.org/wiki/Truncated_mean truncated mean scipy.stats.tmean(a, limits=None, inclusive=(True, True)) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tmean.html#scipy.stats.tmean source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L684 trimmed mean The interquartile range is a data item which corresponds to the difference between the upper quartile (3rd quartile) and the lower quartile (1st quartile). The interquartile range contains the second quartile, or median. The interquartile range is a data item providing a measure of data dispersion Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from wikipedia, wolfram alpha and oxford dictionary of statistics IQR(x, na.rm = FALSE, type = 7) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/IQR.html inter quartile range a pie chart is a graph in which a circle is divided into sectors illustrating numerical proportion, meaning that the arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents.
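The proportionality just stated for pie chart sectors can be sketched directly (a minimal illustration, not STATO code; the quantities are invented):

```python
# Minimal sketch: each pie chart sector's central angle is proportional
# to the quantity it represents, so the angles sum to 360 degrees.
quantities = [30, 45, 25]  # illustrative values
total = sum(quantities)

angles = [360 * q / total for q in quantities]
print(angles)  # [108.0, 162.0, 90.0]
```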
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia, last accessed [2014-03-05] http://en.wikipedia.org/wiki/Pie_chart pie(x, labels = names(x), edges = 200, radius = 0.8, clockwise = FALSE, init.angle = if(clockwise) 90 else 0, density = NULL, angle = 45, col = NULL, border = NULL, lty = NULL, main = NULL, ...) https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pie.html pie chart A bar chart is appropriate to represent counts of data. The bar chart is a graph resulting from plotting rectangular bars with lengths proportional to the values that they represent. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia (http://en.wikipedia.org/wiki/Bar_chart) polled in June 2013 bar plot barplot(height, ...) http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/barplot.html bar chart the first quartile is a quartile which delineates the lower 25% of the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO first quartile a real time quantitative pcr plot is a line graph which plots the signal fluorescence intensity as a function of the number of PCR cycles Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra real time quantitative pcr plot Fold change is a number describing how much a quantity changes going from an initial to a final value or from one condition to another 30/04/2014 - removed restriction: 'is about' exactly 2 'study group population' - need more discussion for the relationship of fold change to study group populations for particular examples. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Fold_change fold change the third quartile is a quartile which delineates the lower 75% of the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO third quartile Spear Box and Whisker plot is a variation of the Tukey Box and Whisker plot which uses the criteria of Spear to create the 'whiskers' of the plot.
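The difference between the two whisker conventions just mentioned can be sketched in Python (a hedged illustration, not STATO code; it assumes Spear's criterion extends whiskers to the sample minimum and maximum, while Tukey's convention caps them at 1.5 × IQR beyond the quartiles):

```python
# Hedged sketch contrasting whisker criteria for box-and-whisker plots.
import statistics

data = [1, 2, 2, 3, 4, 4, 5, 6, 7, 30]  # 30 is an outlier

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1

# Spear's criterion: whiskers at the sample minimum and maximum
spear_whiskers = (min(data), max(data))

# Tukey's convention: whiskers at the most extreme points within 1.5 * IQR
tukey_low = min(x for x in data if x >= q1 - 1.5 * iqr)
tukey_high = max(x for x in data if x <= q3 + 1.5 * iqr)

print(spear_whiskers, (tukey_low, tukey_high))  # (1, 30) (1, 7)
```

With Tukey's convention the outlier 30 falls outside the whisker; with Spear's it defines the whisker end.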
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Spear, M.E. Charting Statistics (McGraw-Hill, 1952) Spear box and whisker plot expected fragments per kilobase of transcript per million fragments mapped is a metric used to report transcript expression as generated by RNA-Seq using a paired-end library. The calculated value results from 2 types of normalization: one to take into account the difference in read counts associated with transcript length (at equal abundance, longer transcripts will have more reads than shorter transcripts), hence the 'per kilobase of transcript', and the other to take into account different sequencing depths during distinct sequencing runs, hence the 'per million fragments mapped'. The metric is specifically produced by the Cufflinks software. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra FPKM adapted from: http://seqanswers.com/forums/showthread.php?t=3254 and from http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html fragments per kilobase of transcript per million fragments mapped homogeneity testing objective is the objective of a data transformation to test a null hypothesis that two or more sub-groups of a population share the same distribution of a single categorical variable. For example, do people of different countries have the same proportion of smokers to non-smokers? Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO homogeneity test objective A forest plot is a graph designed to illustrate the relative strength of treatment effects in multiple quantitative scientific studies addressing the same question.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Forest_plot metaplot(mn, se, nn=NULL, labels=NULL, conf.level=0.95, xlab="Odds ratio", ylab="Study Reference",xlim=NULL, summn=NULL, sumse=NULL, sumnn=NULL, summlabel="Summary", logeffect=FALSE, lwd=2, boxsize=1, zero=as.numeric(logeffect), colors=meta.colors(), xaxt="s", logticks=TRUE, ...) http://rss.acs.unt.edu/Rdoc/library/rmeta/html/metaplot.html Forest plot http://stat.ethz.ch/R-manual/R-patched/library/stats/html/confint.html confidence interval calculation is a data transformation which determines a confidence interval for a given statistical parameter Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO confidence interval calculation t-statistic is a statistic computed from observations and used to produce a p-value in a statistical test when compared to a Student's t distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO T t-statistic the beta distribution is a continuous probability distribution defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia: http://en.wikipedia.org/wiki/Beta_distribution http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Beta.html dbeta(x, shape1, shape2, ncp = 0, log = FALSE) pbeta(q, shape1, shape2, ncp = 0, lower.tail = TRUE, log.p = FALSE) qbeta(p, shape1, shape2, ncp = 0, lower.tail = TRUE, log.p = FALSE) rbeta(n, shape1, shape2, ncp = 0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html#scipy.stats.beta beta distribution Kurtosis is a data item which denotes the degree of peakedness of a distribution. It is defined as a normalized form of the fourth central moment of a distribution.
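The normalized fourth central moment just mentioned can be sketched in plain Python (a hedged illustration, not STATO code; it assumes the uncorrected population formula m4 / m2², which equals 3 for a normal distribution):

```python
# Hedged sketch: kurtosis as the fourth central moment normalized by the
# squared second central moment (population formula, no bias correction).
def kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2

print(kurtosis([2, 2, 4, 6, 8, 8]))
```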
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://mathworld.wolfram.com/Kurtosis.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html#scipy.stats.kurtosis kurtosis 1 1 ANCOVA or analysis of covariance is a data transformation which evaluates if the population means of a dependent variable are equal across levels of categorical independent variables while controlling for the effects of other continuous variables, known as covariates. Therefore, when performing ANCOVA, we are adjusting the dependent variable means to what they would be if all groups were equal on the covariates. It augments the ANOVA model with one or more additional quantitative variables, called covariates, which are related to the response variable. The covariates are included to reduce the variance in the error terms and provide more precise measurement of the treatment effects. ANCOVA is used to test the main and interaction effects of the factors, while controlling for the effects of the covariates. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia ANCOVA 1 0 standard normal distribution is a normal distribution with variance = 1 and mean = 0. We need to formally set values for the mean and variance Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra dnorm(x, mean = 0, sd = 1, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Normal.html standard normal distribution Hardy-Weinberg equilibrium test is a statistical test which aims to evaluate if a population's allele proportions are stable or not. It is used as a means of quality control to evaluate the possibility of genotyping error or population structure.
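The chi-square computation behind the Hardy-Weinberg test can be sketched in plain Python (a hedged illustration, not STATO code, using the same genotype counts as the R example that follows; note that R's HWChisq may apply a continuity correction, so the statistic here is the uncorrected version):

```python
# Hedged sketch: uncorrected Hardy-Weinberg chi-square statistic from
# genotype counts (AA, AB, BB).
n_aa, n_ab, n_bb = 298, 489, 213
n = n_aa + n_ab + n_bb

p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
q = 1 - p

expected = [n * p * p, 2 * n * p * q, n * q * q]  # HWE-expected counts
observed = [n_aa, n_ab, n_bb]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))
```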
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO: adapted from wikipedia (http://en.wikipedia.org/wiki/Hardy–Weinberg_principle) > library(HardyWeinberg) > x <- c(298,489,213) > HW.test <- HWChisq(x,verbose=TRUE) http://cran.r-project.org/web/packages/HardyWeinberg/index.html Hardy-Weinberg equilibrium testing 2 4 Odds ratio is a ratio that measures effect size, that is, the strength of association between 2 dichotomous variables, one describing an exposure and one describing an outcome. It represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure (the probability of the event occurring divided by the probability of the event not occurring). The odds ratio is a ratio describing the strength of association or non-independence between two binary data values, formed as the ratio of the odds for the first group and the odds for the second group. Odds ratios are used when one wants to compare the odds of something occurring in two different groups. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/ http://www.stats.org/stories/2008/odds_ratios_april4_2008.html OR odds ratio sphericity testing objective is a statistical objective of a data transformation which aims to test whether a null hypothesis of sphericity holds. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO sphericity testing objective sphericity testing objective A ratio is a data item formed from two numbers r and s, written r/s, where r is the numerator and s is the denominator. The ratio of r to s is equivalent to the quotient r/s.
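The odds ratio defined earlier in this section is one such ratio, formed from two odds; a minimal Python sketch (not STATO code; the 2×2 counts are invented):

```python
# Minimal sketch: odds ratio from a 2x2 table of exposure vs outcome.
exposed_with, exposed_without = 20, 80      # outcome counts, exposed group
unexposed_with, unexposed_without = 10, 90  # outcome counts, unexposed group

odds_exposed = exposed_with / exposed_without      # odds of outcome if exposed
odds_unexposed = unexposed_with / unexposed_without

odds_ratio = odds_exposed / odds_unexposed
print(odds_ratio)  # 2.25: odds of the outcome are 2.25 times higher when exposed
```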
review formal definition as both numerator and denominator should be of the same type, not just some data item Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wolfram Alpha: https://www.wolframalpha.com/share/clip?f=d41d8cd98f00b204e9800998ecf8427efdcsig76g7 ratio 1 2 1 a 2 by n contingency table is a contingency table built for one dichotomous variable (a categorical variable with only 2 outcomes) and one polychotomous variable (a categorical variable with at least 2 outcomes) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE, na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html flat contingency tables: ftable(x, ...) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ftable.html 2 by n contingency table Lineweaver-Burk plot is a graph which is the graphical representation of the Lineweaver–Burk equation of enzyme kinetics, described by Hans Lineweaver and Dean Burk in 1934. The plot provides a useful graphical method for analysis of the Michaelis–Menten equation. It was widely used to determine important terms in enzymology and enzyme kinetics, as the x-intercept of the graph represents −1/Km and the y-intercept of such a graph is equivalent to the inverse of Vmax. TODO: create 'inverse function' and replace 'data transformation' in the assertions Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra double reciprocal plot Lineweaver-Burk plot 2 1 Tukey Honestly Significant Difference (HSD) test is a statistical test used following an ANOVA test yielding a statistically significant p-value in order to determine which means are different, to a given level of significance. The Tukey HSD test relies on the q-distribution.
The procedure is conservative, meaning that if sample sizes (the sizes of different study groups) are equal, the risk of a Type I error is exactly α, and if sample sizes are unequal it is less than α. IMPORTANT: Do not confuse the Tukey HSD test with the Tukey Mean Difference Test (Bland-Altman test) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.tc3.edu/instruct/sbrown/stat/anova1.htm#ANOVAprereq Tukey's honestly significant difference http://stat.ethz.ch/R-manual/R-patched/library/stats/html/TukeyHSD.html Tukey HSD for Post-Hoc Analysis average log signal intensity is a data item which corresponds to the sum of 2 distinct logarithm base 2 transformed signal intensities, each corresponding to a distinct condition of signal acquisition, divided by 2. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/MA_plot last accessed: 2014-03-13 A-value average log signal intensity 1 A mixed model is a statistical model containing both fixed effects and random effects. These models are useful in a wide variety of disciplines in the physical, biological and social sciences. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study), or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA. PRS: this is a stub and more work is needed to reconcile conflicting definitions Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia mixed effect model Threshold cycle (or Ct or Cq) is a count which is defined as the fractional PCR cycle number at which the reporter fluorescence is greater than the threshold in the context of the RT-qPCR assay.
The Ct is a basic principle of real time PCR and is an essential component in producing accurate and reproducible data. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Cq Ct http://www.ncbi.nlm.nih.gov/genome/probe/doc/TechQPCR.shtml threshold cycle a goodness of fit statistical test is a statistical test which aims to evaluate if a sample distribution can be considered equivalent to a theoretical distribution used as input Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra goodness of fit statistical test a cartesian product is a data transformation which operates on n sets to produce a set of all possible ordered n-tuples where each element of the tuple comes from one of the sets Alejandra Gonzalez-Beltran Orlaith Burke PERSON: Philippe Rocca-Serra adapted from math wolfram (http://mathworld.wolfram.com/CartesianProduct.html) cartesian product is a population whose individual members realize (may be expressed as) a combination of inclusion rule value specifications or result from a sampling process (e.g. recruitment followed by randomization to group) on which a number of measurements will be carried out, which may be used as input to statistical tests and statistical inference. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO statistical sample study group population self explanatory Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra cartesian product 2 sets A non-negative integer defining how many combinations of factor levels (or treatments in the statistical sense) are to be used in a study. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO number of factor level combinations http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/ A confidence interval is a data item which defines a range of values in which a measurement or trial falls corresponding to a given probability.
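A confidence interval of the kind just defined can be sketched with the Python standard library (a hedged illustration, not STATO code; it assumes a normal approximation for the mean, and the data values are invented):

```python
# Hedged sketch: a normal-approximation 95% confidence interval for a mean.
import math
import statistics

data = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]  # illustrative values
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))  # standard error

z = statistics.NormalDist().inv_cdf(0.975)  # ~1.96 for a 95% interval
interval = (mean - z * se, mean + z * se)
print(interval)
```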
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://mathworld.wolfram.com/ConfidenceInterval.html confidence interval a genomic coordinate system is a coordinate system to describe the position of a sequence on a genomic scaffold (assembly of chromosomes, contigs, ...) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra ensembl, ucsc genomic coordinate system a statistical test which makes no assumption about the underlying data distribution Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO non-parametric test the Mauchly's test for sphericity is a statistical test which evaluates if the variances of the differences between all combinations of the groups are equal, a property known as 'sphericity' in the context of repeated measures. It is used for instance prior to a repeated measures ANOVA. The test works by assessing if a Wishart-distributed covariance matrix (or a transformation thereof) is proportional to a given matrix. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB-PRS, adapted from wikipedia (http://en.wikipedia.org/wiki/Mauchly's_sphericity_test) polled on june,10th, 2013 and from R manual: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/mauchly.test.html Mauchly's test for sphericity mauchly.test(object, ...) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/mauchly.test.html Mauchly's test for sphericity the statistical test power is a data item which is about a statistical test and is obtained by subtracting the false negative rate (type II error rate) from 1. The power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis (Greene 2000). The statistical power is the ability of a test to detect an effect, if the effect actually exists (High 2000).
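The relation power = 1 − β just described can be sketched for a simple case (a hedged illustration, not STATO code; it assumes a one-sided z-test with known standard deviation, and the effect size, α, and sample size are invented):

```python
# Hedged sketch: statistical power as 1 minus the type II error rate (beta)
# for a one-sided z-test with a known standard deviation.
import math
import statistics

alpha = 0.05
effect, sigma, n = 0.5, 1.0, 30  # assumed effect size and sample size

z_crit = statistics.NormalDist().inv_cdf(1 - alpha)
# beta: probability of failing to reject when the effect is real
beta = statistics.NormalDist().cdf(z_crit - effect * math.sqrt(n) / sigma)

power = 1 - beta  # probability of correctly rejecting a false null hypothesis
print(round(power, 3))
```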
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia (http://en.wikipedia.org/wiki/Statistical_power), polled June 10th, 2013 statistical test power 2 Spearman's rank correlation coefficient is a correlation coefficient which is a nonparametric measure of statistical dependence between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. Spearman's coefficient may be used when the conditions for computing Pearson's correlation are not met (e.g. linearity, normality of the 2 continuous variables) but may require a ranking transformation of the variables Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Spearman's rho http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient cor(x, y = NULL, use = "everything", method = c("spearman")) scipy.stats.spearmanr(a, b=None, axis=0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2643 Spearman's rank correlation coefficient within subject comparison statistical test is a kind of statistical test which evaluates if a change occurs within one experimental unit over time following a treatment or an event Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO within subject comparison statistical test a cohort is a study group population whose members are human beings who meet inclusion criteria and undergo a longitudinal design possibly submit to 'Population and Community Ontology' Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO cohort the F-distribution is a continuous probability distribution which arises in the testing of whether two observed samples have the same variance.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Fisher distribution Snedecor Fisher distribution http://mathworld.wolfram.com/F-Distribution.html df(x, df1, df2, ncp, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Fdist.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html#scipy.stats.f F-distribution RPKM is a kind of count which numbers the sequence reads found per kilobase of transcript, normalized to millions of mapped sequence reads. RPKM is a metric generated by the ERANGE software tool as reported by Mortazavi et al. in 2008. The metric has been enhanced and replaced by FPKM to better take into account splice variants. FPKM uses a statistical model to perform the computation. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra RPKM http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.html reads per kilobase of transcript per million fragments mapped a planned process which establishes and states the different hypotheses to be evaluated during a null hypothesis statistical test Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO specifying null and alternate hypothesis An alternative hypothesis is a hypothesis defined in a statistical test that is the opposite of the null hypothesis. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO alternative hypothesis PMID:12892658 "Two formulas for computation of the area under the curve represent measures of total hormone concentration versus time-dependent change." area under curve is a measurement datum which corresponds to the surface defined by the x-axis and bounded by the line graph represented in a 2 dimensional plot, resulting from an integration or integrative calculus.
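For sampled (x, y) points, the area under the curve can be approximated numerically; a minimal sketch using the trapezoidal rule (not STATO code; the sampled curve y = x² is invented for illustration):

```python
# Minimal sketch: area under a curve from sampled points, trapezoidal rule.
def auc_trapezoid(xs, ys):
    # Sum of trapezoid areas between consecutive sample points.
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))

xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]  # samples of y = x**2
print(auc_trapezoid(xs, ys))  # 22.0, close to the exact integral 64/3
```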
The interpretation of this measurement datum depends on the variables plotted in the graph. PRS: submit 'integral calculus' as a kind of data transformation in OBI:DT branch Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra area under curve is a data item formed by dividing the fluorescence intensity obtained in one channel by that obtained in the other channel, typically the case when considering 2-color microarray data when imaging is done for Cy3 and Cy5 dyes. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO channel1/channel2 fluorescence intensity ratio channel1/channel2 fluorescence intensity ratio odds ratio homogeneity hypothesis is a null hypothesis stating that all odds ratios are homogeneous, that is, they remain within the same range. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO odds ratio homogeneity hypothesis odds ratio homogeneity hypothesis 2 a tetrachoric correlation coefficient is a polychoric correlation coefficient for 2 dichotomous variables used as a proxy for the correlation between 2 continuous latent variables.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://www.rasch.org/rmt/rmt193c.htm and http://en.wikipedia.org/wiki/Polychoric_correlation tetrachoric correlation coefficient discretization is a process converting a continuous variable into a polychotomous variable by concretizing a set of discretization rules Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB,PRS adapted from wikipedia (http://en.wikipedia.org/wiki/Discretization) http://cran.r-project.org/web/packages/discretization/index.html continuous variable discretization 50 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/ a confidence interval which covers 50% of the sampling distribution, meaning that there is a 50% risk of false positive (type I error) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra confidence interval at 50% of type I error rate STATO 50% confidence interval probit regression model is a model which attempts to explain the data distribution associated with an *ordinal* response/dependent variable in terms of values assumed by the independent variable(s); it uses a function of the predictor/independent variable(s): the function used in this instance of regression modeling is the ordered probit function.
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Probit_model) polled in June 2013 ordered probit regression for analysis of ordinal dependent variable a stratum population is a population resulting from a population stratification prior to a sampling process which aims to produce homogeneous subpopulations from a heterogeneous population by applying one or more stratification criteria Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO stratum population a null hypothesis which states that a given matrix is proportional to a Wishart-distributed covariance matrix Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO hypothesis of sphericity sphericity hypothesis Model fitting is a data transformation process which evaluates if a model appropriately represents a dataset. A model fitting process tests the goodness of fit of the model to the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra model fitting a real time pcr standard curve is a line graph which plots the fluorescence intensity signal as a function of the concentration of a sample used as reference and used to determine the relative abundance of test samples Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://www.sigmaaldrich.com/content/dam/sigma-aldrich/docs/Sigma/General_Information/qpcr_technical_guide.pdf and http://www.lifetechnologies.com/uk/en/home/life-science/pcr/real-time-pcr/qpcr-education/absolute-vs-relative-quantification-for-qpcr.html RT-PCR standard curve the false negative rate is a data item which denotes the proportion of missed detections of elements known to be meeting the detection criteria Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO, adapted from type II error rate β false negative rate a random variable (or aleatory variable or stochastic variable) in probability and statistics, is a variable whose value is subject to
variations due to chance (i.e. randomness, in a mathematical sense) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra aleatory variable stochastic variable wikipedia: http://en.wikipedia.org/wiki/Random_variable random variable 3 graeco-latin square design is a study design which, in its simplest form, allows controlling 3 nuisance variables (also known as blocking variables). The 3 nuisance factors are divided into a tabular grid with the property that each row and each column receive each treatment exactly once. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra graeco-latin square design group assignment based on blocking variable specification is a kind of group assignment process which takes into account the levels assumed by a blocking variable to allocate subjects or experimental units to a treatment group Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO group assignment based on blocking variable specification A testing objective to ensure that the sample used in a statistical test actually follows a normal distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO goodness of fit testing objective A probability distribution is an information content entity that specifies the probability of the value of a random variable. For a discrete random variable, a mathematical formula that gives the probability of each value of the variable. For a continuous random variable, a curve described by a mathematical formula which specifies, by way of areas under the curve, the probability that the variable falls within a particular interval. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra probability distribution It is a testing objective to ensure the variances of the different groups used in a statistical test are similar (i.e. not too different).
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra homoscedasticity testing objective STATO equal variance testing objective 0 a normal distribution is a continuous probability distribution whose probability density function is described here: http://mathworld.wolfram.com/NormalDistribution.html Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Gaussian distribution http://mathworld.wolfram.com/NormalDistribution.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm normal distribution ordinal variable is a categorical variable where the discrete possible values are ordered or correspond to an implicit ranking Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra ranked variable http://udel.edu/~mcdonald/statvartypes.html ordinal variable Chi-square probability distribution with k degrees of freedom is a theoretical probability distribution which corresponds to the distribution of a sum of the squares of k independent standard normal random variables. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra dchisq(x, df, ncp = 0, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Chisquare.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html#scipy.stats.chi2 Chi-square probability distribution the expected value (or expectation, mathematical expectation, EV, mean, or first moment) of a random variable is a data item which corresponds to the weighted average of all possible values that this random variable can take on. The weights used in computing this average correspond to the probabilities in the case of a discrete random variable, or densities in the case of a continuous random variable. From a rigorous theoretical standpoint, the expected value is the integral of the random variable with respect to its probability measure.
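The expected value of a discrete random variable, as defined above, can be sketched as a probability-weighted average (the die example is an illustration, not part of the entry):

```python
# Sketch: expected value of a discrete random variable as the
# probability-weighted average of its possible values.
def expected_value(values, probabilities):
    assert abs(sum(probabilities) - 1.0) < 1e-9  # weights must sum to 1
    return sum(v * p for v, p in zip(values, probabilities))

# A fair six-sided die: E[X] = (1 + 2 + ... + 6) / 6 = 3.5
die_mean = expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
```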
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra first moment mean μ http://en.wikipedia.org/wiki/Expected_value expected value 95 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/ a confidence interval which covers 95% of the sampling distribution, meaning that there is a 5% risk of false positive (type I error). If the number of observations made is large enough, the sampling distribution can be assumed to be normal, which entails that 95% of the sampling distribution falls within roughly 2 (1.96) standard deviations from the mean. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra confidence interval at 5% of type I error rate STATO 95% confidence interval number of PCR cycle is a count which enumerates how many 'annealing, renaturation, amplification' rounds (or cycles) are performed during a polymerase chain reaction (PCR) or an assay relying on PCR. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from various sources including: http://www.ncbi.nlm.nih.gov/genome/probe/doc/TechQPCR.shtml number of PCR cycle sensitivity is a measurement datum qualifying a binary classification test and is computed by subtracting the false negative rate from 1 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra recall sensitivity adapted from: http://en.wikipedia.org/wiki/Sensitivity_and_specificity and http://mathworld.wolfram.com/Sensitivity.html true positive rate a residual is a data item which is the output of an error estimate or model fitting process and which is an observable estimate of the unobservable error Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra residual A genetic association study is a kind of study whose objective is to detect associations between phenotypes, between a phenotype and a genetic polymorphism, or between two genetic polymorphisms.
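The sensitivity entry above (1 minus the false negative rate, equivalently TP / (TP + FN)) can be sketched as follows; the counts are invented for illustration:

```python
# Sketch: sensitivity (true positive rate) of a binary classification
# test, computed as 1 - false negative rate.
def sensitivity(true_positives, false_negatives):
    false_negative_rate = false_negatives / (true_positives + false_negatives)
    return 1 - false_negative_rate

# 90 detected out of 100 truly positive cases
s = sensitivity(true_positives=90, false_negatives=10)
```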
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Genetic_association genetic association study the coefficient of variation is a normalized measure of dispersion of a probability distribution or frequency distribution. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Coefficient_of_variation last accessed: 2013-10-18 scipy.stats.variation(a, axis=0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.variation.html#scipy.stats.variation source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L951 coefficient of variation The standard deviation of a random variable, statistical population, data set, or probability distribution is a measure of variation which corresponds to the average distance from the mean of the dataset to any given point of that dataset. It also corresponds to the square root of its variance. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra σ http://en.wikipedia.org/wiki/Standard_deviation sd(x, na.rm = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/sd.html standard deviation high content screening is a kind of investigation which uses standardized cellular assays to test the effect of substances (RNAi or small molecules) held in libraries on a cellular phenotype. It relies on microscopy imaging and/or flow cytometry, with robotic handling to ensure fast, high-throughput operation. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra high throughput screening adapted from: http://en.wikipedia.org/wiki/High-content_screening high-content screening high throughput screening is a kind of investigation which uses standardized assays (cell-based, enzymatic or chemometric) to test the effect of substances (RNAi or small molecules) held in libraries on a very specific and measurable outcome (e.g. fluorescence intensity).
It relies on robotic handling to ensure speed and high throughput in assay performance, data acquisition and hit selection. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB,PRS high throughput screening 2 Kendall's correlation coefficient is a correlation coefficient between 2 ordinal variables (natively ordinal or following a ranking procedure) and may be used when the conditions for computing Pearson's correlation are not met (e.g. linearity, normality of the 2 continuous variables) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Kendall rank correlation coefficient Kendall's tau (τ) coefficient STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient), polled in June 2013 and from: http://stamash.org/pearsons-correlation-coefficient/ http://stamash.org/kendalls-tau-correlation/ cor(x, y = NULL, use = "everything", method = c("kendall")) scipy.stats.kendalltau(x, y, initial_lexsort=True) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html#scipy.stats.kendalltau source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2827 Kendall's correlation coefficient 2 Q-Q plot or quantile-quantile plot is the output of a graphical method for comparing two probability distributions by plotting their quantiles against each other PRS,AGB: need to add the notion of quantile Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO quantile-quantile plot qqplot(x, y, plot.it = TRUE, xlab = deparse(substitute(x)), ylab = deparse(substitute(y)), ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/qqnorm.html Q-Q plot statistical error is a data item denoting the amount by which an observation differs from the expected value, being based on the whole statistical population from which the statistical unit was chosen randomly Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics last accessed: 18-11-2013 disturbance statistical error A box and whisker plot is appropriate for representing the characteristics of a distribution. A box plot is a graph which plots datasets relying on their quartiles and the interquartile range to create the box and the whiskers. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Tukey box and whisker plot box plot Tukey, J. W. "Box-and-Whisker Plots." §2C in Exploratory Data Analysis. Reading, MA: Addison-Wesley, pp. 39-43, 1977. boxplot boxplot(x, ...) http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/boxplot.html box and whisker plot (Rn+) − (Rn−), where Rn+ = (emission intensity of reporter dye)/(emission intensity of passive reference dye) in PCR with template, and Rn− = (emission intensity of reporter dye)/(emission intensity of passive reference dye) in PCR without template or in early cycles of a real-time reaction. Ct = threshold cycle, i.e., the cycle at which a statistically significant increase in ΔRn is first detected Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://jcm.asm.org/content/38/7/2516.figures-only ΔRn 4 Relative risk is a measurement datum which denotes the risk of an 'event' relative to an 'exposure'. Relative risk is calculated by forming the ratio of the probability of the event occurring in the exposed group versus the probability of this event occurring in the non-exposed group.
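The relative risk calculation described above can be sketched directly from a 2x2 table of counts; the numbers below are invented for illustration:

```python
# Sketch: relative risk as the ratio of event probabilities in the
# exposed and non-exposed groups.
def relative_risk(exposed_events, exposed_total,
                  unexposed_events, unexposed_total):
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    return risk_exposed / risk_unexposed

# 20/100 events among exposed vs 10/100 among unexposed
rr = relative_risk(20, 100, 10, 100)
```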
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra risk ratio relative risk 2 2 Woolf's test is a statistical test which evaluates the null hypothesis that odds ratios are the same across all strata of the population under investigation Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://people.umass.edu/biep640w/pdf/4.%20%20Categorical%20Data%20Analysis%202012.pdf woolf_test(x) where x is a 2 x 2 x k contingency table http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/vcd/html/woolf_test.html Woolf's test odds ratio homogeneity test is a statistical test which aims to evaluate whether the null hypothesis of consistent odds ratios across different strata of the population is true or not Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO odds ratio homogeneity test https://onlinecourses.science.psu.edu/stat503/node/19 Often in medical studies, the blocking factor used is the type of institution. This provides a very useful blocking factor, hopefully removing institutionally related factors such as size of the institution, types of populations served, hospitals versus clinics, etc., that would influence the overall results of the experiment. a blocking variable is an independent variable which is used in a blocking process, part of an experiment, with the purpose of maximizing the signal coming from the main variable. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra nuisance variable https://onlinecourses.science.psu.edu/stat503/node/18 blocking variable a DNA microarray hybridization is an assay relying on nucleic acid hybridization, which uses a DNA microarray device and a nucleic acid as input.
It precedes a data acquisition process Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra DNA microarray hybridization group comparison objective is a data transformation objective which aims to determine whether 2 or more study groups differ with respect to the signal of a response variable Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO group comparison objective "Time to solve an anagram problem" is continuous since it could take 2 minutes, 2.13 minutes etc. to finish a problem A continuous variable is one for which, within the limits of the variable's range, any value is possible. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://davidmlane.com/hyperstat/A97418.html http://udel.edu/~mcdonald/statvartypes.html continuous variable a categorical variable is a variable which can only assume a finite number of values and casts observations into a small number of categories Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra discrete variable nominal variable qualitative factor http://udel.edu/~mcdonald/statvartypes.html https://onlinecourses.science.psu.edu/stat503/node/7 categorical variable the objective of a data transformation which is to test whether a null hypothesis of absence of within-subject difference holds. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO within subject comparison objective The allele frequency is a data item which denotes the incidence of a gene variant in a population. It is calculated as a ratio, by dividing the number of copies of a particular allele by the number of copies of all alleles at the genetic place (locus) in a population. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.nature.com/scitable/definition/allele-frequency-298 allele frequency the objective of a data transformation which is to test whether a null hypothesis of absence of difference between groups holds.
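The allele frequency calculation described above (copies of one allele divided by all allele copies at the locus) can be sketched as follows; the counts are invented:

```python
# Sketch: allele frequency as copies of one allele divided by the
# total allele copies at the locus (2N chromosomes for N diploids).
def allele_frequency(allele_copies, total_allele_copies):
    return allele_copies / total_allele_copies

# 30 copies of allele A among 100 chromosomes in the population
freq_a = allele_frequency(30, 100)
```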
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO between group comparison objective a manhattan plot for gwas is a kind of scatter plot used to facilitate presentation of genome-wide association study (GWAS) data. Genomic coordinates are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism displayed on the Y-axis. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Manhattan_plot plotGrandLinear(obj, ..., facets, space.skip = 0.01, geom = NULL, cutoff = NULL, cutoff.color = "red", cutoff.size = 1, legend = FALSE, xlim, ylim, xlab, ylab, main) http://www.tengfei.name/ggbio/docs/man/plotGrandLinear.html manhattan plot for gwas A domestic group, or a number of domestic groups linked through descent (demonstrated or stipulated) from a common ancestor, marriage, or adoption. import from Population and Community Ontology: http://www.ontobee.org/browser/rdf.php?o=PCO&iri=http://purl.obolibrary.org/obo/PCO_0000020 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://purl.obolibrary.org/obo/PCO_0000020 family a variable is a data item which can assume any of a set of values, either as determined by an agent or as randomly occurring through observation. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from wolfram-alpha (http://www.wolframalpha.com/input/?i=variable) definition 2. and from Oxford English Dictionary: http://www.oed.com/view/Entry/221514?redirectedFrom=variable#eid, definition B,1 variable true true 1 repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit. Repeated measure ANOVA is sensitive to departure from normality (evaluation using Bartlett's test), more so in the case of unbalanced groups (i.e. different sizes of sample populations).
Departure from sphericity (evaluation using Mauchly's test) used to be an issue which is now handled robustly by modern tools such as R's lme4 or nlme, which accommodate dependence assumptions other than sphericity. discussion in https://github.com/ISA-tools/stato/issues/28 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Thomas Nichols ANOVA for correlated samples adapted from wikipedia and https://statistics.laerd.com/statistical-guides/repeated-measures-anova-statistical-guide-3.php http://www.ats.ucla.edu/stat/sas/library/repeated_ut.htm ANOVA for correlated samples http://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_repms.html repeated measure ANOVA 2 1 The Newman–Keuls or Student–Newman–Keuls (SNK) method is a stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other. It was named after Student (1927), D. Newman, and M. Keuls. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance (ANOVA). The Newman–Keuls method is similar to Tukey's range test as both procedures use Studentized range statistics. Compared to Tukey's range test, the Newman–Keuls method is more powerful but less conservative. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: last accessed: 2013-11-15 SNK.test(y, trt, DFerror, MSerror, alpha = 0.05, group=TRUE, main = NULL) http://artax.karlin.mff.cuni.cz/r-help/library/agricolae/html/SNK.test.html Newman-Keuls test post-hoc analysis Bernoulli distribution is a binomial distribution where the number of trials is equal to 1. Notation: B(1,p). The mean is p and the variance is p*q (where q = 1-p). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Bernoulli distribution Galbraith (Radial) plot is a scatter plot which can be used in the meta-analytic context to examine the data for heterogeneity.
For a fixed-effects model, the plot shows the inverse of the standard errors on the horizontal axis against the individual observed effect sizes or outcomes standardized by their corresponding standard errors on the vertical axis. Radial plots were introduced by Rex Galbraith (1988a, 1988b, 1994). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Galbraith, Rex (1988). "Graphical display of estimates having differing standard errors". Technometrics (Technometrics, Vol. 30, No. 3) 30 (3): 271–281. doi:10.2307/1270081. JSTOR 1270081 radial Galbraith plot http://www.inside-r.org/packages/cran/Luminescence/docs/plot_RadialPlot plot_RadialPlot(data, na.exclude = TRUE, negatives = "remove", log.z = TRUE, central.value, centrality = "mean.weighted", plot.ratio, bar.col, grid.col, legend.text, summary = FALSE, stats, line, line.col, line.label, output = FALSE, ...) http://www.metafor-project.org/doku.php/plots:radial_plot Galbraith plot http://isogenic.info/html/9__treatments.html#factorial a factor level combination is one of the possible sets of factor levels resulting from the Cartesian product of the sets of factors and their levels as defined in a factorial design Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra treatment combination STATO factor level combination http://isogenic.info/html/9__treatments.html#factorial A factor level is a data item which corresponds to one of the values assumed by a factor or independent variable manipulated and set by the experimentalist. In the context of factorial design, a factor level is assumed to be, or treated as, a category in a categorical variable Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra treatment AGB-PRS https://onlinecourses.science.psu.edu/stat503/node/7 factor level Bayes factor is a ratio between 2 probabilities of observing data according to 2 distinct models. It is used in Bayesian model selection to evaluate which model best explains the data.
If K<1, the model used in the denominator term is supported; if K>1, the model used in the numerator term is supported. The Bayes factor is about the plausibility of 2 different models Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from Wikipedia last accessed 2013-11-13 Bayes factor grouped bar chart is a kind of bar chart which juxtaposes the discrete values for each of the possible values of a given categorical variable, thus providing within-group comparison. Grouped bar charts are good for comparing each element within the categories, and comparing elements across categories. However, the grouping can make it harder to tell the difference between the totals of each group. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from http://www2.le.ac.uk/offices/ld/resources/numeracy/bar-charts and http://blog.visual.ly/how-groups-stack-up-when-to-use-grouped-vs-stacked-column-charts/ [last accessed: 2014-03-04] barplot(height, ...) set argument "beside = TRUE" http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/barplot.html grouped bar chart A gamma distribution is a general type of continuous statistical distribution (related to the beta distribution) that arises naturally in processes for which the waiting times between Poisson-distributed events are relevant. Gamma distributions have two free parameters: a shape, denoted k, and a scale, denoted theta.
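The shape/scale parameterisation of the gamma distribution named above can be sketched with scipy (the parameter values are invented; scipy calls the shape `a`):

```python
# Sketch: gamma distribution with shape k and scale theta, whose mean
# is k * theta and whose variance is k * theta**2.
from scipy.stats import gamma

k, theta = 2.0, 3.0
dist = gamma(a=k, scale=theta)

mean = dist.mean()       # k * theta
variance = dist.var()    # k * theta**2
```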
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://mathworld.wolfram.com/GammaDistribution.html dgamma(x, shape, rate = 1, scale = 1/rate, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/GammaDist.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html#scipy.stats.gamma Gamma distribution 2 true 2 polychoric correlation coefficient is a correlation coefficient which is computed over 2 variables to characterise an association by proxy with 2 (latent) variables which are assumed to be continuous and normally distributed. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://www.rasch.org/rmt/rmt193c.htm and http://en.wikipedia.org/wiki/Polychoric_correlation http://cran.r-project.org/web/packages/polycor/ polychor(x, y, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999) polychoric correlation coefficient 1 a full factorial design is a factorial design which ensures that all possible factor level combinations are defined and used so all between-group differences can be explored Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra full factorial design permutation numbering is a data transformation which counts the number of possible permutations of elements in a set of size n, each element occurring exactly once. This number is n factorial (n!). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO permutation numbering The Michaelis constant is the substrate concentration at which the reaction rate is at half-maximum, and is an inverse measure of the substrate's affinity for the enzyme: a small Km indicates high affinity, meaning that the rate will approach the maximum rate (Vmax) more quickly. The value of Km is dependent on both the enzyme and the substrate, as well as conditions such as temperature and pH.
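The half-maximum property stated above can be checked directly from the Michaelis-Menten rate law, v = Vmax * S / (Km + S); the parameter values are invented:

```python
# Sketch: Michaelis-Menten rate law. At substrate concentration
# S = Km the rate is exactly half of Vmax.
def michaelis_menten_rate(substrate, v_max, km):
    return v_max * substrate / (km + substrate)

v_max, km = 10.0, 2.5
rate_at_km = michaelis_menten_rate(km, v_max, km)  # equals Vmax / 2
```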
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: http://en.wikipedia.org/wiki/Michaelis–Menten_constant last accessed: 22-11-2013 half maximal reaction rate substrate concentration (Km) Michaelis-Menten constant A population of two parents and a child. possibly submit to 'Population and Community Ontology' Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra child-parent trio parent-child trio parents-child trio child-parents trio receiver operational characteristics curve is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold (aka cut-off point) is varied, by plotting sensitivity vs (1 − specificity) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Receiver_operating_characteristic roc.from.table(table, graph = TRUE, add = FALSE, title = FALSE, line.col = "red", auc.coords = NULL, ...) http://rss.acs.unt.edu/Rdoc/library/epicalc/html/roc.html receiver operational characteristics curve 2 The transmission disequilibrium test is a statistical test for genetic linkage between a genetic marker and a trait in families. The test is robust to population structure. TODO: need to modify restrictions to include family and trio Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra TDT STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Transmission_disequilibrium_test), polled in June 2013 transmission disequilibrium test The binomial distribution is a discrete probability distribution which describes the probability of k successes in n draws with replacement from a finite population of size N. The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N.
The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q=1-p). Notation: B(N,p). The mean is N*p and the variance is N*p*q. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Binomial_distribution dbinom(x, size, prob, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Binomial.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom binomial distribution hit selection is a planned process which, in screening processes such as high-throughput screening, leads to the identification of perturbing agents which cause the typical signal generated by a standardized assay to differ significantly from the negative control. The selection itself results from meeting or exceeding a selection threshold (for instance, 6 sigma from the mean, or an SSMD value beyond 5 when compared to positive controls or below -5 when compared to negative controls). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra AGB, PRS adapted from: http://en.wikipedia.org/wiki/SSMD adapted from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789971/ hit selection TODO pairing rule is a rule which specifies the criteria for deciding how to associate any 2 entities. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO pairing rule between group comparison statistical test is a statistical test which aims to detect differences between the means computed for each of the study group populations Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra between group comparison statistical test 2 true The Pearson's correlation coefficient is a correlation coefficient which evaluates two continuous variables for association strength in a data sample. It assumes that both variables are normally distributed and that linearity exists.
The coefficient is calculated by dividing their covariance by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Pearson product-moment correlation coefficient Pearson's r r statistics STATO, adapted from http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient and from: http://stamash.org/pearsons-correlation-coefficient/ http://stamash.org/kendalls-tau-correlation/ cor(x, y = NULL, use = "everything", method = c("pearson")) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html scipy.stats.pearsonr(x, y) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2427 Pearson's correlation coefficient F statistic is a statistic computed from observations and used to produce a p-value in a statistical test when compared to an F distribution. The F statistic is the ratio of two scaled sums of squares reflecting different sources of variability Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO F F-statistic negative binomial probability distribution is a discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number of failures (denoted r) occurs. The negative binomial distribution, also known as the Pascal distribution or Pólya distribution, gives the probability of r-1 successes and x failures in x+r-1 trials, and success on the (x+r)th trial.
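The negative binomial distribution described above can be sketched with scipy; note that scipy's `nbinom` uses the mirror convention (failures before a fixed number of successes), and the parameter values below are invented:

```python
# Sketch: negative binomial via scipy.stats.nbinom, which counts
# failures before n successes (the mirror of the entry's convention,
# obtained by swapping the roles of success and failure).
from scipy.stats import nbinom

n_successes, p = 5, 0.5
dist = nbinom(n_successes, p)

mean = dist.mean()  # n * (1 - p) / p
# The pmf over a wide range of counts sums to (approximately) 1.
total_prob = sum(dist.pmf(k) for k in range(200))
```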
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Pascal distribution Pólya distribution http://mathworld.wolfram.com/NegativeBinomialDistribution.html dnbinom(x, size, prob, mu, log = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/NegBinomial.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.nbinom.html#scipy.stats.nbinom negative binomial distribution Breusch-Pagan test is a statistical test which computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Breusch, T. S. and Pagan, A. R. (1979) A simple test for heteroscedasticity and random coefficient variation. Econometrica 47, 1287--1294. and adapted from: http://www.inside-r.org/packages/cran/car/docs/ncvTest last accessed [2014-03-15] http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/lmtest/html/bptest.html bptest(formula, varformula = NULL, studentize = TRUE, data = list()) or http://www.inside-r.org/packages/cran/car/docs/ncvTest Breusch-Pagan test http://www.ncbi.nlm.nih.gov/pubmed/?term=17182697 Bioinformatics. 2007 Feb 15;23(4):401-7. Enrichment or depletion of a GO category within a class of genes: which test? Rivals I, Personnaz L, Taing L, Potier MC. hypergeometric test is a null hypothesis test which evaluates whether a random variable follows a hypergeometric distribution. It is a test of goodness of fit to that distribution. The test is suited to situations assessing sampling from a finite set without replacement.
For instance, testing for enrichment or depletion of elements (e.g. GO categories, genes) Added following a term request by Chris Mungall: https://github.com/ISA-tools/stato/issues/6 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE) lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x]. http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Hypergeometric.html hypergeometric test 0 a one-tailed test is a statistical test which, assuming an unskewed probability distribution, allocates all of the significance level to evaluate only one hypothesis to explain a difference. The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction. A one-tailed test should be preceded by a two-tailed test in order to avoid missing an alternate effect explaining an observed difference. Added following a term request by Chris Mungall: https://github.com/ISA-tools/stato/issues/6 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra one sided test adapted from: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm one tailed test 0 For example, we may wish to compare the mean of a sample to a given value x using a t-test. Our null hypothesis is that the mean is equal to x. A two-tailed test will test both whether the mean is significantly greater than x and whether the mean is significantly less than x. The mean is considered significantly different from x if the test statistic is in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.
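The worked one-sample t-test example above can be run directly; the sample values and hypothesised mean are invented, and the `alternative` argument assumes a reasonably recent scipy (>= 1.6):

```python
# Sketch: one-sample t-test of the null hypothesis that the mean
# equals x, two-tailed and one-tailed.
from scipy.stats import ttest_1samp

sample = [5.1, 4.9, 5.3, 5.2, 5.0, 5.4, 5.1, 5.2]
x = 5.0  # hypothesised mean

t_two, p_two = ttest_1samp(sample, popmean=x)                         # two-tailed
t_one, p_one = ttest_1samp(sample, popmean=x, alternative='greater')  # one-tailed

# For a positive t statistic the one-tailed p-value is half the
# two-tailed one, reflecting how the significance level is allocated.
```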
a two tailed test is a statistical test which assesses the null hypothesis of absence of difference, assuming a symmetric (not skewed) underlying probability distribution, by allocating half of the selected significance level to each of the directions of change which could explain a difference (for example, a difference can be an excess or a loss). Added following a term request by Chris Mungall: https://github.com/ISA-tools/stato/issues/6 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra two sided test adapted from: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm two tailed test A null hypothesis which states that no difference exists between the 2 or more groups being considered. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO absence of difference hypothesis Let's consider an experiment evaluating 2 compounds (aspirin & ibuprofen) at 3 distinct dose levels (low, medium, high) and 4 time points post exposure (0h, 6h, 12h, 24h). Assuming the treatments are applied only once (no replication), the number of observations in a full factorial design is 2 x 3 x 4 = 24, so the design matrix would have 24 rows and 3 columns (1 per factor, i.e. independent variable). a design matrix is an information content entity which denotes a study design. The design matrix is an n by m matrix where n, the number of rows, corresponds to the number of observations (4 rows if quadruplicates) and where m, the number of columns, corresponds to the number of independent variables. Each element in the matrix corresponds to a discretized value representing one of the factor levels for a given factor. A design matrix can be used as input to statistical modeling or statistical analysis. The design matrix contains data on the independent variables (also called explanatory variables) in statistical models which attempt to explain observed data on a response variable (often called a dependent variable) in terms of the explanatory variables.
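The counting in the worked design-matrix example above can be checked with a quick sketch; the factor names are taken from the example, and the Cartesian product enumerates the 24 rows:

```python
# Sketch: full factorial crossing of 2 compounds x 3 doses x
# 4 time points gives 24 rows, one column per factor.
from itertools import product

compounds = ['aspirin', 'ibuprofen']
doses = ['low', 'medium', 'high']
timepoints = ['0h', '6h', '12h', '24h']

design = list(product(compounds, doses, timepoints))
n_rows, n_cols = len(design), len(design[0])
```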
The theory relating to such models makes substantial use of matrix manipulations involving the design matrix: see for example linear regression. A notable feature of the concept of a design matrix is that it is able to represent a number of different experimental designs and statistical models, e.g., ANOVA, ANCOVA, and linear regression Added following a term request by Nolan Nichols: https://github.com/ISA-tools/stato/issues/9 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra model matrix adapted from: Design of Experiments: Principles and Applications edited by Lennart Eriksson, 1999-2008 Umetrics. ISBN-13:978-91-973730-4-3 and http://en.wikipedia.org/wiki/Design_matrix [last accessed: 22-05-2014] model.matrix(object, data = environment(object), contrasts.arg = NULL, xlev = NULL, ...) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/model.matrix.html design matrix A contrast is the weighted sum of group means, the c_j coefficients represent the assigned weights of the means (these must sum to 0 for orthogonal contrasts) Term request by Nolan Nichols via https://github.com/ISA-tools/stato/issues/9 Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://en.wikipedia.org/wiki/Contrast_%28statistics%29 contrasts(x, contrasts = TRUE, sparse = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrasts.html contrast a quantile is a data item which corresponds to specific elements x in the range of a variate X. the k-th n-tile P_k is that value of x, say x_k, which corresponds to a cumulative frequency of Nk/n (Kenney and Keeping 1962). If n=4, the quantity is called a quartile, and if n=100, it is called a percentile. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Evans, M.; Hastings, N.; and Peacock, B. Statistical Distributions, 3rd ed. New York: Wiley, 2000. 
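The description of a contrast as a weighted sum of group means, with coefficients summing to zero, can be sketched as follows (a hypothetical helper for illustration, not R's contrasts() function):

```python
def contrast(group_means, weights):
    # a contrast is sum(c_j * mean_j); the weights must sum to zero
    assert abs(sum(weights)) < 1e-12, "contrast weights must sum to zero"
    return sum(w, ) if False else sum(w * m for w, m in zip(weights, group_means))

# compare group 1 against the average of groups 2 and 3
value = contrast([10.0, 12.0, 14.0], [1.0, -0.5, -0.5])
```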
http://mathworld.wolfram.com/Quantile.html quantile a decile is a quantile where n=10 and which splits data into sections of 10% of the data each, so the first decile delineates 10% of the data, the second decile delineates 20% of the data and the ninth decile, 90% of the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO decile a percentile is a quantile which splits data into sections of 1% of the data each, so the first percentile delineates 1% of the data, the second percentile delineates 2% of the data and the 99th percentile, 99% of the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO percentile absence of negative difference hypothesis is a hypothesis which assumes that a difference significantly less than a threshold does not exist. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO absence of negative difference hypothesis absence of positive difference hypothesis is a hypothesis which assumes that a difference significantly greater than a threshold does not exist. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO absence of positive difference hypothesis absence of enrichment hypothesis is a hypothesis which assumes that the representation of an element significantly greater than a threshold does not exist. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra absence of over representation hypothesis STATO absence of enrichment hypothesis absence of depletion hypothesis is a hypothesis which assumes that the representation of an element significantly less than a threshold does not exist.
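The n-tile definition above can be illustrated with a small Python sketch; note that several interpolation conventions exist for sample quantiles (R alone documents nine types), and this sketch assumes the linear-interpolation convention:

```python
def quantile(data, p):
    # linear-interpolation convention; other conventions give
    # slightly different answers on small samples
    xs = sorted(data)
    h = (len(xs) - 1) * p          # fractional position in sorted data
    lo = int(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

data = list(range(1, 11))             # 1..10
first_decile = quantile(data, 0.10)   # the 10% point, also the 10th percentile
median = quantile(data, 0.50)         # the 5th decile / 50th percentile
```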
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra absence of under representation hypothesis STATO absence of depletion hypothesis a binomial test is a statistical hypothesis test which evaluates the observations made about a Bernoulli experiment, that is, it tests the statistical significance of deviations from a theoretically expected distribution (the binomial distribution) of observations into 2 categories. It is a goodness of fit test. Added following a term request by Chris Mungall: https://github.com/ISA-tools/stato/issues/6 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://en.wikipedia.org/wiki/Binomial_test binomial test binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/binom.test.html scipy.stats.binom_test(x, n=None, p=0.5) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1605 exact binomial test Evaluation of statistical inference on empirical resting state fMRI. IEEE Trans Biomed Eng. 2014 Apr;61(4):1091-9. doi: 10.1109/TBME.2013.2294013. http://www.ncbi.nlm.nih.gov/pubmed/24658234 Statistical inference is the process of deducing properties of an underlying probability distribution by analysis of data. Added following a term request by Nolan Nichols: https://github.com/ISA-tools/stato/issues/12 Definition changed according to discussions in https://github.com/ISA-tools/stato/issues/55 Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra The definition cited from Wikipedia is from: Upton, G., Cook, I. (2008) Oxford Dictionary of Statistics, OUP. ISBN 978-0-19-954145-4 https://en.wikipedia.org/wiki/Statistical_inference statistical inference A ratio where the numerator and denominator are expressed in the same unit.
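The exact binomial test quoted above (binom.test, scipy.stats.binom_test) can be sketched in pure Python; this assumes the two-sided convention of summing all outcomes no more probable than the observed one, and the helper names are illustrative:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) for X ~ Binomial(n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_test_two_sided(x, n, p=0.5):
    # two-sided exact p-value: sum the probabilities of all outcomes
    # no more likely than the observed one
    px = binom_pmf(x, n, p)
    total = 0.0
    for k in range(n + 1):
        pk = binom_pmf(k, n, p)
        if pk <= px * (1 + 1e-9):   # small tolerance for floating-point ties
            total += pk
    return total
```

For 7 successes in 10 fair-coin trials this gives 0.34375, matching R's binom.test(7, 10).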
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO dimensionless ratio dimensionless ratio 2 2 The covariance is a measurement data item about the strength of correlation between a set (2 or more) of random variables. The covariance is obtained by forming: cov(X,Y)=E([X-E(X)][Y-E(Y)]) where E(X), E(Y) is the expected value (mean) of variable X and Y respectively. covariance is symmetric so cov(X,Y)=cov(Y,X). The covariance is useful when looking at the variance of the sum of the 2 random variables since: var(X+Y) = var(X) + var(Y) + 2cov(X,Y) The covariance cov(x,y) is used to obtain the coefficient of correlation cor(x,y) by normalizing (dividing) cov(x,y) by the product of the standard deviations of x and y. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://mathworld.wolfram.com/Covariance.html covariance cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")) from: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html covariance one sample t-test is a kind of Student's t-test which evaluates if a given sample can be reasonably assumed to be taken from the population. The test compares the sample statistic (m) to the population parameter (M). The one sample t-test is the small sample analog of the z test, which is suitable for large samples. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from various sources, including: Practical Statistics for Medical Research by D.Altman. ISBN: 0-412-27630-5 http://www.psychology.emory.edu/clinical/bliwise/Tutorials/TOM/meanstests/tone.htm one sample t-test t.test(x = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
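The covariance formula and the identity var(X+Y) = var(X) + var(Y) + 2cov(X,Y) above can be checked numerically with a small sketch (this uses the population divisor n; R's cov() uses the sample divisor n-1):

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    # population covariance: E[(X - E[X]) (Y - E[Y])]
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]
s = [a + b for a, b in zip(x, y)]
lhs = cov(s, s)                              # var(X + Y)
rhs = cov(x, x) + cov(y, y) + 2 * cov(x, y)  # var(X) + var(Y) + 2 cov(X, Y)
```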
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html scipy.stats.ttest_1samp(a, popmean, axis=0) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html#scipy.stats.ttest_1samp source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3194 one sample t-test 1 true 2 two sample t-test is a null hypothesis statistical test which is used to reject or accept the hypothesis of absence of difference between the means over 2 randomly sampled populations. It uses a t-distribution for the test and assumes that the variables in the population are normally distributed, with equal variances. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra two sample t-test adapted from: http://en.wikipedia.org/wiki/Student's_t-test#Independent_.28unpaired.29_samples and from: http://www.psychology.emory.edu/clinical/bliwise/Tutorials/TOM/meanstests/tind.htm t-test for independent means assuming equal variance t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = TRUE, conf.level = 0.95, ...) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html scipy.stats.ttest_ind(a, b, axis=0, equal_var=True) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3271 two sample t-test with equal variance 2 1 true 2 Welch t-test is a two sample t-test used when the variances of the 2 populations/samples are thought to be unequal (homoskedasticity hypothesis not verified). In this version of the two-sample t-test, the denominator used to form the t-statistic does not rely on a 'pooled variance' estimate. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Welsh t-test Welch, B. L. (1947). "The generalization of "Student's" problem when several different population variances are involved". Biometrika 34 (1–2): 28–35.
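The pooled-variance t statistic underlying the two sample t-test above can be sketched as follows (statistic and degrees of freedom only; obtaining the p-value would additionally require the t distribution; function names are illustrative):

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def svar(xs):
    # unbiased sample variance (divisor n - 1)
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def two_sample_t_equal_var(x, y):
    # pooled-variance t statistic; degrees of freedom = n1 + n2 - 2
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * svar(x) + (n2 - 1) * svar(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```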
doi:10.1093/biomet/34.1-2.28 adapted from wikipedia: http://en.wikipedia.org/wiki/Welch's_t_test last accessed: 2014-05-06 t-test for independent means assuming unequal variance t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html scipy.stats.ttest_ind(a, b, axis=0, equal_var=False) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind source: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3271 two sample t-test with unequal variance A Helmert contrast is a contrast in which the coefficients for the Helmert regressors compare each level with the average of the “preceding” ones Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.clayford.net/statistics/tag/helmert-contrasts/ An R and S-Plus Companion to Applied Regression. John Fox ISBN-13: 978-0761922803 contr.helmert(n, contrasts = TRUE, sparse = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html Helmert contrast a polynomial contrast is a contrast whose coefficients follow orthogonal polynomial trends (linear, quadratic, cubic, and so on) across the ordered levels of a factor Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra contr.poly(n, scores = 1:n, contrasts = TRUE, sparse = FALSE) from: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html polynomial contrast treatment contrast is a contrast which allows one to test how linear model coefficients of categorical variables are interpreted in the case where the “first” level (aka, the baseline) is included in the intercept and all subsequent levels have a coefficient that represents their difference from the baseline.
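The Helmert contrast description above ('compare each level with the average of the preceding ones') corresponds, in R's contr.helmert coding, to the matrix sketched here in Python (helmert_contrasts is an invented name):

```python
def helmert_contrasts(n):
    # mirrors R's contr.helmert(n): an n x (n-1) matrix whose j-th column
    # compares level j+1 with the average of levels 1..j
    cols = []
    for j in range(1, n):
        col = [-1] * j + [j] + [0] * (n - j - 1)
        cols.append(col)
    # return as rows (factor levels) x columns (contrasts)
    return [[cols[c][r] for c in range(n - 1)] for r in range(n)]
```

helmert_contrasts(4) reproduces the 4 x 3 matrix R prints for contr.helmert(4); each column sums to zero, as every contrast must.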
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from multiple sources: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_idd_genlin_emmeans.htm http://www.clayford.net/statistics/tag/helmert-contrasts/ http://www.aliquote.org/articles/tech/contrasts.html contr.treatment(n, base = 1, contrasts = TRUE, sparse = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html treatment contrast the sum contrast is a contrast in which each coefficient compares the corresponding level of the factor to the average of the other levels Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.clayford.net/statistics/tag/helmert-contrasts/ An R and S-Plus Companion to Applied Regression. John Fox ISBN-13: 978-0761922803 contr.sum(n, contrasts = TRUE, sparse = FALSE) http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html sum contrast 2 Pearson's Chi-Squared test for goodness of fit is a statistical null hypothesis test which is used to evaluate the goodness of fit of a dataset to a theoretical distribution, by comparing the test statistic to a Chi-Squared distribution Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Chi2 test for goodness of fit adapted from: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html and http://en.wikipedia.org/wiki/Pearson's_chi-squared_test http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html chisq.test(x = NULL, correct = FALSE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000) Pearson's Chi square test of goodness of fit 1 2 2 Barnard's test is an exact statistical test used to determine if there are nonrandom associations between two categorical variables. It was developed in 1949 by Barnard and is a test which is, in most cases, more powerful than the Fisher exact test duplicate with OBI_0200176.
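The goodness-of-fit statistic behind Pearson's test above can be sketched in a few lines (statistic only; comparing it to the Chi-Squared distribution with k-1 degrees of freedom yields the p-value; the data are a made-up example):

```python
def chi_square_statistic(observed, expected):
    # X^2 = sum over categories of (O_i - E_i)^2 / E_i
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# a die rolled 60 times: under the uniform null, expect 10 per face
observed = [8, 9, 12, 11, 10, 10]
x2 = chi_square_statistic(observed, [10] * 6)   # df = 6 - 1 = 5
```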
so either MIREOT and add metadata and axioms or move from OBI Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Barnard's_test and G A Barnard (1945) "A New Test for 2X2 Tables", Nature, 156, 177 & 783. Barnard's test barnardw.test(n1, n2, n3, n4, dp = 0.001, verbose = FALSE) from http://www.inside-r.org/packages/cran/Barnard/docs/barnardw.test Barnard's test a central composite design is a study design which contains an imbedded factorial or fractional factorial design with center points that is augmented with a group of so-called 'star points' that allow estimation of curvature. A CCD design with k factors has 2k star points. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.itl.nist.gov/div898/handbook/pri/section3/pri3361.htm Box-Wilson Central Composite Design cd(basis, generators, blocks = "Block", n0 = 4, alpha = "orthogonal", wbreps = 1, bbreps = 1, randomize = TRUE, inscribed = FALSE, coding) http://artax.karlin.mff.cuni.cz/r-help/library/rsm/html/ccd.html central composite design The Box-Behnken design is an independent quadratic design in that it does not contain an embedded factorial or fractional factorial design. In this design the treatment combinations are at the midpoints of edges of the process space and at the center. These designs are rotatable (or near rotatable) and require 3 levels of each factor. The designs have limited capability for orthogonal blocking compared to the central composite designs. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.itl.nist.gov/div898/handbook/pri/section3/pri3362.htm bbd(k, n0 = 4, block = (k == 4 | k == 5), randomize = TRUE, coding) from: http://artax.karlin.mff.cuni.cz/r-help/library/rsm/html/bbd.html Box–Behnkens design Plackett-Burman design is a type of study design optimizing multifactorial experiments characterized by their parsimony and economy with the run number a multiple of 4 (rather than a power of 2). 
Plackett-Burman design is often used for screening experiments where the main effect is often heavily confounded with two-factor interactions. This type of design is very useful for economically detecting large main effects, assuming all interactions are negligible when compared with the few important main effects. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.itl.nist.gov/div898/handbook/pri/section3/pri335.htm pb(nruns, nfactors = nruns - 1, factor.names = if (nfactors <= 50) Letters[1:nfactors] else paste("F", 1:nfactors, sep = ""), default.levels = c(-1, 1), ncenter=0, center.distribute=NULL, boxtyssedal = TRUE, n12.taguchi = FALSE, replications = 1, repeat.only = FALSE, randomize = TRUE, seed = NULL, oldver = FALSE, ...) from: http://www.inside-r.org/packages/cran/FrF2/docs/pb Plackett-Burman design upper confidence limit is a data item which is the largest value bounding a confidence interval Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO upper confidence limit lower confidence limit is a data item which is the lowest value bounding a confidence interval Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO lower confidence limit root-mean-square standardized effect is a data item which denotes effect size in the context of analysis of variance and corresponds to the square root of the arithmetic average of p standardized effects (effects normalized to be expressed in standard deviation units). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Ψ http://www.statpower.net/Steiger%20Biblio/Steiger04.pdf RMSSE root-mean-square standardized effect Eta-squared is a biased estimator of the variance explained by the model in the population (it estimates only the effect size in the sample). Eta-squared describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the r2.
This estimate shares with r2 the weakness that each additional variable will automatically increase the value of η2. In addition, it measures the variance explained of the sample, not the population, meaning that it will always overestimate the effect size, although the bias grows smaller as the sample grows larger. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Effect_size#Eta-squared.2C_.CE.B72 η2 eta-squared omega-squared is an effect size estimate for variance explained which is less biased than the eta-squared coefficient. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://en.wikipedia.org/wiki/Effect_size#Omega-squared.2C_.CF.892 ω2 omega-squared Hedges's g is an estimator of effect size, which is similar to Cohen's d and is a measure based on a standardized difference. However, the denominator, corresponding to a pooled standard deviation, is computed differently from Cohen's d coefficient, by applying a correction factor (which involves a Gamma function). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d and http://blog.stata.com/tag/cohens-d/ Hedges's g Glass's delta is an estimator of effect size which is similar to Cohen's d but where the denominator corresponds only to the standard deviation of the control group (or second group). It is considered less biased than Cohen's d for estimating effect sizes based on means and distances between means. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d and http://blog.stata.com/tag/cohens-d/ Glass's delta 0 Probability distribution estimated empirically on the data without assumptions on the shape of the probability distribution.
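The relationships among Cohen's d, Glass's delta and Hedges's g described above can be sketched as follows; note that the sketch uses the common approximation J ≈ 1 - 3/(4·df - 1) for the Hedges correction rather than the exact gamma-function form mentioned in the definition:

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def svar(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cohens_d(x, y):
    # standardized mean difference with a pooled standard deviation
    n1, n2 = len(x), len(y)
    sp = sqrt(((n1 - 1) * svar(x) + (n2 - 1) * svar(y)) / (n1 + n2 - 2))
    return (mean(x) - mean(y)) / sp

def glass_delta(treatment, control):
    # denominator uses the control group's standard deviation only
    return (mean(treatment) - mean(control)) / sqrt(svar(control))

def hedges_g(x, y):
    # small-sample correction of Cohen's d (approximate J factor)
    df = len(x) + len(y) - 2
    return cohens_d(x, y) * (1 - 3 / (4 * df - 1))
```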
Camille Maumet Karl Helmer Philippe Rocca-Serra Thomas Nichols Initially discussed at https://github.com/incf-nidash/nidm/pull/191 non-parametric distribution http://artax.karlin.mff.cuni.cz/r-help/library/nparcomp/html/weight.matrix.html a contrast weight is a coefficient which multiplies a group mean, part of a linear combination defining a contrast as a weighted sum of group means, giving a 'weight' to a specific group mean, hence the name. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols adapted from wikipedia: http://en.wikipedia.org/wiki/Contrast_%28statistics%29 contrast coefficient contrast weight [1,0,0] a contrast weight matrix is an information content entity which holds a set of contrast weights, the coefficients used in a weighted sum of means defining a contrast Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols STATO contrast weights contrast weight matrix contrast weight estimate is a model parameter estimate which results from computation on the data and is used as input to a model fitting process Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols STATO contrast weight estimate http://www.ncbi.nlm.nih.gov/pubmed/7791040 The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. As such, AIC provides a means for model selection. AIC is defined as: AIC = 2K - 2log(L) where K is the number of predictors and L is the maximized likelihood value. AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model. It is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. AIC does not provide a test of a model in the sense of testing a null hypothesis; i.e.
AIC can tell nothing about the quality of the model in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning of that. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Akaike_information_criterion and http://users.ecs.soton.ac.uk/jn2/teaching/aic.pdf AIC AIC(object, ..., k = 2) from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html Akaike information criterion http://www.ncbi.nlm.nih.gov/pubmed/19761098 corrected Akaike information criterion is a modified version of the Akaike information criterion. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra CAIC corrected Akaike information criterion http://www.ncbi.nlm.nih.gov/pubmed/7791040 Bayesian information criterion or Schwartz's Bayesian information criterion is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC). Given any two estimated models, the model with the lower value of BIC is the one to be preferred. The BIC is an increasing function of sigma_e^2 and an increasing function of k. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. Hence, lower BIC implies either fewer explanatory variables, better fit, or both. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics 6 (2): 461–464. doi:10.1214/aos/1176344136. http://en.wikipedia.org/wiki/Bayesian_information_criterion BIC SBIC Schwartz's Bayesian information criterion Bayesian information criterion 2 A statistical model selection is a data transformation which is based on computing a relative quality value in order to evaluate and select which model best explains data.
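The AIC formula quoted above, and the closely related BIC, can be sketched directly; the lower criterion value indicates the preferred model, and the helper names and log-likelihood values are illustrative:

```python
from math import log

def aic(log_likelihood, k):
    # AIC = 2k - 2 log L, with k parameters and maximized log-likelihood log L
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k log n - 2 log L; penalizes extra parameters more
    # heavily than AIC once the sample size n exceeds about 8
    return k * log(n) - 2 * log_likelihood

# comparing two hypothetical fitted models: the second fits slightly
# better but spends two extra parameters, so AIC prefers the first
m1 = aic(-100.0, 3)   # 206.0
m2 = aic(-98.5, 5)    # 207.0
```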
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO statistical model selection 0 0 Probability distribution which has no skew so its skewness=0 Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols STATO symmetric distribution Probability distribution estimated empirically from all acquired data Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols http://reference.wolfram.com/language/ref/EmpiricalDistribution.html empirical distribution Probability distribution estimated empirically on the data following a binning process Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols histogram distribution Probability distribution estimated using a smooth kernel function to avoid making assumptions about the distribution of the data. The kernel density estimator is the estimated probability density function (pdf) of the random variable. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols http://uk.mathworks.com/help/stats/kernel-distribution.html and http://reference.wolfram.com/language/ref/SmoothKernelDistribution.html smooth kernel distribution kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols https://en.wikipedia.org/wiki/Kernel_density_estimation https://reference.wolfram.com/language/ref/KernelMixtureDistribution.html kernel mixture distribution Mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. 
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols http://en.wikipedia.org/wiki/Mixture_distribution mixture distribution Probability distribution estimated empirically from a censored lifetime data Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Thomas Nichols http://reference.wolfram.com/language/ref/SurvivalDistribution.html survival distribution best linear unbiased prediction is a data transformation which predicts <TDB> under the assumption that the variable(s) under consideration have a random effect Philippe Rocca-Serra Henderson C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada. ftp://tech.obihiro.ac.jp/suzuki/Henderson.pdf BLUP best linear unbiased predictor of the random effect conditional mode of the random effect best linear unbiased predictor breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of genomic (SNP) observations, pedigree information and/or phenotypic observations. Philippe Rocca-Serra breeding value estimation breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of genomic (SNP) observations. Philippe Rocca-Serra breeding value estimation using genotype data breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of pedigree information. Philippe Rocca-Serra breeding value estimation using pedigree data breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of phenotypic observations. 
Philippe Rocca-Serra breeding value estimation using phenotypic data Philippe Rocca-Serra genomic selection objective a dataset which is made up of genotypic information, that is, presenting allele information at specific loci in a set of individuals of an organism. Philippe Rocca-Serra genotype data set a covariance structure is a data item which is part of a regression model and which indicates a pattern in the covariance matrix. The nature of the covariance structure is specified before the regression analysis and various covariance structures may be tested and evaluated using information criteria to help choose the most suitable model Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://www3.nd.edu/~kyuan/courses/sem/readpapers/benter.pdf covariance structure Given two sets of locations, computes the Matern cross covariance matrix for covariances among all pairings. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols Matern covariance function http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm https://www.rdocumentation.org/packages/fields/versions/2.3/topics/matern.cov matern.cov Matern function anisotropic covariance structure The rational quadratic covariance function is used in spatial statistics, geostatistics, machine learning, image analysis, and other fields where multivariate statistical analysis is conducted on metric spaces. It is commonly used to define the statistical covariance between measurements made at two points that are d units distant from each other. Since the covariance only depends on distances between points, it is stationary. If the distance is Euclidean distance, the rational quadratic covariance function is also isotropic.
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm https://en.wikipedia.org/wiki/Rational_quadratic_covariance_function http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corRatio.html rational quadratic anisotropic covariance structure spatial linear geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can be different in directions x and y, which in this case gives linear features. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols SP(LINGA) http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corLin.html spatial linear geometric anisotropic covariance structure spatial spherical geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can be different in directions x and y, which in this case gives spherical features. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm SP(SPHGA) http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corSpher.html spatial spherical geometric anisotropic covariance structure spatial gaussian geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can be different in directions x and y, which in this case gives gaussian features.
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols SP(GAUGA) http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corGaus.html spatial gaussian geometric anisotropic covariance structure spatial exponential geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can be different in directions x and y, which in this case gives exponential features. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols SP(EXPGA) http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm spatial exponential geometric anisotropic covariance structure spatial exponential anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can be different in directions x and y, which in this case gives exponential features. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols SP(EXPA)(c-list) http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm Sacks et al. (1989) http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corExp.html spatial exponential anisotropic covariance structure the banded heterogeneous Toeplitz covariance structure is a type of covariance structure which is often used to analyze and interpret repeated measures designs. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm TOEPH(q) banded heterogeneous Toeplitz covariance structure This covariance structure has heterogeneous variances and heterogeneous correlations between elements.
The correlation between adjacent elements is homogeneous across pairs of adjacent elements. The correlation between elements separated by a third element is again homogeneous, and so on. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols TOEPH http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm as well as: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/advanced/covariance_structures.html heterogeneous Toeplitz covariance structure A banded Toeplitz structure, defined by parameter q, can be viewed as a moving-average structure with order q-1. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm TOEP(q) banded Toeplitz covariance structure The Toeplitz covariance structure has homogeneous variances and heterogeneous correlations between elements. The correlation between adjacent elements is homogeneous across pairs of adjacent elements. The correlation between elements separated by a third element is again homogeneous, and so on.
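The Toeplitz and banded Toeplitz structures described above can be sketched by building a matrix whose entries depend only on the lag |i - j|; this is an illustrative construction with invented parameter names, not any package's API:

```python
def toeplitz_covariance(variance, correlations, size, q=None):
    # element (i, j) depends only on the lag |i - j|; correlations[lag - 1]
    # is the common correlation at that lag. With band width q, lags >= q
    # are zeroed out (the banded / moving-average-like structure of order q-1).
    mat = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            lag = abs(i - j)
            if lag == 0:
                mat[i][j] = variance
            elif lag <= len(correlations) and (q is None or lag < q):
                mat[i][j] = variance * correlations[lag - 1]
    return mat
```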
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/advanced/covariance_structures.html TOEP Toeplitz covariance structure a form of covariance structure used to provide a basis for analysis in the context of repeated measures datasets (longitudinal, time series) Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols HF http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm Huynh and Feldt 1970 Huynh-Feldt covariance structure factor-analytic structure is a covariance structure which is specified for q factors equal diagonal factor-analytic covariance structure is a type of factor analytic covariance structure specified for q factors, which includes a diagonal component for repeated measures. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm FA1(q) equal diagonal Factor Analytic covariance structure no diagonal factor-analytic covariance structure is a type of factor analytic covariance structure specified for q factors, which does not include a diagonal component for repeated measures. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm FA0(q) no diagonal Factor Analytic covariance structure factor-analytic structure is a type of heterogeneous covariance structure which is specified for q factors Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols adapted from: Heterogeneous Variance: Covariance Structures for Repeated Measures Russell D. Wolfinger.
Journal of Agricultural, Biological, and Environmental Statistics Vol. 1, No. 2 (Jun., 1996), pp. 205-230. https://doi.org/10.2307/1400366 and http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm Jennrich and Schluchter 1986 FA(q) Factor Analytic covariance structure compound symmetry covariance structure is a covariance structure in which all the variances are equal and all the covariances are equal. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm CS http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corCompSymm.html compound symmetry covariance structure heterogeneous compound symmetry structure is a compound symmetry covariance structure which has a different variance parameter for each diagonal element, and it uses the square roots of these parameters in the off-diagonal entries.
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm CSH heterogeneous compound symmetry covariance structure first order autoregressive moving average covariance structure is a type of covariance structure which is used in the context of time series analysis Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols ARMA(1,1) http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corARMA.html first order autoregressive moving average covariance structure first order autoregressive covariance structure is a covariance structure where correlations among errors decline exponentially with distance Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm AR(1) http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corAR1.html first order autoregressive covariance structure This is a homogeneous structure, i.e. the variance along the main diagonal is constant. The covariances decline exponentially. It has only 2 parameters. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm ARH(1) heterogeneous first-order autoregressive covariance structure Ante-dependence covariance structure is a covariance structure which specifies that the covariance between two time points is a function of the product of variances at both points (hence allowing heterogeneity of error variance across measures to affect the correlation) and the product of the correlations at the distances up to the one chosen.
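A minimal sketch (not from the ontology) of the AR(1), ARH(1), CS, and CSH structures described above, with illustrative parameter values:

```python
import numpy as np

n = 4
lag = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # |i - j|
rho, sigma2 = 0.5, 2.0

# AR(1): only 2 parameters; covariances decay exponentially with lag
ar1 = sigma2 * rho ** lag
# ARH(1): heterogeneous variances, same exponential decay in correlation
sig = np.array([1.0, 1.2, 1.5, 2.0])
arh1 = np.outer(sig, sig) * rho ** lag
# CS: equal variances, equal covariances off the diagonal
cs = sigma2 * np.where(lag == 0, 1.0, rho)
# CSH: per-element variance parameters, square roots in the off-diagonal
csh = np.outer(sig, sig) * np.where(lag == 0, 1.0, rho)
```

Note how in AR(1) the lag-2 covariance is the lag-1 covariance multiplied by rho again, whereas in CS all off-diagonal covariances are identical.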
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm ANTE(1) Ante-dependence covariance structure Mallows' Cp is a data item which compares the precision and bias of the full model to models with a subset of the predictors, thus helping to choose between multiple regression models. Mallows' Cp is a function of the number of parameters used in the model, relying on the residual sum of squares to compute a score; the smaller Cp is, the better the model fit. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/goodness-of-fit-statistics/what-is-mallows-cp/ http://ugrad.stat.ubc.ca/R/library/locfit/html/cp.html Mallows' Cp repeated measure analysis is a kind of data transformation which deals with signals measured in the same experimental units at different times and, possibly, under different conditions over a period of time. Data produced by longitudinal studies qualify for such analysis. Since measurements are made on the same experimental units a number of times, they are likely to be correlated. Repeated measure analysis usually takes into consideration the possibility of correlation with time. It does so by specifying covariance structure in the analysis Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from https://ciser.cornell.edu/sasdoc/saspdf/analyst/chap16.pdf repeated measure analysis repeated measure analysis the ordinary least squares estimation is a model parameter estimation for a linear regression model when the errors are uncorrelated and equal in variance. It is the Best Linear Unbiased Estimator (BLUE) under these assumptions, and the Uniformly Minimum-Variance Unbiased Estimator (UMVUE) with the addition of a Gaussian assumption.
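A hedged sketch of how Mallows' Cp can be computed from residual sums of squares, as described above; the simulated data and the helper names (`sse`, `mallows_cp`) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

def sse(Xd):
    """Residual sum of squares for a least-squares fit of y on design Xd."""
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

Xfull = np.column_stack([np.ones(n), X])        # full model, 4 parameters
s2 = sse(Xfull) / (n - Xfull.shape[1])          # residual mean square of the full model

def mallows_cp(cols):
    """Cp = SSE_p / s^2 - n + 2p for the subset model using the listed predictor columns."""
    Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    p = Xd.shape[1]
    return sse(Xd) / s2 - n + 2 * p
```

By construction Cp of the full model equals its parameter count, and the subset containing only the informative predictor scores far better than the intercept-only model.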
Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Tom Nichols http://en.wikipedia.org/wiki/Ordinary_least_squares and Tom Nichols OLS estimation https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lm.html ordinary least squares estimation the weighted least squares estimation is a model parameter estimation for a linear regression model with errors that are independent but have heterogeneous variance. Difficult to use in practice, as weights must be set based on the variance, which is usually unknown. If the true variance is known, it is the Best Linear Unbiased Estimator (BLUE) under these assumptions, and the Uniformly Minimum-Variance Unbiased Estimator (UMVUE) with the addition of a Gaussian assumption. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares and Tom Nichols WLS estimation https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lm.html weighted least squares estimation the generalized least squares estimation is a model parameter estimation for a linear regression model with errors that are dependent and (possibly) have heterogeneous variance. Difficult to use in practice, as the covariance matrix of the errors must be known to "whiten" the data and model. If the true covariance is known, it is the Best Linear Unbiased Estimator (BLUE) under these assumptions, and the Uniformly Minimum-Variance Unbiased Estimator (UMVUE) with the addition of a Gaussian assumption.
Philippe Rocca-Serra Tom Nichols http://en.wikipedia.org/wiki/Generalized_least_squares and Tom Nichols GLS estimation http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/gls.html generalized least squares estimation the iteratively reweighted least squares estimation is a model parameter estimation which is a practical implementation of Weighted Least Squares, where the heterogeneous variances of the errors are estimated from the residuals of the regression model, providing an estimate for the weights. Each successive estimate of the weights improves the estimation of the regression parameters, which in turn are used to compute residuals and update the weights. Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols Tom Nichols iteratively reweighted least squares estimation the feasible generalized least squares estimation is a model parameter estimation which is a practical implementation of Generalized Least Squares, where the covariance of the errors is estimated from the residuals of the regression model, providing the information needed to whiten the data and model. Each successive estimate of the whitening matrix improves the estimation of the regression parameters, which in turn are used to compute residuals and update the whitening matrix.
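The least-squares variants above can be sketched as follows; the data-generating model, the residual-based weight estimation step, and all variable names are illustrative assumptions (a full IRLS/FGLS implementation would iterate the reweighting step to convergence):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
sd = 0.5 + 0.3 * x                      # heteroscedastic noise: variance grows with x
y = 2.0 + 1.5 * x + rng.normal(scale=sd)

# OLS: unweighted normal equations (ignores the heteroscedasticity)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# WLS with known weights 1/variance (the usually-unavailable ideal case)
W = 1.0 / sd**2
beta_wls = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))

# Feasible step: estimate the error sd as a linear function of x from OLS residuals,
# then re-fit with the estimated weights (one pass of the iterative scheme)
a = np.linalg.solve(X.T @ X, X.T @ np.abs(y - X @ beta_ols))
sd_hat = np.maximum(X @ a, 1e-6)
W_hat = 1.0 / sd_hat**2
beta_fgls = np.linalg.solve(X.T @ (W_hat[:, None] * X), X.T @ (W_hat * y))
```

All three estimators are unbiased here; the weighted fits simply gain precision by down-weighting the noisier observations at large x.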
Alejandra Gonzalez-Beltran Camille Maumet Orlaith Burke Philippe Rocca-Serra Tom Nichols Tom Nichols feasible generalized least squares estimation used as an unbiased estimator of the variance for a regression model a residual mean square is a data item which is obtained by dividing the sum of squared residuals (SSR) by the number of degrees of freedom Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Thomas Nichols http://en.wikipedia.org/wiki/Mean_squared_error#Regression http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-statistics/understanding-mean-squares/ https://github.com/ISA-tools/stato/issues/35 MSE error mean square residual mean square Z-statistic is a statistic computed from observations and used to produce a p-value when compared to a Standard Normal Distribution in a statistical test called the Z-test. Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Thomas Nichols http://en.wikipedia.org/wiki/Z-test Z-statistic Deviance is an indicator of fit and can be estimated by computing -2 times the log-likelihood ratio of the fitted model compared to a saturated (full) model. It is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Deviance_%28statistics%29 deviance https://stat.ethz.ch/R-manual/R-devel/library/stats/html/deviance.html deviance http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682718/ The deviance information criterion (DIC) is a hierarchical modeling generalization of the AIC (Akaike information criterion) and BIC (Bayesian information criterion, also known as the Schwarz criterion). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation.
Like AIC and BIC it is an asymptotic approximation as the sample size becomes large. It is only valid when the posterior distribution is approximately multivariate normal. The deviance information criterion was published in 2002 by Spiegelhalter et al. Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde, 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, B, 64, 583-639. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://en.wikipedia.org/wiki/Deviance_information_criterion DIC http://artax.karlin.mff.cuni.cz/r-help/library/SpatialExtremes/html/DIC.html deviance information criterion The focused information criterion is a measurement data item which aims at facilitating model selection. It was published by Claeskens, G. and Hjort, N.L. (2003), "The focused information criterion". Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Journal of the American Statistical Association, volume 98, pp. 879–899. doi:10.1198/016214503000000819 FIC focused information criterion a data transformation that finds a contrast value (the contrast estimate) by computing the weighted sum of model parameter estimates using a set of contrast weights. Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Thomas Nichols https://github.com/ISA-tools/stato/pull/37 contrast estimation estimate of a contrast obtained by computing the weighted sum of model parameter estimates using a set of contrast weights. Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Thomas Nichols https://github.com/ISA-tools/stato/pull/37 contrast estimate an estimate of the standard deviation of a contrast estimate sampling distribution.
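A sketch of contrast estimation, the contrast estimate, and its standard error as defined above, for a simple cell-means design; all names and simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
group = np.repeat([0, 1, 2], n // 3)
# Cell-means design matrix: one indicator column per group
X = np.column_stack([group == g for g in range(3)]).astype(float)
y = np.array([1.0, 2.0, 4.0])[group] + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # parameter estimates (group means)
c = np.array([1.0, -1.0, 0.0])                 # contrast weights: group 0 minus group 1

estimate = c @ beta                            # contrast estimate: weighted sum of estimates
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])      # residual mean square
se = np.sqrt(sigma2 * c @ np.linalg.inv(X.T @ X) @ c)  # standard error of the contrast estimate
```

With these simulated group means the contrast estimate is close to -1, and `se` reflects both the residual variance and the design.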
Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Thomas Nichols https://github.com/ISA-tools/stato/pull/37 standard error of a contrast estimate A scree plot is a graphical display of the variance of each component in the dataset which is used to determine how many components should be retained in order to explain a high percentage of the variation in the data Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra http://www.stats.gla.ac.uk/glossary/?q=node/451 Cattell scree test plot screeplot(x, npcs = min(10, length(x$sdev)), type = c("barplot", "lines"), main = deparse(substitute(x)), ...) from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/screeplot.html scree plot A scatterplot matrix contains all the pairwise scatter plots of a set of variables on a single page in a matrix format. Alejandra Gonzalez-Beltran Philippe Rocca-Serra Adapted from http://itl.nist.gov/div898/handbook/eda/section3/eda33qb.htm scatterplot matrix The alpha distribution is a continuous probability distribution whose density function is as defined at: https://docs.scipy.org/doc/scipy-1.0.0/reference/tutorial/stats/continuous_alpha.html Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra https://docs.scipy.org/doc/scipy-1.0.0/reference/tutorial/stats/continuous_alpha.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.alpha.html#scipy.stats.alpha alpha distribution a power-law probability distribution is a probability distribution whose density function (or mass function in the discrete case) has the form p(x) = L(x) . x^{-alpha} where alpha is a parameter >1 and L(x) is a slowly varying function. 
adapted from wikipedia and wolfram alpha: https://en.wikipedia.org/wiki/Power_law#Power-law_probability_distributions last accessed: 2015-11-03 https://cran.r-project.org/web/packages/poweRlaw/index.html http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.powerlaw.html#scipy.stats.powerlaw power law distribution A regression model is a statistical model used in a type of analysis known as regression analysis, whereby a function is used to determine the relation between a response variable and an independent variable, with a set of unknown parameters. Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Regression_analysis#Regression_models last accessed: 2015-11-03 regression model The Pareto distribution is a continuous probability distribution, which is defined by the following probability density function (1) and distribution function (2) (1): P(x)=(ab^a)/(x^(a+1)) (2): D(x)=1-(b/x)^a defined over the interval x>=b. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://mathworld.wolfram.com/ParetoDistribution.html last accessed: 2015-11-04 http://artax.karlin.mff.cuni.cz/r-help/library/LaplacesDemon/html/dist.Pareto.html last accessed: 2015-11-04 Usage >dpareto(x, alpha, log=FALSE) >ppareto(q, alpha) >qpareto(p, alpha) >rpareto(n, alpha) Arguments x,q These are each a vector of quantiles. p This is a vector of probabilities. n This is the number of observations, which must be a positive integer that has length 1. alpha This is the shape parameter alpha, which must be positive. log Logical. If log=TRUE, then the logarithm of the density or result is returned. Pareto type-I probability distribution the Pareto type-II probability distribution is a continuous probability distribution which is defined by a probability density function characterized by 2 parameters, alpha and lambda, 2 real, strictly positive numbers. alpha is known as the shape parameter while lambda is known as the scale parameter.
the function defines the probability density of a continuous random variable according to the following: p(x) = {\alpha \over \lambda} \left[{1+ {x \over \lambda}}\right]^{-(\alpha+1)}, \qquad x \geq 0, Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra https://en.wikipedia.org/wiki/Lomax_distribution Lomax distribution http://www.inside-r.org/packages/cran/actuar/docs/Pareto dpareto(x, shape, scale, log = FALSE) Pareto type-II probability distribution The Pareto(III) distribution is a continuous probability distribution which is described with a cumulative distribution function of the following form: F(x) = 1 − [1 + ((x − mu)/sigma)^(1/gamma)]^(−1) for x > mu, sigma > 0, gamma > 0 and s = 1. mu is the location parameter, sigma is the scale parameter, gamma is the inequality parameter, and s is the shape parameter, fixed at 1. The Pareto III distribution corresponds to a Pareto Type IV distribution where the shape parameter has a value of 1. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Pareto_distribution#Pareto_types_I.E2.80.93IV last accessed: 2015-11-04 Pareto type-III probability distribution The Pareto(IV) distribution is a continuous probability distribution which is described with a cumulative distribution function of the following form: F(y) = 1 − [1 + ((y − a)/b)^(1/g)]^(−s) for y > a, b > 0, g > 0 and s > 0. a is the location parameter, b is the scale parameter, g is the inequality parameter, and s is the shape parameter The distribution is used in actuarial science, economics, finance and telecommunications, but not restricted to those fields.
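The Pareto type-I and type-II (Lomax) densities above can be written out directly; these helper functions are an illustrative sketch following the formulas given in the definitions:

```python
def pareto1_pdf(x, a, b):
    """Pareto type-I density P(x) = a*b^a / x^(a+1) for x >= b (a: shape, b: scale)."""
    return a * b**a / x**(a + 1) if x >= b else 0.0

def pareto1_cdf(x, a, b):
    """Pareto type-I distribution function D(x) = 1 - (b/x)^a for x >= b."""
    return 1.0 - (b / x)**a if x >= b else 0.0

def lomax_pdf(x, alpha, lam):
    """Pareto type-II (Lomax) density (alpha/lam) * (1 + x/lam)^-(alpha+1) for x >= 0."""
    return (alpha / lam) * (1.0 + x / lam) ** (-(alpha + 1))
```

Note the type-I support starts at the scale b, while the Lomax support starts at 0; the CDF at x = b is exactly 0, consistent with D(x) above.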
Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra https://cran.r-project.org/web/packages/VGAM/VGAM.pdf page 517 dparetoIV(x, location = 0, scale = 1, inequality = 1, shape = 1, log = FALSE) https://cran.r-project.org/web/packages/VGAM/VGAM.pdf Pareto type-IV probability distribution The geometric mean of two numbers, say 2 and 8, is just the square root of their product; that is sqrt(2 x 8)=4. The geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers \{x_i\}_{i=1}^N, the geometric mean is defined as \left(\prod_{i=1}^N x_i\right)^{1/N}. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: https://en.wikipedia.org/wiki/Mean#Geometric_mean_.28GM.29 https://en.wikipedia.org/wiki/Geometric_mean http://personality-project.org/r/html/geometric.mean.html Usage: >geometric.mean(x,na.rm=TRUE) Arguments: x , a vector or data.frame http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.gmean.html geometric mean The harmonic mean is a kind of mean which is calculated by dividing the total number of observations by the reciprocal of each number in a series. 
Harmonic Mean = N/(1/a1+1/a2+1/a3+1/a4+.......+1/aN) where a(i)= Individual score and N = Sample size (Number of scores) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from wikipedia and https://www.easycalculation.com/statistics/learn-harmonic-mean.php last accessed: 2015-11-04 https://en.wikipedia.org/wiki/Harmonic_mean https://en.wikipedia.org/wiki/Mean#Harmonic_mean_.28HM.29 http://personality-project.org/r/html/harmonic.mean.html Usage: > harmonic.mean(x,na.rm=TRUE) Arguments: x, a vector, matrix, or data.frame na.rm, na.rm=TRUE remove NA values before processing http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.hmean.html harmonic mean The weighted arithmetic mean is a measure of central tendency that is the sum of the products of each observed value and their respective non-negative weights, divided by the sum of the weights, such that the contribution of each observed value to the mean may differ according to its respective weight. It is defined by the formula: A = sum(vi*wi)/sum(wi), where 'i' ranges from 1 to n, 'vi' is the value of each observation, and 'wi' is the value of the respective weight for each observed value. The weighted arithmetic mean is a kind of mean similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points are weighted, meaning they contribute more than others. The weighted arithmetic mean is often used if one wants to combine average values from samples of the same population with different sample sizes.
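A minimal sketch of the harmonic mean and weighted arithmetic mean formulas above (the helper names are illustrative):

```python
def harmonic_mean(values):
    """N / (1/a_1 + 1/a_2 + ... + 1/a_N), as in the formula above."""
    return len(values) / sum(1.0 / a for a in values)

def weighted_mean(values, weights):
    """A = sum(v_i * w_i) / sum(w_i) with non-negative weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

For example, combining two sample means 80 and 90 from samples of sizes 20 and 30 gives weighted_mean([80, 90], [20, 30]) = 86.0, the pooled mean.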
Alejandra Gonzalez-Beltran Matthew Diller Orlaith Burke Philippe Rocca-Serra https://en.wikipedia.org/wiki/Weighted_arithmetic_mean https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html np.average(range(1,11), weights=range(10,0,-1)) https://github.com/ISA-tools/stato/issues/59 weighted arithmetic mean The interquartile mean (IQM) (or midmean) is a statistical measure of central tendency based on the truncated mean of the interquartile range. In the calculation of the IQM, only the data in the second and third quartiles is used (as in the interquartile range), and the lowest 25% and the highest 25% of the scores are discarded. These points are called the first and third quartiles, hence the name of the IQM. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra IQM https://en.wikipedia.org/wiki/Mean#Interquartile_mean interquartile mean The root mean square (abbreviated RMS or rms), also known as the quadratic mean, is a statistical measure of central tendency defined as the square root of the mean of the squares of a sample. (To find the root mean square of a set of numbers, square all the numbers in the set and then find the arithmetic mean of the squares. Take the square root of the result. This is the root mean square.) Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra RMS root mean square https://en.wikipedia.org/wiki/Root_mean_square and http://www.mathwords.com/r/root_mean_square.htm last accessed: 2015-11-04 quadratic mean the sample mean of a sample of size n is the arithmetic mean computed over the n observations of a statistical sample. The sample mean, denoted x̄ and read "x-bar," is simply the average of the n data points x1, x2, ..., xn: x̄ = (x1 + x2 + ⋯ + xn)/n = (1/n) ∑_{i=1}^{n} xi. The sample mean summarizes the "location" or "center" of the data.
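The interquartile mean and root mean square described above can be sketched as follows; for simplicity the quartile cut assumes the sample size is divisible by 4:

```python
def interquartile_mean(data):
    """Mean of the middle 50% of the sorted data (the second and third quartiles)."""
    xs = sorted(data)
    n = len(xs)
    middle = xs[n // 4 : 3 * n // 4]   # discard lowest 25% and highest 25%
    return sum(middle) / (n / 2)

def root_mean_square(data):
    """Square the values, average the squares, then take the square root."""
    return (sum(x * x for x in data) / len(data)) ** 0.5
```

For the 12-point sample [5, 8, 4, 38, 8, 6, 9, 7, 7, 3, 1, 6] the IQM is 6.5, noticeably less affected by the outlier 38 than the ordinary mean.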
the sample mean is a measure of the location of the observations made on the sample and provides an unbiased estimate of the population mean Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: https://onlinecourses.science.psu.edu/stat414/node/66 and http://mathworld.wolfram.com/SampleMean.html last accessed: 2015-11-05 sample mean the population mean or distribution mean is a parameter of a probability distribution or population indicative of the location of the data. For continuous probability distributions, the population mean is computed using the probability density function; for discrete probability distributions, the probability mass function is used instead. A population mean can be estimated by computing a sample mean Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://mathworld.wolfram.com/PopulationMean.html last accessed: 2015-11-05 population mean A covariance structure where no restrictions are made on the covariance between any pair of measurements. Alejandra Gonzalez-Beltran Camille Maumet Philippe Rocca-Serra Thomas Nichols http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm#statug.mixed.mixedcovstruct unstructured covariance structure 2 The Yuen's t-test is a two sample t-test for populations of unequal variance which provides a more robust t-test procedure under normal and long-tailed distributions. The test computes a t statistic using 'trimmed means' rather than arithmetic means, as well as winsorized variances. Philippe Rocca-Serra Yuen-Welch's t-test Biometrika (1974) 61 (1): 165-170 10.1093/biomet/61.1.165 http://finzi.psych.upenn.edu/library/DescTools/html/YuenTTest.html Yuen t-Test with trimmed means Fagan nomogram is a graph plotting pre-test probabilities, likelihood ratios and post-test probabilities on three parallel axes.
The plot was first proposed by Fagan in 1975 as a way to visualize Bayes' theorem, where P(D) is the probability that the patient has the disease before the test. P(D|T) is the probability that the patient has the disease after the test result. P(T|D) is the probability of the test result if the patient has the disease, and P(T|D̄) is the probability of the test result if the patient does not have the disease. With this terminology the usefulness of both positive and negative test results can be assessed. A line drawn from P(D) on the right through the ratio of P(T|D) to P(T|D̄) gives P(D|T) on the left of the nomogram. Philippe Rocca-Serra http://www.ncbi.nlm.nih.gov/pubmed/1143310 N Engl J Med 1975; 293:257, July 31, 1975 DOI: 10.1056/NEJM197507312930513 Fagan nomogram Two-Step Fagan Nomogram, which adds two extra axes beside the LR axis, representing sensitivity and specificity, to calculate negative and positive likelihood ratios in the same nomogram Philippe Rocca-Serra http://www.ncbi.nlm.nih.gov/pubmed/23468201 2 step Fagan nomogram the likelihood ratio is a ratio which is formed by dividing the post-test odds by the pre-test odds in the context of a Bayesian formulation Philippe Rocca-Serra likelihood ratio the likelihood ratio of negative results is a ratio which is formed by dividing the difference between 1 and the sensitivity of the test by the specificity value of a test. This can be expressed also as dividing the probability of a person who has the disease testing negative by the probability of a person who does not have the disease testing negative. Philippe Rocca-Serra likelihood ratio for negative results adapted from Wikipedia: https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing last accessed: May 2016 negative likelihood ratio the likelihood ratio of positive results is a ratio which is formed by dividing the sensitivity value of a test by the difference between 1 and the specificity of the test.
This can be expressed also as dividing the probability of the test giving a positive result when testing an affected subject by the probability of the test giving a positive result when a subject is not affected. Philippe Rocca-Serra likelihood ratio for positive results adapted from Wikipedia: https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing last accessed: May 2016 positive likelihood ratio prevalence is a ratio formed by the number of subjects diagnosed with a disease divided by the total population size. Philippe Rocca-Serra adapted from: https://www.health.ny.gov/diseases/chronic/basicstat.htm prevalence Incidence is the ratio of the number of new cases of a disease divided by the number of persons at risk for the disease. Philippe Rocca-Serra adapted from: https://www.health.ny.gov/diseases/chronic/basicstat.htm incidence mortality is a ratio formed by the number of deaths due to a disease divided by the total population size. Philippe Rocca-Serra adapted from: https://www.health.ny.gov/diseases/chronic/basicstat.htm mortality in the context of binary classification, accuracy is defined as the proportion of true results (both true positives and true negatives) to the total number of cases examined (the sum of true positive, true negative, false positive and false negative). It can be understood as a measure of the proximity of measurement results to the true value.
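The binary-classification quantities above (accuracy, precision, and the positive and negative likelihood ratios) can all be derived from a 2x2 confusion matrix; this helper function is an illustrative sketch, not part of the ontology:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Metrics from the counts of true/false positives and negatives."""
    sens = tp / (tp + fn)                 # sensitivity
    spec = tn / (tn + fp)                 # specificity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # true results over all cases
        "precision": tp / (tp + fp),      # positive predictive value
        "LR+": sens / (1.0 - spec),       # positive likelihood ratio
        "LR-": (1.0 - sens) / spec,       # negative likelihood ratio
    }
```

For example, with tp=90, fp=20, tn=80, fn=10 the sensitivity is 0.9 and the specificity 0.8, giving LR+ = 0.9/0.2 = 4.5 and LR- = 0.1/0.8 = 0.125.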
Philippe Rocca-Serra Rand accuracy Rand index adapted from wikipedia: https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification last accessed: May 2016 accuracy precision or positive predictive value is defined as the proportion of the true positives against all the positive results (both true positives and false positives) Philippe Rocca-Serra positive predictive value adapted from wikipedia: https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification last accessed: May 2016 precision The probability of a patient having the target disorder before a diagnostic test result is known Philippe Rocca-Serra http://www.cebm.net/pre-test-probability/ pretest probability a measure of heterogeneity in meta-analysis is a data item which aims to describe the variation in study outcomes between studies. Philippe Rocca-Serra adapted from http://www.statsdirect.com/help/default.htm#meta_analysis/heterogeneity.htm last accessed: May 2016 measure of heterogeneity The Cochran's Q statistic is a measure of heterogeneity across studies, computed by summing the squared deviations of each study's estimate from the overall meta-analytic estimate, weighting each study's contribution in the same manner as in the meta-analysis. Philippe Rocca-Serra Cochran WG. The combination of estimates from different experiments. Biometrics 1954;10: 101-29. https://doi.org/10.2307/3001666 http://www.inside-r.org/packages/cran/RVAideMemoire/docs/cochran.qtest Cochran's Q statistic The quantity I2 describes the percentage of total variation across studies that is due to heterogeneity rather than chance. I2 can be readily calculated from basic results obtained from a typical meta-analysis as I2 = 100%×(Q - df)/Q, where Q is Cochran's heterogeneity statistic and df the degrees of freedom. Negative values of I2 are put equal to zero so that I2 lies between 0% and 100%. A value of 0% indicates no observed heterogeneity, and larger values show increasing heterogeneity.
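Cochran's Q and I2 as defined above can be sketched with fixed-effect inverse-variance weights; the function name and inputs are illustrative:

```python
def cochran_q_and_i2(estimates, variances):
    """Q = sum w_i * (theta_i - theta_hat)^2 with w_i = 1/v_i and theta_hat the
    inverse-variance weighted pooled estimate; I2 = max(0, 100 * (Q - df) / Q)."""
    w = [1.0 / v for v in variances]
    theta_hat = sum(wi * t for wi, t in zip(w, estimates)) / sum(w)
    q = sum(wi * (t - theta_hat) ** 2 for wi, t in zip(w, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    return q, i2
```

Two studies with estimates 0 and 2 and unit variances give Q = 2 with df = 1, hence I2 = 50%; identical study estimates give Q = 0 and I2 = 0%, i.e. no observed heterogeneity.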
Unlike Cochran's Q, it does not inherently depend upon the number of studies considered. A confidence interval for I² is constructed using either i) the iterative non-central chi-squared distribution method of Hedges and Piggott (2001); or ii) the test-based method of Higgins and Thompson (2002). The non-central chi-square method is currently the method of choice (Higgins, personal communication, 2006) – it is computed if the 'exact' option is selected. Philippe Rocca-Serra I2 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC192859/ and http://www.statsdirect.com/help/default.htm#meta_analysis/heterogeneity.htm I-squared Tau-squared is an estimate of the between-study variance in a random-effects meta-analysis. The square root of this number (i.e. tau) is the estimated standard deviation of underlying effects across studies. Philippe Rocca-Serra http://handbook.cochrane.org/chapter_9/9_5_4_incorporating_heterogeneity_into_random_effects_models.htm http://www.inside-r.org/packages/cran/meta/docs/metacor Tau squared The L'Abbé plot was introduced in 1987 in the context of meta-analyses of clinical trials with dichotomous (binary) outcomes, as a plot of observed risks in the treatment group against observed risks in the control group. Another formulation is that it plots the event rate in the experimental (intervention) group against the event rate in the control group, as an aid to exploring the heterogeneity of effect estimates within a meta-analysis. It is a diagram used in meta-analysis that compares the risks observed in the experimental and control arms of clinical trials. Each trial is located in the space of a diagram where the sizes of the circles indicate the sizes of the trials. Trials in which the experimental treatment had a higher risk than the control will be in the upper left of the plot. If the risk in both groups is the same, the circle will fall on the line of equality.
If the control treatment has a higher risk than the experimental treatment then the point will be in the lower right of the plot. It is often used as an indicator of heterogeneity and hence as an indicator of the likelihood that results from different trials can be validly combined. Named after Kristin L'Abbé. Philippe Rocca-Serra 10.1002/jrsm.6 Graphical displays for meta-analysis: An overview with suggestions for practice http://www.ncbi.nlm.nih.gov/pubmed/3300460 and http://www.dictionarycentral.com/definition/l-abb-plot.html http://www.inside-r.org/packages/cran/meta/docs/labbe.metabin L'Abbe plot the proportion of individuals in a population with the outcome of interest Philippe Rocca-Serra adapted from: http://handbook.cochrane.org/chapter_9/9_2_2_4_measure_of_absolute_effect_the_risk_difference.htm observed risk The risk difference is the difference between the observed risks (proportions of individuals with the outcome of interest) in the two groups. The risk difference is straightforward to interpret: it describes the actual difference in the observed risk of events between experimental and control interventions. Alejandra Gonzalez-Beltran Philippe Rocca-Serra http://handbook.cochrane.org/chapter_9/9_2_2_4_measure_of_absolute_effect_the_risk_difference.htm risk difference Sidik-Jonkman estimator is a data item computed to estimate the heterogeneity parameter (the between-study variance) in a random-effects model for meta-analysis.
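The risk difference defined above is a direct subtraction of the two observed risks. A trivial Python sketch with invented counts:

```python
# Sketch: risk difference = observed risk (event proportion) in the
# experimental group minus observed risk in the control group.

def risk_difference(events_exp, n_exp, events_ctl, n_ctl):
    return events_exp / n_exp - events_ctl / n_ctl

# Illustrative counts: 15/100 events under treatment vs 25/100 under control.
rd = risk_difference(15, 100, 25, 100)  # -0.10, i.e. 10 fewer events per 100
```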
Philippe Rocca-Serra http://www.ncbi.nlm.nih.gov/pubmed/16955539 http://www.inside-r.org/packages/cran/meta/docs/metacor metacor(cor, n, studlab, data=NULL, subset=NULL, sm=.settings$smcor, level=.settings$level, level.comb=.settings$level.comb, comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random, hakn="FALSE", method.tau="SJ", tau.common=.settings$tau.common, prediction=.settings$prediction, level.predict=.settings$level.predict, method.bias=.settings$method.bias, backtransf=.settings$backtransf, title=.settings$title, complab=.settings$complab, outclab="", byvar, bylab, print.byvar=.settings$print.byvar, keepdata=.settings$keepdata ) Sidik-Jonkman estimator Hunter-Schmidt estimator is a data item computed to estimate the heterogeneity parameter (the between-study variance) in a random-effects model for meta-analysis. Philippe Rocca-Serra Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. by John E. Hunter, Frank L. Schmidt doi:10.2307/2289738 http://www.inside-r.org/packages/cran/meta/docs/metacor metacor(cor, n, studlab, data=NULL, subset=NULL, sm=.settings$smcor, level=.settings$level, level.comb=.settings$level.comb, comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random, hakn="FALSE", method.tau="HS", tau.common=.settings$tau.common, prediction=.settings$prediction, level.predict=.settings$level.predict, method.bias=.settings$method.bias, backtransf=.settings$backtransf, title=.settings$title, complab=.settings$complab, outclab="", byvar, bylab, print.byvar=.settings$print.byvar, keepdata=.settings$keepdata ) Hunter-Schmidt estimator restricted maximum likelihood estimation is a kind of maximum likelihood estimation data transformation which estimates the variance components of random effects in univariate and multivariate meta-analysis. In contrast to 'maximum likelihood estimation', REML can produce unbiased estimates of variance and covariance parameters.
Philippe Rocca-Serra https://doi.org/10.1093/biomet/58.3.545 REML reml(y, v, x, data, RE.constraints = NULL, RE.startvalues = 0.1, RE.lbound = 1e-10, intervals.type = c("z", "LB"), model.name="Variance component with REML", suppressWarnings = TRUE, silent = TRUE, run = TRUE, ...) https://www.rdocumentation.org/packages/metaSEM/versions/1.0.0/topics/reml restricted maximum likelihood estimation maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. MLE attempts to find the parameter values that maximize the likelihood function, given the observations. The method of maximum likelihood is based on the likelihood function L(θ; x). We are given a statistical model, i.e. a family of distributions {f(·; θ) | θ ∈ Θ}, where θ denotes the (possibly multi-dimensional) parameter for the model. The method of maximum likelihood finds the values of the model parameter θ that maximize the likelihood function L(θ; x). Philippe Rocca-Serra https://en.wikipedia.org/wiki/Maximum_likelihood_estimation http://stat.ethz.ch/R-manual/R-devel/library/stats4/html/mle.html maximum likelihood estimation DerSimonian-Laird estimator is a data item computed to estimate the heterogeneity parameter (the between-study variance) in a random-effects model for meta-analysis.
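The maximum-likelihood idea above can be made concrete with a model whose maximizer has a closed form. A Python sketch for an exponential model (data values invented; R's stats4::mle, linked above, handles the general numerical case):

```python
import math

# Sketch: MLE for an exponential model f(x; lam) = lam * exp(-lam * x).
# The log-likelihood is n*log(lam) - lam*sum(x), which is maximized in
# closed form at lam_hat = n / sum(x).

def log_likelihood(lam, data):
    return len(data) * math.log(lam) - lam * sum(data)

data = [0.5, 1.2, 0.8, 2.0, 1.5]      # illustrative observations
lam_hat = len(data) / sum(data)       # closed-form maximizer
```

Evaluating `log_likelihood` at `lam_hat` and at nearby parameter values confirms that the closed-form estimate dominates.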
The estimator is used in a simple noniterative procedure for characterizing the distribution of treatment effects in a series of studies. Philippe Rocca-Serra doi:10.1016/j.cct.2006.04.004 http://www.ncbi.nlm.nih.gov/pubmed/3802833 doi:10.1016/0197-2456(86)90046-2 http://www.inside-r.org/packages/cran/meta/docs/metacor metacor(cor, n, studlab, data=NULL, subset=NULL, sm=.settings$smcor, level=.settings$level, level.comb=.settings$level.comb, comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random, hakn="FALSE", method.tau="DL", tau.common=.settings$tau.common, prediction=.settings$prediction, level.predict=.settings$level.predict, method.bias=.settings$method.bias, backtransf=.settings$backtransf, title=.settings$title, complab=.settings$complab, outclab="", byvar, bylab, print.byvar=.settings$print.byvar, keepdata=.settings$keepdata ) DerSimonian-Laird estimator a random effect meta analysis procedure defined by Hartung and Knapp and by Sidik and Jonkman which performs better than the DerSimonian and Laird approach, especially when there is heterogeneity and the number of studies in the meta-analysis is small. Philippe Rocca-Serra HKSJ method doi:10.1186/1471-2288-14-25 http://www.inside-r.org/packages/cran/meta/docs/metacor metacor(cor, n, studlab, data=NULL, subset=NULL, sm=.settings$smcor, level=.settings$level, level.comb=.settings$level.comb, comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random, hakn="TRUE", method.tau="HS", tau.common=.settings$tau.common, prediction=.settings$prediction, level.predict=.settings$level.predict, method.bias=.settings$method.bias, backtransf=.settings$backtransf, title=.settings$title, complab=.settings$complab, outclab="", byvar, bylab, print.byvar=.settings$print.byvar, keepdata=.settings$keepdata ) meta analysis by Hartung-Knapp-Sidik-Jonkman method a meta analysis which relies on the computation of the DerSimonian and Laird estimator as a measure of heterogeneity over a set of studies.
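The DerSimonian-Laird estimate of the between-study variance has a standard noniterative closed form built on Cochran's Q. A Python sketch with invented inputs (R's meta package, cited above with method.tau="DL", is the reference implementation):

```python
# Sketch: the DerSimonian-Laird estimate of the between-study variance tau^2:
#   tau2 = max(0, (Q - (k - 1)) / C),  C = sum(w) - sum(w^2) / sum(w),
# with inverse-variance weights w_i = 1 / v_i and Q Cochran's statistic.

def dersimonian_laird_tau2(effects, variances):
    w = [1.0 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)

# Illustrative per-study effects and variances:
tau2 = dersimonian_laird_tau2([0.30, 0.45, 0.10, 0.60], [0.01, 0.02, 0.015, 0.025])
```

The truncation at zero mirrors the convention noted for I2: a negative moment estimate is reported as no between-study variance.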
Philippe Rocca-Serra http://www.inside-r.org/packages/cran/meta/docs/metacor metacor(cor, n, studlab, data=NULL, subset=NULL, sm=.settings$smcor, level=.settings$level, level.comb=.settings$level.comb, comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random, hakn="FALSE", method.tau="DL", tau.common=.settings$tau.common, prediction=.settings$prediction, level.predict=.settings$level.predict, method.bias=.settings$method.bias, backtransf=.settings$backtransf, title=.settings$title, complab=.settings$complab, outclab="", byvar, bylab, print.byvar=.settings$print.byvar, keepdata=.settings$keepdata ) meta analysis by DerSimonian and Laird method a meta analysis which relies on the computation of the Hunter and Schmidt estimator as a measure of heterogeneity over a set of studies by considering the weighted mean of the raw correlation coefficient. Hunter and Schmidt developed what is commonly termed validity generalization procedures (Schmidt and Hunter, 1977). These involve correcting the effect sizes in the meta-analysis for sampling error, measurement error, and range restriction. Philippe Rocca-Serra Hunter JE, Schmidt FL. Methods of Meta-analysis: correcting error and bias in research findings. Newbury Park, CA: Sage 1990. http://www.inside-r.org/packages/cran/meta/docs/metacor metacor(cor, n, studlab, data=NULL, subset=NULL, sm=.settings$smcor, level=.settings$level, level.comb=.settings$level.comb, comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random, hakn="FALSE", method.tau="HS", tau.common=.settings$tau.common, prediction=.settings$prediction, level.predict=.settings$level.predict, method.bias=.settings$method.bias, backtransf=.settings$backtransf, title=.settings$title, complab=.settings$complab, outclab="", byvar, bylab, print.byvar=.settings$print.byvar, keepdata=.settings$keepdata ) meta analysis by Hunter-Schmidt method McNemar's test is a statistical test used on paired nominal data.
It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium Philippe Rocca-Serra McNemar's Chi-squared Test for Count Data test of the marginal homogeneity of a contingency table within-subjects chi-squared test adapted from Wikipedia: https://en.wikipedia.org/wiki/McNemar%27s_test last accessed: May 2016 https://www.ncbi.nlm.nih.gov/pubmed/20254758 mcnemar.test(x, y = NULL, correct = TRUE) from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mcnemar.test.html McNemar test Cochran's Q test is a statistical test used for unreplicated randomized block design experiments with a binary response variable and paired data. In the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran's Q test is a non-parametric statistical test to verify whether k treatments have identical effects. Philippe Rocca-Serra adapted from: http://www.inside-r.org/packages/cran/CVST/docs/cochranq.test and https://en.wikipedia.org/wiki/Cochran%27s_Q_test last accessed: May 2016 cochran.qtest(formula, data, alpha = 0.05, p.method = "fdr") from: http://www.inside-r.org/packages/cran/RVAideMemoire/docs/cochran.qtest Cochran's q test for heterogeneity a probability distribution scale parameter is a measure of variation which is set by the operator when selecting a parametric probability distribution and which defines how spread the distribution is. The larger the value of the scale parameter is, the more spread out the distribution. 
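For the 2 × 2 paired table described above, McNemar's statistic depends only on the two discordant-pair counts b and c. A Python sketch, including the continuity correction that R's mcnemar.test (linked above) applies by default; counts are invented:

```python
import math

# Sketch: McNemar's test for a 2x2 paired table. With discordant counts
# b and c, the statistic is (|b - c| - 1)^2 / (b + c) with continuity
# correction, or (b - c)^2 / (b + c) without. The p-value uses the
# chi-squared distribution with 1 df via the identity sf(x) = erfc(sqrt(x/2)).

def mcnemar(b, c, correct=True):
    num = (abs(b - c) - 1) ** 2 if correct else (b - c) ** 2
    stat = num / (b + c)
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

stat, p = mcnemar(b=10, c=25)  # illustrative discordant-pair counts
```

With b = 10 and c = 25 the corrected statistic is 5.6, giving p below 0.05, so the marginal homogeneity hypothesis would be rejected at the usual level.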
user request: https://github.com/ISA-tools/stato/issues/47 Philippe Rocca-Serra adapted from Wikipedia: https://en.wikipedia.org/wiki/Scale_parameter. last accessed: 2016/11/11 scale statistical dispersion probability distribution scale parameter a probability distribution shape parameter is a data item which is set by the operator when selecting a parametric probability distribution and which determines the shape (profile) of the distribution plot, but not its location or scale. user request: https://github.com/ISA-tools/stato/issues/47 Alejandra Gonzalez-Beltran Philippe Rocca-Serra adapted from Wikipedia: https://en.wikipedia.org/wiki/Shape_parameter last accessed: 2016-11-11 shape http://stat.ethz.ch/R-manual/R-patched/library/stats/html/GammaDist.html probability distribution shape parameter a scale estimator is a measurement datum (a statistic) which is calculated to approach the actual scale parameter of a probability distribution from observed data. user request: https://github.com/ISA-tools/stato/issues/47 Philippe Rocca-Serra adapted from Wikipedia: https://en.wikipedia.org/wiki/Scale_parameter last accessed: 2016/11/11 https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/gam.scale.html scale estimator a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Likewise, if Y has a normal distribution, then X = exp(Y) has a log-normal distribution. A random variable which is log-normally distributed takes only positive real values. The distribution is occasionally referred to as the Galton distribution or Galton's distribution, after Francis Galton.
user request: https://github.com/ISA-tools/stato/issues/47 Alejandra Gonzalez-Beltran Philippe Rocca-Serra adapted from Wikipedia: https://en.wikipedia.org/wiki/Log-normal_distribution last accessed: 2016/11/11 dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE) plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE) qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE) rlnorm(n, meanlog = 0, sdlog = 1) https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Lognormal.html log normal distribution outlier detection testing objective is a statistical objective of a data transformation which aims to test a null hypothesis that an observation is not an outlier. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO outlier detection testing objective outlier detection testing objective Dixon test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population. The Dixon test is a statistical test intended to identify aberrant observations (outliers) in a data set associated with a univariate random variable whose underlying distribution is assumed to be normal. Philippe Rocca-Serra Dixon test Q test Robert B. Dean and Wilfrid J. Dixon (1951) "Simplified Statistics for Small Numbers of Observations". Anal. Chem., 1951, 23 (4), 636–638. adapted from Wikipedia: https://en.wikipedia.org/wiki/Dixon%27s_Q_test last accessed: 2016-11-19 dixon.outliers(data) from: http://finzi.psych.upenn.edu/library/referenceIntervals/html/dixon.outliers.html Dixon Q test 1 Grubbs' test is a statistical test used to detect one outlier in a univariate data set assumed to come from a normally distributed population. The Grubbs test is a statistical test intended to identify one (and only one) aberrant observation in a data set associated with a univariate random variable whose underlying distribution is assumed to be normal.
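The defining property of the log-normal distribution quoted above (Y = ln(X) is normal) can be checked empirically. A Python sketch using the standard library's generator, with arbitrary parameters (the R functions dlnorm/rlnorm listed above are the usual tools):

```python
import math
import random

# Sketch: if X ~ LogNormal(mu, sigma), then Y = ln(X) ~ Normal(mu, sigma).
# We draw samples and verify that the log-transformed sample mean and
# variance match mu and sigma^2.

random.seed(42)
mu, sigma = 1.0, 0.5  # illustrative parameters
ys = [math.log(random.lognormvariate(mu, sigma)) for _ in range(20000)]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
```

The sample mean of the logs lands near mu = 1.0 and their variance near sigma² = 0.25, and every draw of X itself is strictly positive, as the definition requires.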
Philippe Rocca-Serra maximum normed residual test adapted from Wikipedia: https://en.wikipedia.org/wiki/Grubbs'_test_for_outliers last accessed: 2016-11-19 rgrubbs.test(x, alpha = 0.05) from: http://finzi.psych.upenn.edu/library/OutlierDM/html/rgrubbs.test.html Grubbs' test Tietjen-Moore test for outliers is a statistical test used to detect outliers and corresponds to a generalization of Grubbs' test, thus allowing detection of more than one outlier in a univariate data set assumed to come from a normally distributed population. If testing for a single outlier, the Tietjen-Moore test is equivalent to Grubbs' test. The Tietjen-Moore test is a statistical test intended to identify aberrant observations (outliers) in a data set associated with a univariate random variable whose underlying distribution is assumed to be normal. This test is a generalization of the Grubbs test in that it allows testing for the presence of more than one single outlier. Philippe Rocca-Serra Tietjen-Moore test adapted from NIST: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h2.htm last accessed: 2016-11-19 FindOutliersTietjenMooreTest(dataSeries,k,alpha=0.05) from: https://rdrr.io/rforge/climtrends/man/findOutliers.Tietjen.Moore.test.html Tietjen-Moore test for outliers The Extreme Studentized Deviate Test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population. The ESD test differs from Grubbs' test and the Tietjen-Moore test in that it contains a built-in correction for multiple testing. Philippe Rocca-Serra ESD test for outliers generalized ESD test for outliers Rosner, Bernard (May 1983), Percentage Points for a Generalized ESD Many-Outlier Procedure, Technometrics, 25(2), pp. 165-172.
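Grubbs' statistic itself is simple to compute: the largest absolute deviation from the sample mean, in units of the sample standard deviation. A Python sketch computing only the statistic (the critical value it is compared against comes from the t distribution; see the references above, or the rgrubbs.test R function cited for a full test):

```python
import math

# Sketch: the Grubbs (maximum normed residual) statistic
#   G = max_i |x_i - mean| / s
# where s is the sample standard deviation. The full test compares G
# against a t-distribution-based critical value, not computed here.

def grubbs_statistic(xs):
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return max(abs(x - mean) for x in xs) / s

# Illustrative data with 30.0 as an obvious outlier:
g = grubbs_statistic([9.8, 10.1, 10.0, 9.9, 10.2, 30.0])
```

For this sample G is close to its theoretical maximum of (n − 1)/√n ≈ 2.04, which any reasonable critical value at n = 6 would flag.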
STATO adapted from NIST: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm last accessed: 2016-11-19 rgrubbs.test(x, alpha = 0.05) from: http://finzi.psych.upenn.edu/library/OutlierDM/html/rgrubbs.test.html generalized extreme studentized deviate test 1 2 a split-plot design is a kind of factorial design which is used when running a full factorial completely randomized design is impractical, either for cost or practical reasons (e.g. equipment, fields); in other words, when a restricted randomization has to be applied. A split-plot design is used whenever practitioners fix the level of a 'hard to change' factor and run all the combinations of the other factors. The hard to change factor is also referred to as the 'whole plot' factor, while the remaining factors are referred to as 'split plot' factors. Performing a split-plot design therefore means fixing one factor level, and then applying the treatments formed by the cartesian products of the levels for the other factors. A minimum of 2 factors is required, with one being applied before the other(s). Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from: http://www.minitab.com/uploadedFiles/Content/News/Published_Articles/recognize_split_plot_experiment.pdf adapted from wikipedia: https://en.wikipedia.org/wiki/Restricted_randomization last accessed: 14.12.2016 https://pdfs.semanticscholar.org/bb4b/d979610388c76bb81568f14a886304ce4662.pdf split-plot design 2 3 a split split plot design is a study design where restricted randomization affects 2 study factors (and not 1 as in a split-plot design). Such a design is only possible if at least 3 independent variables are present. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra adapted from https://onlinecourses.science.psu.edu/stat503/node/72 last accessed 2016/12/15 split split plot design Restricted randomization is a kind of randomization which is used, or occurs, when hard-to-change factors exist in a study design.
In other words, when complete randomization is not possible, a case of restricted randomization exists, for instance in the case of a split-plot design. Restricted randomization allows intuitively poor allocations of treatments to experimental units to be avoided, while retaining the theoretical benefits of randomization. Restricted randomization can also result from an unplanned event and is then something that should be avoided. The RandomizeR R package can be used to detect such events and assess the quality of the randomization process. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra Adapted from Wikipedia: https://en.wikipedia.org/wiki/Restricted_randomization last accessed: 2016/12/15 restricted randomization a 'whole plot number' is a data item used to count and identify the actual piece of land (in the case of real field based trials) used in a split plot design experiment and receiving treatments corresponding to the levels of a factor whose randomization is restricted (these factors are known as 'hard to change' factors). In the case of non-field based trials, the 'whole plot' is a metaphor. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDFbigbook-R/R-part012.pdf last accessed: 2016/12/15 whole plot number a 'sub plot number' is a data item used to count and identify the actual piece of land located within a 'whole plot', in the case of real field based trials using a split-plot design, and receiving completely randomized treatments corresponding to the factor level combinations of the remaining factors declared in the experiment. In the case of a 'split-split plot design', sub-plots also receive treatments corresponding to a factor whose randomization is restricted. In such a configuration, each 'sub-plot' is itself divided into 'sub sub-plots', which then receive the remainder of the treatments in completely randomized fashion.
In the case of non-field based trials, the notion of a 'sub-plot' is a metaphor. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDFbigbook-R/R-part012.pdf last accessed: 2016/12/15 sub-plot number a 'sub sub-plot number' is a data item used to count and identify the actual piece of land located within a 'sub plot', in the case of real field based trials using a split-split-plot design, and receiving completely randomized treatments corresponding to the factor level combinations of the remaining factors declared in the experiment. In the case of a 'split-split plot design', sub-plots also receive treatments corresponding to a factor whose randomization is restricted. In such a configuration, each 'sub-plot' is itself divided into 'sub sub-plots', which then receive the remainder of the treatments in completely randomized fashion. In the case of non-field based trials, the notion of a 'sub sub-plot' is a metaphor. Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO adapted from http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDFbigbook-R/R-part012.pdf last accessed: 2016/12/15 sub sub-plot number "Wilks' lambda is a test statistic used in multivariate analysis of variance (MANOVA) to test whether there are differences between the means of identified groups of subjects on a combination of dependent variables." Alejandra Gonzalez-Beltran Philippe Rocca-Serra http://www.blackwellpublishing.com/specialarticles/jcn_9_381.pdf https://stat.ethz.ch/R-manual/R-devel/library/stats/html/summary.manova.html ## S3 method for class 'manova' summary(object, test = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy"), intercept = FALSE, tol = 1e-7, ...)
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.f_value_wilks_lambda.html#scipy.stats.mstats.f_value_wilks_lambda scipy.stats.mstats.f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) Wilks' lambda test "Pillai proposed the trace test for the following three tests: (a) equality of mean vectors of l p‐variate normal distributions with the common but unknown covariance matrix, (b) independence between two sets of variates distributed jointly as a normal distribution with unknown mean vector, and (c) equality of covariance matrices of two p‐variate normal distributions with unknown mean vectors." Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://doi.org/10.1002/0470011815.b2a13067 Pillai's trace test "The Lawley–Hotelling trace is used to test the equality of mean vectors of k p‐variate normal distributions with common but unknown covariance matrix. The explicit form of the null distribution of T₀² is the F distribution. The asymptotic null distribution is the chi‐square distribution. The power function of the test is described and its power is compared with the likelihood ratio test. 
" Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://doi.org/10.1002/0470011815.b2a13035 Lawley–Hotelling Trace Hotelling-Lawley Trace test "Roy's maximum root test finds the maximum characteristic root or eigenvalue statistic for testing equality of k p-variate normal distributions with same covariance matrix, independence between two sets of variables jointly distributed as a normal distribution, equality of covariance matrices of two p-variate normal distributions, whether the covariance matrix of a p-variabte normal distribution with unknown mean vector equals a specified matrix" Alejandra Gonzalez-Beltran Philippe Rocca-Serra Oxford Dictionary of Statistical Terms https://onlinecourses.science.psu.edu/stat505/node/163 Roy’s Maximum Root test "The multivariate analysis of variance, or MANOVA, is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is typically followed by significance tests involving individual dependent variables separately. It helps to answer: 1. Do changes in the independent variable(s) have significant effects on the dependent variables? 2. What are the relationships among the dependent variables? 3. What are the relationships among the independent variables?" Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA https://stat.ethz.ch/R-manual/R-devel/library/stats/html/manova.html multivariate analysis of variance In Bayesian statistics context, a credible interval is an interval of a posterior distribution which is such that the density at any point inside the interval is greater than the density at any point outside and that the area under the curve for that interval is equal to a prespecified probability level. For any probability level there is generally only one such interval, which is also often known as the highest posterior density region. 
Unlike the usual confidence interval associated with frequentist inference, here the intervals specify the range within which parameters lie with a certain probability. It is the Bayesian counterpart of the confidence interval used in frequentist statistics. Philippe Rocca-Serra Bayesian credibility interval Adapted from Wikipedia: https://en.wikipedia.org/wiki/Credible_interval and from the Cambridge Dictionary of Statistics, fourth edition, ISBN-13 978-0-511-78827-7 last accessed: 2017-07-01 HPD region of highest posterior density credible interval In the Bayesian statistics context, a 95% credible interval is a credible interval which, given the data, includes the true parameter with a probability of 95%. Philippe Rocca-Serra Bayesian credibility interval at 95% Wikipedia: https://en.wikipedia.org/wiki/Credible_interval last accessed: 2017-07-01 95% credible interval "In clinical trials, it gives you an idea of how much difference there is between the averages of the experimental group and control groups." "The mean difference, or difference in means, measures the absolute difference between the mean value in two different groups." Alejandra Gonzalez-Beltran Philippe Rocca-Serra http://www.statisticshowto.com/mean-difference/ MD difference in means mean difference In the Bayesian statistics context, a 99% credible interval is a credible interval which, given the data, includes the true parameter with a probability of 99%. Philippe Rocca-Serra Bayesian credible interval at 99% Wikipedia: https://en.wikipedia.org/wiki/Credible_interval last accessed: 2017-07-01 99% credible interval group sequential design is a study design used in clinical trial settings in which interim analyses of the data are conducted after groups of patients are recruited. After each interim analysis, the trial may stop early if the evidence so far shows the new treatment is particularly effective or ineffective. Such designs are ethical and cost-effective, and so are of great interest in practice.
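The 95% credible interval defined above is often reported, in practice, as an equal-tailed interval of posterior draws. A Python sketch with a stand-in normal posterior (the equal-tailed interval coincides with the highest-posterior-density interval only for symmetric unimodal posteriors):

```python
import random

# Sketch: an equal-tailed 95% credible interval from posterior samples --
# the interval between the 2.5th and 97.5th sample percentiles.

def credible_interval(samples, level=0.95):
    s = sorted(samples)
    tail = (1.0 - level) / 2.0
    lo = s[int(tail * len(s))]
    hi = s[int((1.0 - tail) * len(s)) - 1]
    return lo, hi

# Stand-in posterior: Normal(2.0, 1.0) draws standing in for MCMC output.
random.seed(0)
posterior = [random.gauss(2.0, 1.0) for _ in range(10000)]
lo, hi = credible_interval(posterior)
```

Given the data (here, the simulated posterior), the parameter lies in (lo, hi) with 95% posterior probability, which is exactly the direct probability reading the text contrasts with the frequentist confidence interval.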
Philippe Rocca-Serra adapted from https://www.jstatsoft.org/article/view/v066i02/v66i02.pdf https://cran.r-project.org/web/packages/gsDesign/index.html group sequential design interim analysis is a data transformation used to analyze studies implementing a group-sequential design and to evaluate and interpret the accumulating information during a clinical trial. It refers to analysis of data conducted before full data collection has been completed. Clinical trials are unusual in that enrollment of patients is a continual process staggered in time. This means that if a treatment is particularly beneficial or harmful compared to the concurrent placebo group while the study is on-going, the investigators are ethically obliged to assess that difference using the data at hand and to make a deliberate consideration of terminating the study earlier than planned. Philippe Rocca-Serra adapted from https://onlinecourses.science.psu.edu/stat509/node/75 and from wikipedia: https://en.wikipedia.org/wiki/Interim_analysis last accessed: 2017-10-9 interim analysis the O'Brien-Fleming boundary analysis is a kind of interim-analysis method introduced by O'Brien and Fleming. As with all frequentist methods of the same type, it focuses on controlling the type I error rate, as the repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial. O'Brien-Fleming boundary analysis https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856440/ The Pocock boundary analysis gives a p-value threshold for each interim analysis which guides the data monitoring committee on whether to stop the trial. The boundary used depends on the number of interim analyses. The Pocock boundary is simple to use in that the p-value threshold is the same at each interim analysis. The disadvantages are that the number of interim analyses must be fixed at the start and it is not possible under this scheme to add analyses after the trial has started.
Another disadvantage is that investigators and readers frequently do not understand how the p-values are reported: for example, if there are five interim analyses planned, but the trial is stopped after the third interim analysis because the p-value was 0.01, then the overall p-value for the trial is still reported as <0.05 and not as 0.01. As with all frequentist methods of the same type, it focuses on controlling the type I error rate, as the repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial. Philippe Rocca-Serra https://doi.org/10.1093%2Fbiomet%2F64.2.191 and Wikipedia: https://en.wikipedia.org/wiki/Pocock_boundary Pocock boundary analysis The Haybittle–Peto boundary analysis is an interim analysis where a rule for deciding when to stop a clinical trial prematurely is defined. It is named for John Haybittle and Richard Peto. The Haybittle–Peto boundary is one such stopping rule, and it states that if an interim analysis shows a probability equal to, or less than, 0.001 that a difference as extreme or more extreme between the treatments is found, given that the null hypothesis is true, then the trial should be stopped early. The final analysis is still evaluated at the normal level of significance (usually 0.05). The main advantage of the Haybittle–Peto boundary is that the same threshold is used at every interim analysis, unlike the O'Brien–Fleming boundary, which changes at every analysis. Also, using the Haybittle–Peto boundary means that the final analysis is performed using a 0.05 level of significance as normal, which makes it easier for investigators and readers to understand. The main argument against the Haybittle–Peto boundary is that some investigators believe that it is too conservative and makes it too difficult to stop a trial.
As with all frequentist methods of the same type, it focuses on controlling the type I error rate, as the repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial. Philippe Rocca-Serra 10.1259/0007-1285-44-526-793 and adapted from Wikipedia: https://en.wikipedia.org/wiki/Haybittle%E2%80%93Peto_boundary Haybittle-Peto boundary analysis A linear mixed model is a mixed model containing both fixed effects and random effects and in which factors and covariates are assumed to have a linear relationship to the dependent variable. These models are useful in a wide variety of disciplines in the physical, biological and social sciences. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study), or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA. Fixed-effects factors are generally considered to be the variables whose values of interest are all represented in the data file. Random-effects factors are variables whose values correspond to unwanted variation. They are useful when trying to understand variability in the dependent variable which was not anticipated and exceeds what was expected. Linear mixed models also allow one to specify interactions between factors and to evaluate the various linear effects that particular combinations of factor levels may have on a response variable. Finally, linear mixed models allow one to specify variance components in order to describe the relation between various random-effect levels.
Hanna Cwiek Pawel Krajewski Philippe Rocca-Serra LMM adapted from Wikipedia: https://en.wikipedia.org/wiki/Mixed_model linear mixed model An empirical measure is a random measure arising from a particular realization of a (usually finite) sequence of random variables. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Empirical_measure empirical measure A model term is a data item set in a statistical model formula to apportion a source of variation. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra STATO statistical model term model term The model random effect term is a model term which aims to account for the unwanted variability in the data associated with a range of independent variables which are not the primary interest in the dataset. It is therefore also known as the variance component of the model. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra variance component model random effect term A model fixed effect term is a model term which accounts for variation explained by an independent variable and its levels. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra model fixed effect term A model interaction effect term is a model term which accounts for variation explained by the combined effects of the factor levels of more than one (usually 2) independent variables. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra model interaction effect term A model error term is a model term which accounts for residual variation not explained by the other components (fixed and random effect terms). Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra model error term A statistic estimator is a data item which is computed from a dataset to provide an approximate value (an estimate) for a 'statistical parameter' (a characteristic/parameter of the true underlying distribution) of a real population.
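The variance-component idea behind the model random effect term can be made concrete with a one-way random-effects layout. The following is a minimal numpy sketch (synthetic data, method-of-moments estimators; not the output of any mixed-model package): the between-group and within-group mean squares are combined to recover the random-effect variance component and the error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
a, n = 50, 10                        # 50 groups, 10 observations per group
sigma2_group, sigma2_error = 4.0, 1.0
group_effects = rng.normal(0, np.sqrt(sigma2_group), size=(a, 1))
y = 10 + group_effects + rng.normal(0, np.sqrt(sigma2_error), size=(a, n))

group_means = y.mean(axis=1)
msb = n * group_means.var(ddof=1)    # between-group mean square
msw = y.var(axis=1, ddof=1).mean()   # within-group mean square (pooled)

var_error = msw                      # estimates the error-term variance
var_group = (msb - msw) / n          # estimates the random-effect variance component
```

The estimates should land near the simulated values (4.0 and 1.0), which is exactly the "apportioning a source of variation" role the model terms above describe.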
Hanna Cwiek Tom Nichols Philippe Rocca-Serra STATO statistic estimator An estimate of the number of degrees of freedom. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra STATO degree of freedom approximation The Kenward-Roger method's fundamental idea is to calculate the approximate mean and variance of their statistic and then match moments with an F distribution to obtain the denominator degrees of freedom. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra https://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_glimmix_details40.htm https://www.jstatsoft.org/article/view/v059i09 https://www.ncbi.nlm.nih.gov/pubmed/9333350 Kenward-Roger https://www.rdocumentation.org/packages/lmerTest/versions/2.0-36/topics/anova-methods library(lme4) library(pbkrtest) fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy) get_Lb_ddf(fm1, lme4::fixef(fm1)) Kenward-Roger degree of freedom approximation Satterthwaite degree of freedom approximation is a type of degree of freedom approximation which is used to estimate an “effective degrees of freedom” for a probability distribution formed from several independent normal distributions where only estimates of the variance are known. It was originally developed by statistician Franklin E. Satterthwaite. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra satterthwaite Satterthwaite, F. E.
(1946), "An Approximate Distribution of Estimates of Variance Components.", Biometrics Bulletin, 2: 110–114, doi:10.2307/3002019 Welch-Satterthwaite https://www.rdocumentation.org/packages/metRology/versions/0.9-23-2/topics/welch.satterthwaite Satterthwaite degree of freedom approximation A data transformation to determine the number of degrees of freedom. Alejandra Gonzalez-Beltran Hanna Cwiek Philippe Rocca-Serra between-within https://www.ncbi.nlm.nih.gov/pubmed/25899170 between-within denominator degrees of freedom approximation RR-BLUP is a data transformation used in the context of estimating breeding value using a Bayesian ridge regression. It can be obtained from the Bayes B procedure by setting the π parameter to zero and assuming that all the markers have the same variance. term request by Guillaume Bauchet, cassavabase.org, Cornell University Philippe Rocca-Serra Comparison Between Linear and Non-parametric Regression Models for Genome-Enabled Prediction in Wheat Paulino Pérez-Rodríguez, Daniel Gianola, Juan Manuel González-Camacho, José Crossa, Yann Manès and Susanne Dreisigacker G3: GENES, GENOMES, GENETICS December 1, 2012 vol. 2 no. 12 1595-1605; https://doi.org/10.1534/g3.112.003665 RRBLUP ridge regression best linear unbiased predictor A data transformation which calculates predictions of breeding values using an animal model and a relationship matrix calculated from the genomic/genetic markers (G matrix), in contrast to using pedigree information as in BLUP, also known as ABLUP. Philippe Rocca-Serra adapted from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3382275/ and from: https://www.rdocumentation.org/packages/pedigree/versions/1.4/topics/gblup and from GBLUP gblup(formula, data, M, lambda) where: formula: formula of the model, do not include the random effect due to animal (generally ID). data: data.frame with columns corresponding to ID and the columns mentioned in the formula.
M: Matrix of marker genotypes, usually the count of one of the two SNP alleles at each marker (0, 1, or 2). lambda: Variance ratio (σ2e/σ2a) https://www.rdocumentation.org/packages/pedigree/versions/1.4/topics/gblup genomic best linear unbiased prediction A data transformation which calculates estimates of genomic estimated breeding values (GEBVs) on an animal or plant model utilizing trait-specific marker information. Philippe Rocca-Serra from: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012648 TABLUP trait-specific relationship matrix best linear unbiased prediction Bayes A is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model and sets the prior probability π that a SNP has zero effect to zero (i.e., π = 0), so that all SNPs are assumed to have some effect. Philippe Rocca-Serra A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2637029/ doi: 10.1186/1297-9686-41-2. and: Prediction of total genetic value using genome-wide dense marker maps. Meuwissen TH, Hayes BJ, Goddard ME. Genetics. 2001 Apr;157(4):1819-29. PMID: 11290733 Bayes A Bayes B is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model, fixes the prior probability π that a SNP has zero effect at a set value (i.e., π > 0) and uses a mixture distribution. Philippe Rocca-Serra A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2637029/ doi: 10.1186/1297-9686-41-2. and: Prediction of total genetic value using genome-wide dense marker maps. Meuwissen TH, Hayes BJ, Goddard ME. Genetics. 2001 Apr;157(4):1819-29. PMID: 11290733 Bayes B The estimated breeding value of an organism is a data item computed to estimate the true breeding value, defined as the genetic merit of an organism, half of which will be passed on to its progeny.
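The ridge-regression idea underlying RR-BLUP (all markers shrunk equally, one shared variance) can be sketched directly with numpy. This is an illustrative sketch on synthetic data, not the rrBLUP or pedigree package APIs quoted above; the marker matrix X holds 0/1/2 allele counts and lambda plays the role of the variance ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_mark = 200, 50
X = rng.integers(0, 3, size=(n_ind, n_mark)).astype(float)  # 0/1/2 allele counts
true_effects = rng.normal(0, 0.3, n_mark)
y = X @ true_effects + rng.normal(0, 1.0, n_ind)

Xc = X - X.mean(axis=0)              # centre marker columns
yc = y - y.mean()
lam = 10.0                           # assumed-known variance ratio for the sketch

# Ridge/RR-BLUP solution: every marker effect shrunk equally towards zero
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_mark), Xc.T @ yc)
gebv = Xc @ beta                     # (centred) estimated breeding values
```

With many individuals and modest noise, the shrunken effects track the simulated ones closely, which is the behaviour the RR-BLUP definition above relies on.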
While the exact breeding value cannot be known, for performance traits it is possible to make good estimates. These estimates are called Estimated Breeding Values (EBVs). EBVs are expressed in the units of measurement for each particular trait. These estimates are the output of various estimation methods which differ depending on the underlying assumptions (equal variance of marker effects, all markers contributing to the trait), the mathematical methods used (Bayesian or non-Bayesian) and the genetic inheritance models being considered (additive, dominant, epistatic) selected by the analysts. Philippe Rocca-Serra adapted from: http://abri.une.edu.au/online/pages/understanding_ebvs_char.htm EBV estimated breeding value An additive genetic model is a data item which refers to the contributions to the final phenotype from more than one gene, or from alleles of a single gene (in heterozygotes), that combine in such a way that the sum of their effects in unison is equal to the sum of their effects individually. Philippe Rocca-Serra additive genetic inheritance model An additive dominant genetic model is a data item which refers to the contributions to the final phenotype from more than one gene, or from alleles of a single gene (in heterozygotes), that combine in such a way that the sum of their effects in unison is equal to the sum of their effects individually, plus dominance (of alleles at a single locus).
Philippe Rocca-Serra additive dominant genetic inheritance model An additive dominant epistatic genetic model is a data item which refers to the contributions to the final phenotype from more than one gene, or from alleles of a single gene (in heterozygotes), that combine in such a way that the sum of their effects in unison is equal to the sum of their effects individually, plus dominance (of alleles at a single locus) and epistasis (of alleles at different loci). Philippe Rocca-Serra additive dominant genetic and epistatic inheritance model Dunn's Multiple Comparison Test is a post hoc test (i.e. it is run after an ANOVA) and a non-parametric test (a "distribution free" test that does not assume your data come from a particular distribution). It is one of the least powerful of the multiple comparisons tests and can be a very conservative test, especially for larger numbers of comparisons. The Dunn test is an alternative to the Tukey test when you only want to test for differences in a small subset of all possible pairs; for larger numbers of pairwise comparisons, use Tukey's instead. Use Dunn's when you choose to test a specific number of comparisons before you run the ANOVA and when you are not comparing to controls. If you are comparing to a control group, use the Dunnett test instead. Philippe Rocca-Serra Dunn, O.J. (1961) Multiple comparisons among means. JASA, 56: 54-64 Dunn, Olive Jean (1964). "Multiple comparisons using rank sums". Technometrics. 6 (3): 241–252. doi:10.2307/1266041 and adapted from: http://www.statisticshowto.com/dunns-test/ Dunn's test ## Default S3 method: dunnTest(x, g, method = dunn.test::p.adjustment.methods[c(4, 2:3, 5:8, 1)], two.sided = TRUE, altp = two.sided, ...)
from: http://www.rforge.net/doc/packages/FSA/dunnTest.html Dunn’s multiple comparison test The Conover-Iman test for stochastic dominance is a statistical test for multiple group comparisons and reports the results among multiple pairwise comparisons after a Kruskal-Wallis test for stochastic dominance among k groups (Kruskal and Wallis, 1952). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other. The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group that is larger than a randomly selected value from the second group equals one half; this null hypothesis corresponds to that of the Wilcoxon-Mann-Whitney rank-sum test. Like the rank-sum test, if the data can be assumed to be continuous, and the distributions are assumed identical except for a difference in location, the Conover-Iman test may be understood as a test for median difference. conover.test accounts for tied ranks. The Conover-Iman test is strictly valid if and only if the corresponding Kruskal-Wallis null hypothesis is rejected. Philippe Rocca-Serra Conover, W. J. and Iman, R. L. (1979). On multiple-comparisons procedures. Technical Report LA-7677-MS, Los Alamos Scientific Laboratory. posthoc.kruskal.conover.test(x, …) # S3 method for default posthoc.kruskal.conover.test(x, g, p.adjust.method = p.adjust.methods, …) # S3 method for formula posthoc.kruskal.conover.test(formula, data, subset, na.action, p.adjust.method = p.adjust.methods, …) https://www.rdocumentation.org/packages/PMCMR/versions/4.2/topics/posthoc.kruskal.conover.test conover.test makes k(k-1)/2 multiple pairwise comparisons based on the Conover-Iman t-test statistic of the rank differences.
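The omnibus-then-pairwise workflow described above (Kruskal-Wallis first, k(k-1)/2 pairwise rank comparisons only if it rejects) can be sketched with scipy. This is an illustrative analogue, not the exact Dunn or Conover-Iman statistic: it follows the omnibus test with Bonferroni-adjusted pairwise rank-sum tests on made-up data.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

groups = {
    "A": [12, 15, 14, 11, 13, 16],
    "B": [22, 25, 24, 21, 23, 26],
    "C": [12, 14, 13, 15, 11, 12],
}

stat, p = kruskal(*groups.values())          # omnibus test across the k groups
if p < 0.05:                                 # follow up only if omnibus rejects
    pairs = list(combinations(groups, 2))
    m = len(pairs)                           # k(k-1)/2 comparisons
    for g1, g2 in pairs:
        _, p_pair = mannwhitneyu(groups[g1], groups[g2],
                                 alternative="two-sided")
        print(g1, g2, "adjusted p =", min(1.0, p_pair * m))  # Bonferroni
```

Running the pairwise step conditionally mirrors the "strictly valid if and only if the Kruskal-Wallis null is rejected" caveat in the definition; a real Dunn or Conover-Iman analysis would use the pooled-rank z or t statistics instead of Mann-Whitney.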
Conover-Iman test of multiple comparisons using rank sums application to breeding value estimation and genomic selection https://www.ncbi.nlm.nih.gov/pubmed/20122298 Bayesian LASSO is a data transformation where the regression parameters have independent Laplace (i.e., double-exponential) priors and are used to interpret Lasso estimates for linear regression parameters as Bayesian posterior mode estimates in accordance with a Bayesian framework. Philippe Rocca-Serra https://www.tandfonline.com/doi/abs/10.1198/016214508000000337 Bayes LASSO Bayesian least absolute shrinkage and selection operator A genotype matrix is a genomic relationship matrix in its rawest form, simply corresponding to a matrix of individuals' genotypes for a given set of markers or genomic positions. Columns are SNPs or markers, rows are individuals. Each cell contains a genotype expressed, if the genome is diploid, as a pair of characters chosen from ATGC where the dominant variant is uppercased and the recessive variant is lowercased. Philippe Rocca-Serra http://articles.extension.org/pages/68019/genomic-relationships-and-gblup genotype matrix The MAF matrix is a genomic relationship matrix which is obtained from the genotype matrix by counting the number of minor alleles at each locus. Philippe Rocca-Serra http://articles.extension.org/pages/68019/genomic-relationships-and-gblup MAF matrix gene content matrix matrix of minor allele count MAF matrix The M matrix is a genomic relationship matrix which is obtained by subtracting 1 from every value of the MAF matrix (gene content matrix). The values of the M matrix are only -1, 0 or 1, which makes computation easier. M = MAF - 1 Philippe Rocca-Serra http://articles.extension.org/pages/68019/genomic-relationships-and-gblup Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. 2008 P. M. VanRaden.
10.3168/jds.2007-0980 deviation of 1 from the gene content matrix: M = MAF - 1 M matrix The P matrix is a genomic relationship matrix which contains allele frequencies expressed as a difference from 0.5 and multiplied by 2. Philippe Rocca-Serra http://articles.extension.org/pages/68019/genomic-relationships-and-gblup Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. 2008 P. M. VanRaden. 10.3168/jds.2007-0980 P matrix The Z matrix is a genomic relationship matrix which is obtained by subtracting the P matrix from the M matrix. It is also known as the incidence matrix for the markers. Philippe Rocca-Serra http://articles.extension.org/pages/68019/genomic-relationships-and-gblup Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. 2008 P. M. VanRaden. 10.3168/jds.2007-0980 incidence matrix for genotyping markers Z matrix The degrees of freedom numerator is the number of degrees of freedom that the estimate of variance used in the numerator is based on. It is one of the parameters for the F-distribution used to compute probabilities in analysis of variance. term request: https://github.com/ISA-tools/stato/issues/71 Hanna Cwiek Philippe Rocca-Serra df1 num df numerator degrees of freedom An augmented design is a kind of experimental design where the goal is to compare existing (control) treatments with new treatments that have an experimental constraint of "limited replication". To understand limited replication, consider experiments that may only allow a single representation of the new treatment, a limitation often due to the cost associated with the experiment, limited resources, or a limited number of new units that can be used in the experiment. In contrast, the existing treatments are referred to as checks and are generally replicated multiple times.
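The chain of matrices defined above (minor-allele count "MAF"/gene content matrix → M → P → Z, following VanRaden's construction) can be sketched with numpy on a tiny made-up dataset; this is an illustrative sketch, not the snpReady or sommer package APIs.

```python
import numpy as np

# Minor-allele count (gene content) matrix: rows = individuals,
# columns = markers, entries = 0, 1 or 2 copies of the minor allele.
maf = np.array([[0, 1, 2, 1],
                [1, 1, 0, 2],
                [2, 0, 1, 1]], dtype=float)

M = maf - 1                      # M matrix: entries in {-1, 0, 1}
p = maf.mean(axis=0) / 2         # observed minor-allele frequency per marker
P = 2 * (p - 0.5)                # P matrix: frequencies as 2*(p - 0.5)
Z = M - P                        # Z = M - P, the marker incidence matrix
# VanRaden's genomic relationship matrix built from Z
G = (Z @ Z.T) / (2 * np.sum(p * (1 - p)))
```

Subtracting P centres each marker column (Z has zero column means), so G expresses realized relationships as deviations from the allele-frequency expectation.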
With an augmented design one can estimate the following: a) differences between checks and new treatments, b) differences among new treatments, c) differences among check treatments, and d) differences among new and check treatments combined. Philippe Rocca-Serra Federer, W.T. 1956. Augmented (or hoonuiaku) designs. Hawaiian Planters’ Record LV(2): 191–208) http://rna.genomics.purdue.edu/ an example of a dataset representing an augmented design: https://www.rdocumentation.org/packages/sommer/versions/3.2/topics/augment augmented experimental design A probability distribution location parameter is a data item which is set by the operator when selecting a parametric probability distribution and which shifts the location of the distribution plot without changing its shape or scale. https://github.com/ISA-tools/stato/issues/50 Philippe Rocca-Serra adapted from: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#shifting-and-scaling norm.stats(loc=3, scale=4, moments="mv"), where loc is the location parameter indicating how much shift is applied probability distribution location parameter The Weibull probability distribution is a continuous probability distribution which is used to model time to fail, time to repair and material strength in materials science. In biomedicine, the Weibull probability distribution is used in determining 'hazard functions'. The 'location parameter' of the Weibull probability distribution can be used to define a failure-free zone. If the quantity X is a "time-to-failure", the Weibull distribution gives a distribution for which the failure rate is proportional to a power of time. The shape parameter, k, is that power plus one, and so this parameter can be interpreted directly as follows: A value of k < 1 indicates that the failure rate decreases over time.
This happens if there is significant "infant mortality", or defective items failing early and the failure rate decreasing over time as the defective items are weeded out of the population. In the context of the diffusion of innovations, this means negative word of mouth: the hazard function is a monotonically decreasing function of the proportion of adopters. A value of k = 1 indicates that the failure rate is constant over time. This might suggest random external events are causing mortality, or failure. The Weibull distribution reduces to an exponential distribution. A value of k > 1 indicates that the failure rate increases with time. This happens if there is an "aging" process, or parts that are more likely to fail as time goes on. In the context of the diffusion of innovations, this means positive word of mouth: the hazard function is a monotonically increasing function of the proportion of adopters. The function is first concave, then convex, with an inflexion point at (e^{1/k} - 1)/e^{1/k}, k > 1. Philippe Rocca-Serra Weibull distribution adapted from: https://en.wikipedia.org/wiki/Weibull_distribution and from http://www.engineeredsoftware.com/nasa/weibull.htm Weibull probability distribution pweibull(q, shape, scale = 1, lower.tail = TRUE, log.p = FALSE) from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Weibull.html scipy.stats.weibull_min from: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.weibull_min.html#scipy.stats.weibull_min Weibull probability distribution Statistical sampling is a planned process which aims at assembling a population of observation units (samples) in as unbiased a manner as possible in order to obtain or infer information about the actual population from which these samples have been drawn.
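The three Weibull shape regimes described above (decreasing, constant, and increasing failure rate for k < 1, k = 1, k > 1) can be verified numerically with scipy.stats.weibull_min, which the text itself references; the hazard is computed as pdf/survival for unit scale.

```python
import numpy as np
from scipy.stats import weibull_min

t = np.linspace(0.5, 5.0, 50)

def hazard(k, t):
    # hazard (failure rate) = pdf / survival function, Weibull shape k, scale 1
    return weibull_min.pdf(t, k) / weibull_min.sf(t, k)

h_dec = hazard(0.5, t)    # k < 1: failure rate decreases ("infant mortality")
h_const = hazard(1.0, t)  # k = 1: constant failure rate (exponential case)
h_inc = hazard(1.5, t)    # k > 1: failure rate increases ("aging")
```

For k = 1 the hazard is identically 1 (the exponential distribution), matching the reduction noted in the text.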
Philippe Rocca-Serra STATO statistical sampling Simple random sampling is a statistical sampling process which creates a sample of size n entirely by chance. In such a process, each unit has the same probability of being selected. Depending on the size of the population being sampled, the sampling process may be done with or without replacement. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Simple_random_sample random sampling srswor(n,N) from https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/srswor [simple random sampling without replacement] Sampling R package srswr(n,N) from: https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/srswr [simple random sampling with replacement] Sampling R package simple random sampling It is a sampling process used, among other things, in ecological studies when studying how things change in a given environment. Line intercept sampling is a sampling process by which an element in a spatial region is included in a sample if it is intersected by a line chosen by the operator. Philippe Rocca-Serra LIS LIS sampling Lee Kaiser, Biometrics, Vol. 39, No. 4 (Dec., 1983), pp. 965-976 http://www.jstor.org/stable/2531331 line intercept sampling line intercept sampling Quadrat sampling is a classic tool for the study of ecology, especially biodiversity. In general, a series of squares (quadrats) of a set size are placed in a habitat of interest and the species within those quadrats are identified and recorded. Passive quadrat sampling (done without removing the organisms found within the quadrat) can either be done by hand, with researchers carefully sorting through each individual quadrat or, more efficiently, by taking a photograph of the quadrat for future analysis.
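Simple random sampling with and without replacement, as defined above (and implemented in R by srswor/srswr), can be sketched with the Python standard library; the population of 100 numbered units is made up for the example.

```python
import random

population = list(range(1, 101))     # 100 numbered units
rng = random.Random(42)

# Simple random sampling WITHOUT replacement: each unit appears at most once
sample_wor = rng.sample(population, 10)

# Simple random sampling WITH replacement: units may be drawn repeatedly
sample_wr = [rng.choice(population) for _ in range(10)]
```

In both schemes every unit has the same selection probability at each draw, which is the defining property of simple random sampling.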
Alejandra Gonzalez-Beltran Philippe Rocca-Serra http://www.coml.org/investigating/observing/quadrat_sampling.html quadrat sampling Cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Cluster_sampling cluster(data, clustername, size, method=c("srswor","srswr","poisson", "systematic"),pik,description=FALSE) from: https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/cluster Sampling R package cluster sampling Probability proportional to size ('PPS') sampling is a sampling method in which the selection probability for each element is set to be proportional to its size measure, up to a maximum of 1. In a simple PPS design, these selection probabilities can then be used as the basis for Poisson sampling. However, this has the drawback of variable sample size, and different portions of the population may still be over- or under-represented due to chance variation in selections. 
Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Sampling_(statistics)#Probability-proportional-to-size_sampling PPS sampling probability-proportional-to-size sampling Stratified sampling is a statistical sampling method which divides the population into homogeneous subpopulations, which are then sampled using random or systematic sampling methods. Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Stratified_sampling stratified sampling strata(data, stratanames=NULL, size, method=c("srswor","srswr","poisson", "systematic"), pik,description=FALSE) From: https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/strata R Sampling Package stratified sampling Systematic sampling is a process for collecting samples and assembling a statistical sample using a system or method (e.g. unequal probabilities, without replacement, fixed sample size), as opposed to random sampling. Philippe Rocca-Serra Madow, W.G. (1949), On the theory of systematic sampling, II, Annals of Mathematical Statistics, 20, 333-354. from: https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/UPsystematic Sampling R package systematic sampling Quota sampling is a method for selecting survey participants that is a non-probabilistic version of stratified sampling. In quota sampling, a population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means that individuals can put a demand on who they want to sample (targeting).
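Stratified sampling as defined above (homogeneous subpopulations, each sampled by simple random sampling) can be sketched in plain Python with proportional allocation; the strata and their sizes here are hypothetical, and this is not the R sampling package's strata() function.

```python
import random

# Hypothetical population divided by a stratification variable (e.g. region)
strata = {
    "north": list(range(0, 600)),     # 600 units
    "south": list(range(600, 900)),   # 300 units
    "east":  list(range(900, 1000)),  # 100 units
}
total = sum(len(units) for units in strata.values())
n = 50                                # overall sample size

rng = random.Random(7)
sample = {}
for name, units in strata.items():
    k = round(n * len(units) / total)         # proportional allocation
    sample[name] = rng.sample(units, k)       # SRS within each stratum
```

Proportional allocation keeps each stratum's share of the sample equal to its share of the population (here 30, 15 and 5 units), which is what distinguishes this probability design from the judgment-based quota sampling described above.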
Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Quota_sampling quota sampling Panel sampling is the method of first selecting a group of participants through a random sampling method and then asking that group for (potentially the same) information several times over a period of time. Therefore, each participant is interviewed at two or more time points; each period of data collection is called a "wave". The method was developed by sociologist Paul Lazarsfeld in 1938 as a means of studying political campaigns. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Sampling_(statistics)#Panel_sampling panel sampling Snowball sampling (or chain sampling, chain-referral sampling, referral sampling) is a non-probability sampling technique where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group is said to grow like a rolling snowball. Alejandra Gonzalez-Beltran Philippe Rocca-Serra from: https://en.wikipedia.org/wiki/Snowball_sampling chain sampling referral sampling snowball sampling chain-referral sampling The voluntary sampling method is a type of non-probability sampling. A voluntary sample is made up of people who self-select into the survey. Often, these subjects have a strong interest in the main topic of the survey. Volunteers may be invited through advertisements on social media sites. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Sampling_(statistics)#Voluntary_Sampling voluntary sampling Convenience sampling (also known as grab sampling, accidental sampling, or opportunity sampling) is a type of non-probability sampling that involves the sample being drawn from the part of the population that is close to hand. This type of sampling is most useful for pilot testing.
Alejandra Gonzalez-Beltran Philippe Rocca-Serra from wikipedia: https://en.wikipedia.org/wiki/Convenience_sampling accidental sampling grab sampling opportunity sampling convenience sampling Brewer's sampling is a statistical sampling method which was proposed by Brewer in 1975 and uses an unequal probability sampling technique. Philippe Rocca-Serra Brewer, Kenneth RW (1975). A Simple Procedure For Sampling πpswor1. Australian Journal of Statistics, 17(3), 166-172. UPbrewer(pik,eps=1e-06) from: https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/UPbrewer Sampling R package Brewer sampling In imbalanced datasets, where the sampling ratio does not follow the population statistics, one can resample the dataset in a conservative manner called minimax sampling. Minimax sampling has its origin in Anderson's minimax ratio, whose value is proved to be 0.5: in a binary classification, the class-sample sizes should be chosen equally. This ratio can be proved to be the minimax ratio only under the assumption of an LDA classifier with Gaussian distributions. The notion of minimax sampling was recently developed for a general class of classification rules, called class-wise smart classifiers. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Sampling_(statistics)#Minimax_sampling minimax sampling Complete randomization is a group randomization where experimental units are randomly assigned to the entire set of groups defined by the experimental treatments. Philippe Rocca-Serra STATO crPar(N, K = 2, ratio = rep(1, K), groups = LETTERS[1:K]) from: https://www.rdocumentation.org/packages/randomizeR/versions/1.4/topics/crPar complete randomization Data imputation is a data transformation process whereby missing data are replaced with an estimated value for the missing element. The substituted values are intended to create a data record that does not fail edits. Various methods may be used to produce these substituted values.
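One of the simplest of the imputation methods mentioned in the data-imputation definition above, substitution by the mean, can be sketched in two lines of numpy on a made-up vector with missing values:

```python
import numpy as np

x = np.array([2.0, np.nan, 4.0, 6.0, np.nan, 8.0])

# Substitution by the mean: replace each missing value with the mean of
# the observed values, producing a record that "does not fail edits".
filled = np.where(np.isnan(x), np.nanmean(x), x)
```

Note the caveat that applies to such simple substitutions generally: the filled-in record understates the variability of the variable, since every imputed cell takes the same central value.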
Philippe Rocca-Serra adapted from wikipedia and from the OECD glossary of statistical terms https://stats.oecd.org/glossary/detail.asp?ID=3406 data imputation Last observation carried forward data imputation is a type of data imputation which uses a very simple, self-explanatory method for substituting a missing value for an observation. It should be noted that this method gives a biased estimate of the treatment effect and underestimates the variability of the estimated result, and should be used cautiously. Philippe Rocca-Serra adapted from Wikipedia: https://en.wikipedia.org/wiki/Analysis_of_clinical_trials#Last_observation_carried_forward last observation carried forward data imputation Regression data imputation is a type of data imputation where missing values are replaced with values predicted by a regression function. Philippe Rocca-Serra regression data imputation Substitution by the mean data imputation is a type of data imputation where missing values are replaced with the value of the variable mean. Philippe Rocca-Serra substitution by the mean data imputation https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/ Multivariate imputation with chained equations (MICE) is a type of data imputation which uses an algorithm devised by Stef van Buuren and Karin Groothuis-Oudshoorn. Philippe Rocca-Serra MICE: Multivariate Imputation by Chained Equations in R by Stef van Buuren and Karin Groothuis-Oudshoorn. Journal of Statistical Software, http://www.stefvanbuuren.nl/publications/mice%20in%20r%20-%20draft.pdf MICE library(mice) miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf") # perform mice imputation, based on random forests. multivariate imputation with chained equations k-nearest neighbour imputation is a data imputation which uses the k-nearest neighbour algorithm to compute a substitution value for the missing values.
For every observation to be imputed, it identifies the 'k' closest observations based on the Euclidean distance and computes the weighted average (weighted by distance) of these 'k' observations. Philippe Rocca-Serra adapted from: http://r-statistics.co/Missing-Value-Treatment-With-R.html kNN data imputation library(DMwR) knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"]) # perform knn imputation. anyNA(knnOutput) from: http://r-statistics.co/Missing-Value-Treatment-With-R.html k-nearest neighbour data imputation The Matthews correlation coefficient (or MCC) is a correlation coefficient which is a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975. Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient and from: https://doi.org/10.1016/0005-2795(75)90109-9 MCC mcc(preds = NULL, actuals = NULL, TP = NULL, FP = NULL, TN = NULL, FN = NULL) from: https://www.rdocumentation.org/packages/mltools/versions/0.3.4/topics/mcc Matthews correlation coefficient A covariance matrix is a square matrix that contains the variances and covariances associated with several variables. The diagonal elements of the matrix contain the variances of the variables and the off-diagonal elements contain the covariances between all possible pairs of variables. Philippe Rocca-Serra dispersion matrix variance-covariance matrix covariance matrix https://www.r-bloggers.com/setup-up-the-inverse-of-additive-relationship-matrix-in-r/ The numerator relationship matrix is the matrix of *expected* additive genetic relationships between individuals. This matrix was originally used by Henderson (Henderson, C.R. 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32:69-83.)
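The Matthews correlation coefficient defined above is easily computed straight from the confusion-matrix counts; a minimal Python sketch (the helper function is illustrative, not the mltools::mcc R function quoted in the text):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# A perfect binary classifier scores 1.0; total disagreement scores -1.0.
print(mcc(tp=50, fp=0, tn=50, fn=0))  # → 1.0
```

Because it uses all four cells of the confusion matrix, the MCC stays informative even when the two classes are of very different sizes.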
to account for covariances between random effects, and therefore to use information from relatives in the estimation of breeding values. Among the properties of the NRM matrix (also known as the A matrix): it is symmetric, and the diagonal values correspond to 1 + the inbreeding coefficient of an individual. Philippe Rocca-Serra A matrix adapted from: https://jvanderw.une.edu.au/Genetic_properties_of_the_animal_model.pdf and from: Henderson, C.R. 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32:69-83. https://doi.org/10.2307/2529339 NRM https://rdrr.io/cran/sommer/man/A.mat.html numerator relationship matrix The degrees of freedom denominator is the number of degrees of freedom that the estimate of variance used in the denominator is based on. It is one of the parameters for the F-distribution used to compute probabilities in analysis of variance. term request: https://github.com/ISA-tools/stato/issues/71 Hanna Cwiek Philippe Rocca-Serra den df df2 denominator degrees of freedom A matrix of relationships among a group of individuals, which can be used to predict breeding values, to manage inbreeding and in genetic conservation. It can be calculated from the pedigree, but it is also possible to calculate the relationship matrix from genotypes at genetic markers such as single-nucleotide polymorphisms (SNPs). Elements of the genomic relationship matrix are estimates of the realized proportion of the genome that two individuals share, whereas the pedigree-derived relationship matrix is the expectation of this proportion. Philippe Rocca-Serra https://doi.org/10.3168/jds.2007-0980 https://www.ncbi.nlm.nih.gov/pubmed/22059574 realized genomic relationship matrix relationship matrix https://cran.r-project.org/web/packages/snpReady/snpReady.pdf G matrix A scaled t distribution is a kind of Student's t distribution which is shifted by 'mean' and scaled by standard deviation 'sd'.
Philippe Rocca-Serra R documentation t.scaled(x, df, mean = 0, sd = 1, ncp, log = FALSE) from: https://www.rdocumentation.org/packages/metRology/versions/0.9-23-2/topics/Scaled%20t%20distribution scaled t distribution a Bayesian model is a statistical model where inference is based on using Bayes' theorem to obtain a posterior distribution for a quantity (or quantities) of interest for some model (such as parameter values) based on some prior distribution for the relevant unknown parameters and the likelihood from the model. Philippe Rocca-Serra adapted from several sources: Oxford Dictionary of Statistics: 10.1093/acref/9780199541454.001.0001 http://www.scholarpedia.org/article/Bayesian_statistics https://stats.stackexchange.com/questions/129017/what-exactly-is-a-bayesian-model Bayesian model a prior probability distribution is a probability distribution used as input to a Bayesian model to represent a priori knowledge about a model parameter. Along with the acquired/observed data, it is used to compute a posterior distribution according to Bayes' theorem. Philippe Rocca-Serra Oxford Dictionary of Statistics: 10.1093/acref/9780199541454.001.0001 prior probability distribution a posterior probability distribution is a probability distribution computed in a Bayesian model approach given a prior distribution and a set of events/observations. Philippe Rocca-Serra Oxford Dictionary of Statistics: 10.1093/acref/9780199541454.001.0001 posterior probability distribution Bayes C pi is a data transformation used to compute estimated breeding values using a Bayesian model and which assesses the SNP effect using Markov chain Monte Carlo methods. Bayes C pi treats the prior probability π that a SNP has zero effect as unknown. The method was devised to address shortcomings of the Bayes A and Bayes B approaches. Philippe Rocca-Serra adapted from: BMC Bioinformatics. 2011 May 23;12:186. doi: 10.1186/1471-2105-12-186. Extension of the bayesian alphabet for genomic selection.
Habier D, Fernando RL, Kizilkaya K, Garrick DJ. but also https://cran.r-project.org/web/packages/gdmp/gdmp.pdf and https://jvanderw.une.edu.au/RFSlides.pdf Bayes C pi Bayes C pi a genetic inheritance model is a data item defining the assumption that a breeding value estimation method makes when running its calculations. Philippe Rocca-Serra STATO genetic inheritance model sampling from a probability distribution is a data transformation which aims at obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. Philippe Rocca-Serra STATO sampling from a probability distribution Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult. Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Gibbs_sampling Geman, S.; Geman, D. (1984). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images". IEEE Transactions on Pattern Analysis and Machine Intelligence. 6 (6): 721–741. doi:10.1109/TPAMI.1984.4767596 Gibbs sampling the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. Philippe Rocca-Serra https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika. 57 (1): 97–109. doi:10.1093/biomet/57.1.97 Metropolis–Hastings sampling a continuous multivariate probability distribution is a continuous probability distribution which describes the possible values, and corresponding probabilities, of two or more (usually three or more) associated random variables.
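The Metropolis–Hastings method defined above can be sketched compactly. The random-walk variant below is illustrative (the function name and parameters are assumptions, not part of any cited package); it draws dependent samples from a density known only up to a normalizing constant:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, proposal_sd=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' = x + N(0, sd), accept
    with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        candidate = x + rng.gauss(0.0, proposal_sd)
        if math.log(rng.random()) < log_target(candidate) - log_target(x):
            x = candidate  # accept; otherwise keep the current state
        samples.append(x)
    return samples

# Sample from a standard normal, whose log-density is -x^2/2 up to a constant.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
```

The symmetric random-walk proposal makes the Hastings correction ratio cancel, which is why only the ratio of target densities appears in the acceptance step.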
Philippe Rocca-Serra http://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1095?rskey=355sBu&result=1 continuous multivariate probability distribution a discrete multivariate probability distribution is a discrete probability distribution which describes the possible values, and corresponding probabilities, of two or more (usually three or more) associated random variables. Philippe Rocca-Serra http://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1095?rskey=355sBu&result=1 discrete multivariate probability distribution A data transformation that produces a reproducing kernel Hilbert space (or RKHS), which is a Hilbert space of functions in which point evaluation is a continuous linear functional. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space RKHS https://www.rdocumentation.org/packages/KGode/versions/1.0.1/topics/rkhs rkhs reproducing kernel Hilbert space procedure a state space model is a kind of statistical model which describes the probabilistic dependence between the latent state variable and the observed measurement. The state or the measurement can be either continuous or discrete. The term “state space” originated in the 1960s in the area of control engineering (Kalman, 1960). SSM provides a general framework for analyzing deterministic and stochastic dynamical systems that are measured or observed through a stochastic process. Philippe Rocca-Serra http://www.scholarpedia.org/article/State_space_model SSM state space model genomic estimated breeding value (GEBV) is an estimated breeding value derived from information in an organism's DNA (genotype). GEBV is calculated differently from conventional estimated breeding values, using advanced modeling techniques to deal with high-dimensional data.
Alejandra Gonzalez-Beltran Philippe Rocca-Serra adapted from: https://businesswales.gov.wales/farmingconnect/posts/genomic-breeding-values GEBV genomic estimated breeding value In a planned experiment where the covariance (genotype x environment) can be controlled and held at 0, the heritability is defined as the ratio of the variance of the genotypic variables to the variance of the phenotypic variables. H2 = Var(G)/Var(P) H2 is the broad-sense heritability. This reflects all the genetic contributions to a population's phenotypic variance including additive, dominant, and epistatic (multi-genic interactions), as well as maternal and paternal effects, where individuals are directly affected by their parents' phenotype, for example, milk production in mammals. Philippe Rocca-Serra https://en.wikipedia.org/wiki/Heritability H2 broad sense heritability heritability A particularly important component of the genetic variance is the additive variance, Var(A), which is the variance due to the average effects (additive effects) of the alleles. Since each parent passes a single allele per locus to each offspring, parent-offspring resemblance depends upon the average effect of single alleles. Additive variance represents, therefore, the genetic component of variance responsible for parent-offspring resemblance. The additive genetic portion of the phenotypic variance is known as Narrow-sense heritability and is defined as: h2 = Var(A)/Var(P) Philippe Rocca-Serra https://en.wikipedia.org/wiki/Heritability h2 narrow sense heritability Bayes R is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model to compute 'genomic estimated breeding values'. In contrast to Bayes B methods, the new method assumes that the true SNP effects are derived from a series of normal distributions, the first with zero variance, up to one with a variance of approximately 1% of the genetic variance.
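The ratios H2 = Var(G)/Var(P) and h2 = Var(A)/Var(P) given above are simple arithmetic on variance components; a minimal Python sketch with hypothetical variance values (the function names are illustrative):

```python
def broad_sense_heritability(var_g, var_e):
    """H2 = Var(G) / Var(P), with Var(P) = Var(G) + Var(E) when the
    genotype-by-environment covariance is held at 0."""
    return var_g / (var_g + var_e)

def narrow_sense_heritability(var_a, var_p):
    """h2 = Var(A) / Var(P): the additive share of phenotypic variance."""
    return var_a / var_p

# Hypothetical variance components:
print(broad_sense_heritability(var_g=30.0, var_e=20.0))   # 0.6
print(narrow_sense_heritability(var_a=20.0, var_p=50.0))  # 0.4
```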
Philippe Rocca-Serra Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012;95:4114–29 doi: 10.3168/jds.2011-5019. https://www.ncbi.nlm.nih.gov/pubmed/22720968 Bayes R The double exponential distribution (a.k.a. Laplace distribution) is the distribution of differences between two independent variates with identical exponential distributions (Abramowitz and Stegun 1972, p. 930). Philippe Rocca-Serra http://mathworld.wolfram.com/LaplaceDistribution.html double exponential probability distribution dLaplace(x, mu = 0, b = 1, params = list(mu, b), ...) https://www.rdocumentation.org/packages/ExtDist/versions/0.6-3/topics/Laplace https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.laplace.html Laplace probability distribution Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Bootstrapping_(statistics) bootstrap sampling distribution estimation by bootstrapping random forest procedure is a type of data transformation used in classification and statistical learning using regression. The random forest procedure is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset (it operates by constructing a multitude of decision trees at training time) and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). The random forest procedure outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
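Bootstrapping, as defined above (estimating properties of an estimator by measuring them over resamples), can be sketched for the standard error of a statistic. The function name, sample data and replicate count below are illustrative assumptions:

```python
import random

def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Standard error of `stat` estimated as the standard deviation of the
    statistic over resamples drawn with replacement from the data."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # same size, with replacement
        replicates.append(stat(resample))
    m = sum(replicates) / n_boot
    return (sum((r - m) ** 2 for r in replicates) / (n_boot - 1)) ** 0.5

data = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.5, 3.8]          # hypothetical sample
se_mean = bootstrap_se(data, lambda xs: sum(xs) / len(xs))
```

For the sample mean, this bootstrap estimate should land close to the analytic s/sqrt(n) for the same data.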
Philippe Rocca-Serra adapted from: https://en.wikipedia.org/wiki/Random_forest and http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html random forest # S3 method for formula randomForest(formula, data=NULL, ..., subset, na.action=na.fail) # S3 method for default randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500, mtry=if (!is.null(y) && !is.factor(y)) max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))), replace=TRUE, classwt=NULL, cutoff, strata, sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)), nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1, maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1, proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE, keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE, keep.inbag=FALSE, ...) # S3 method for randomForest print(x, ...) from https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None) from: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False) from: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html random forest procedure log likelihood is a data item which corresponds to the natural logarithm of the likelihood.
log likelihood is a data item commonly used to provide a measure of accuracy of a model. Philippe Rocca-Serra adapted from wikipedia logLik(object, ...) ## S3 method for class 'lm' logLik(object, REML = FALSE, ...) from: https://stat.ethz.ch/R-manual/R-patched/library/stats/html/logLik.html log likelihood A data transformation process in which the Holm p-value procedure is applied with the aim of correcting the false discovery rate Philippe Rocca-Serra Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. http://www.jstor.org/stable/4615733. Holm fdr p.adjust(p, method = "holm", n = length(p)) from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html Holm false discovery rate correction A data transformation process in which the Hommel p-value procedure is applied with the aim of correcting the false discovery rate Philippe Rocca-Serra Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.2307/2336190 Hommel fdr p.adjust(p, method = "hommel", n = length(p)) from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html Hommel false discovery rate correction number of cross-validation segments is a count which is used as an input parameter in a cross validation procedure to evaluate a statistical model. Philippe Rocca-Serra number of cross-validation segments number of predictive components is a count used as input to the principal component analysis (PCA) Philippe Rocca-Serra number of predictive components number of orthogonal components is a count used as input to the orthogonal partial least square discriminant analysis (OPLS-DA) Philippe Rocca-Serra number of orthogonal components A statistical model term testing is a data transformation that accounts for the evaluation of a component of a statistical model or model term.
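The Holm step-down adjustment invoked above via R's p.adjust(p, method = "holm") can be sketched in Python; the function name is an assumption, but the arithmetic mirrors the Holm (1979) procedure:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values: multiply the i-th smallest p-value
    by (n - i), enforce monotonicity over the sorted sequence, cap at 1."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted, running_max = [0.0] * n, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (n - rank) * pvals[i]))
        adjusted[i] = running_max  # results returned in the original order
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03, 0.005]))
```

For this input the smallest p-value is multiplied by 4, the next by 3, and so on, with monotonicity enforced, matching R's p.adjust output.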
Alejandra Gonzalez-Beltran Philippe Rocca-Serra STATO statistical model term testing the Wald test is a statistical test which computes a Wald chi-squared test for 1 or more coefficients, given their variance-covariance matrix. The Wald test (also called the Wald Chi-Squared Test) is a way to find out if explanatory variables in a model are significant. “Significant” means that they add something to the model; variables that add nothing can be deleted without affecting the model in any meaningful way Philippe Rocca-Serra Wald test for a term in a regression model: regTermTest(model, test.terms, null=NULL,df=Inf, method=c("Wald","LRT")) from: http://r-survey.r-forge.r-project.org/survey/html/regTermTest.html wald.test(Sigma, b, Terms = NULL, L = NULL, H0 = NULL, df = NULL, verbose = FALSE) from: https://www.rdocumentation.org/packages/aod/versions/1.3/topics/wald.test Wald test the Rao-Scott test is a statistical test which tests the hypothesis that all coefficients associated with a particular regression term are zero (or have some other specified values). The LRT uses a linear combination of chi-squared distributions Philippe Rocca-Serra Lagrange multiplier test Rao-Scott test Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60. Rao score test regTermTest(model, test.terms, null=NULL,df=Inf, method=c("Wald","LRT")) from: http://r-survey.r-forge.r-project.org/survey/html/regTermTest.html Rao's score test the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.
A probability measure of the reliability of an inferential statistical test that has been applied to sample data and which is provided along with the confidence interval for the output statistic. Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Confidence_interval and from: http://www.oxfordreference.com/view/10.1093/acref/9780191792236.001.0001/acref-9780191792236-e-103?rskey=tQCMI6&result=1 confidence level http://davidmlane.com/hyperstat/A121160.html It is a measure of how precise an estimate of the statistical parameter is. Standard error is the estimated standard deviation of an estimate. It measures the uncertainty associated with the estimate. Compared with the standard deviations of the underlying distribution, which are usually unknown, standard errors can be calculated from observed data. Philippe Rocca-Serra adapted from wikipedia and from SAGE research method article http://methods.sagepub.com/reference/encyc-of-research-design/n435.xml standard error of estimate Biplots are a type of exploratory graph used in statistics, a generalization of the simple two-variable scatterplot. A biplot is constructed by using the singular value decomposition (SVD) to obtain a low-rank approximation to a transformed version of the data matrix X, whose n rows are the samples (also called the cases, or objects), and whose p columns are the variables. The biplot was introduced by K. Ruben Gabriel (1971). Philippe Rocca-Serra Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453–467. adapted from: https://en.wikipedia.org/wiki/Biplot Last accessed: 04/07/2018 biplot(x, y, var.axes = TRUE, col, cex = rep(par("cex"), 2), xlabs = NULL, ylabs = NULL, expand = 1, xlim = NULL, ylim = NULL, arrow.len = 0.1, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ...)
from: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/biplot.html last accessed: 04/07/2018 biplot The coefficient of determination is a data item measuring the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In the case of a linear regression model, the coefficient of determination r2 is the quotient of the variances of the fitted values and observed values of the dependent variable. r2 > eruption.lm = lm(eruptions ~ waiting, data=faithful) > summary(eruption.lm)$r.squared from: http://www.r-tutor.com/elementary-statistics/simple-linear-regression/coefficient-determination coefficient of determination a regression coefficient is a data item generated by a type of data transformation called a regression, which aims to model a response variable by expressing the predictor variables as part of a function in which variable terms are modified by a number. A regression coefficient is one such number. Philippe Rocca-Serra regression coefficient An eigenvalue is a data item resulting from a data transformation known as eigenvalue decomposition. It also corresponds to a process of matrix diagonalization or any equivalent operation, i.e. transforming the underlying system of equations into a special set of coordinate axes in which the matrix takes this canonical form. Each eigenvalue is paired with a corresponding so-called eigenvector. Philippe Rocca-Serra adapted from: http://mathworld.wolfram.com/Eigenvalue.html last accessed: 04/07/2018 eigen(x, symmetric, only.values = FALSE, EISPACK = FALSE) https://stat.ethz.ch/R-manual/R-devel/library/base/html/eigen.html eigenvalue Factor analysis is a dimension reduction data transformation that is used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. Factor analysis is related to principal component analysis (PCA), but the two are not identical.
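The coefficient of determination described above (r2, as returned by summary(...)$r.squared in the R transcript) can equivalently be computed as 1 - SS_res/SS_tot; a minimal Python sketch (the function name is illustrative):

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination: the proportion of variance in the
    observed values explained by the fitted values, 1 - SS_res / SS_tot."""
    mean_y = sum(y_obs) / len(y_obs)
    ss_tot = sum((y - mean_y) ** 2 for y in y_obs)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_obs, y_fit))
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))        # 1.0 for a perfect fit
print(r_squared([1, 2, 3], [1.1, 2.0, 2.9]))  # slightly below 1
```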
Both PCA and factor analysis aim to reduce the dimensionality of a set of data, but the approaches taken to do so are different for the two techniques. Factor analysis is clearly designed with the objective of identifying certain unobservable factors from the observed variables, whereas PCA does not directly address this objective; at best, PCA provides an approximation to the required factors. term request from Ralf Weber and Gavin Lloyd, University of Birmingham Alejandra Gonzalez-Beltran Philippe Rocca-Serra adapted from Wikipedia: https://en.wikipedia.org/wiki/Factor_analysis last accessed: 04/07/2018 Cattell, R. B. (1952). Factor analysis. New York: Harper factanal(x, factors, data = NULL, covmat = NULL, n.obs = NA, subset, na.action, start = NULL, scores = c("none", "regression", "Bartlett"), rotation = "varimax", control = NULL, ...) https://stat.ethz.ch/R-manual/R-devel/library/stats/html/factanal.html factor analysis In factor analysis, factor loadings express the relationship of each variable to the underlying factor. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Factor_analysis https://www.theanalysisfactor.com/factor-analysis-1-introduction/ factor loadings https://stat.ethz.ch/R-manual/R-devel/library/stats/html/loadings.html loadings The score indicates how sensitive a likelihood function L(θ, X) is to its parameter θ. Explicitly, the score for θ is the gradient of the log-likelihood with respect to θ. Alejandra Gonzalez-Beltran Philippe Rocca-Serra https://en.wikipedia.org/wiki/Score_(statistics) https://www.rdocumentation.org/packages/bnlearn/versions/4.3/topics/score efficient score informant score function score score The selectivity ratio (SR) is defined as the ratio of explained variance (v_expl,i) to residual variance (v_res,i) for the variable i on the target projection (TP) component in the context of Partial Least Squares Analysis.
Philippe Rocca-Serra https://onlinelibrary.wiley.com/doi/pdf/10.1002/cem.1289 selectivity ratio selectivity ratio Partial least squares regression (PLS regression) is a data transformation that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares Discriminant Analysis (PLS-DA) is a variant used when the Y is categorical. PLS is used to find the fundamental relations between two matrices (X and Y), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases (unless it is regularized). Partial least squares was introduced by the Swedish statistician Herman O. A. Wold, who then developed it with his son, Svante Wold. An alternative term for PLS (and more correct according to Svante Wold) is projection to latent structures, but the term partial least squares is still dominant in many areas. Although the original applications were in the social sciences, PLS regression is today most widely used in chemometrics and related areas. It is also used in bioinformatics, sensometrics, neuroscience and anthropology.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra PLS adapted from wikipedia: last accessed on 24/07/2018 from: https://en.wikipedia.org/wiki/Partial_least_squares_regression Partial least squares analysis https://rpubs.com/omicsdata/pls http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html Partial Least Square regression a version of PLS used for classification, where the input y-block consists of group labels (a categorical variable) rather than a continuous variable term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra PLS-DA adapted from wikipedia partial least squares discriminant analysis # S3 method for default plsda(x, y, ncomp = 2, probMethod = "softmax", prior = NULL, ...) https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/plsda http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html Partial Least Square Discriminant Analysis The arithmetic mean is defined as the sum of the numerical values of each and every observation divided by the total number of observations. The arithmetic mean A is defined by the formula A=sum[Ai] / n where i ranges from 1 to n and Ai represents the value of individual observations. The arithmetic mean is significantly affected by extreme values and outliers. A better measure of central tendency is the median (http://purl.obolibrary.org/obo/STATO_0000574). replaced OBI import following addition of restrictions and use in STATO. however the xref to OBI is kept as class metadata Alejandra Gonzalez-Beltran Philippe Rocca-Serra STATO arithmetic mean http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html average value http://purl.obolibrary.org/obo/OBI_0000679 The median is that value of the variate which divides the total frequency into two halves. The median is a measure of central tendency of data.
It is obtained by arranging the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values. PRS and AGB added restriction about 'measure of central tendency' and quartile, june 2013 on 'centre value' OBI class. replaced OBI import following addition of restrictions and use in STATO. however the xref to OBI is kept as class metadata Alejandra Gonzalez-Beltran Philippe Rocca-Serra second quartile A Dictionary of Statistical Terms, 5th edition, prepared for the International Statistical Institute by F.H.C. Marriott. Published for the International Statistical Institute by Longman Scientific and Technical. and Wolfram Alpha median center value http://purl.obolibrary.org/obo/OBI_0000674 a data transformation which finds principal components by applying the non-linear iterative partial least squares algorithm term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra NIPALS https://cran.r-project.org/web/packages/nipals/vignettes/nipals_algorithm.pdf nipals(x, ncomp = min(nrow(x), ncol(x)), center = TRUE, scale = TRUE, maxiter = 500, tol = 1e-06, startcol = 0, fitted = FALSE, force.na = FALSE, gramschmidt = TRUE, verbose = FALSE) https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/nipals non-linear iterative Partial Least Squares A novel algorithm for partial least squares (PLS) regression, SIMPLS, is proposed which calculates the PLS factors directly as linear combinations of the original variables. The PLS factors are determined such as to maximize a covariance criterion, while obeying certain orthogonality and normalization restrictions. This approach follows that of other traditional multivariate methods. The construction of deflated data matrices as in the nonlinear iterative partial least squares (NIPALS)-PLS algorithm is avoided.
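The arithmetic mean formula A = sum[Ai] / n and the odd/even median rule given above translate directly into code; a minimal Python sketch (function names illustrative) that also shows the outlier sensitivity mentioned in the arithmetic-mean definition:

```python
def arithmetic_mean(values):
    """A = sum(Ai) / n."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted observations; for an even count,
    the average of the two middle values."""
    s, n = sorted(values), len(values)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

data = [1, 3, 3, 6, 7, 8, 100]  # the outlier 100 pulls the mean, not the median
print(arithmetic_mean(data))    # about 18.29
print(median(data))             # 6
```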
For univariate y SIMPLS is equivalent to PLS1 and closely related to existing bidiagonalization algorithms. This follows from an analysis of PLS1 regression in terms of Krylov sequences. For multivariate Y there is a slight difference between the SIMPLS approach and NIPALS-PLS2. In practice the SIMPLS algorithm appears to be fast and easy to interpret as it does not involve a breakdown of the data sets. The acronym SIMPLS comes from 'straightforward implementation of a statistically inspired modification of the PLS method' term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra SIMPLS: An alternative approach to partial least squares regression Sijmen de Jong https://doi.org/10.1016/0169-7439(93)85002-X simpls(X, Y, ncomp, stripped = FALSE, ...) https://www.rdocumentation.org/packages/cocorresp/versions/0.3-0/topics/simpls SIMPLS a partial least square regression applied when there is only one variable in Y (the matrix of response variables), or it is desirable to model and optimize separately the performance of each of the variables in Y. This case is usually referred to as PLS1 regression (J = 1). term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra A comparison of nine PLS1 algorithms, Martin Andersson. https://doi.org/10.1002/cem.1248 plsreg1(x, y, nc = 2, cv = FALSE) https://www.rdocumentation.org/packages/plspm/versions/0.2-2/topics/plsreg1 PLS1 a partial least square regression applied to a multivariate response variable. 
term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra a partial least square regression applied when the Y matrix of response variables is truly multivariate (J > 1) https://doi.org/10.1016/0003-2670(86)80028-9 plsreg2(X, Y, nc = 2) https://www.rdocumentation.org/packages/plspm/versions/0.2-2/topics/plsreg2 PLS2 improved kernel PLS is a data transformation which implements a very fast kernel algorithm for updating PLS models in a recursive manner and for exponentially discounting past data. term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291099-128X%28199701%2911%3A1%3C73%3A%3AAID-CEM435%3E3.0.CO%3B2-%23 kernelpls.fit(X, Y, ncomp, stripped = FALSE, ...) https://www.rdocumentation.org/packages/pls/versions/2.6-0/topics/kernelpls.fit improved Kernel PLS variable importance in projection is a measure computed as part of a partial least square regression which accumulates the importance of each variable j as reflected by the weight w from each component. term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra VIP https://doi.org/10.1016/j.chemolab.2012.07.010 and S. Wold, E. Johansson, M. Cocchi PLS: Partial Least Squares Projections to Latent Structures, 3D QSAR in drug design, 1 (1993), pp. 523-550 vip(object) https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/vip variable importance in projection a data transformation which computes the singular-value decomposition of a rectangular matrix. The singular-value decomposition is very general in the sense that it can be applied to any m × n matrix whereas eigenvalue decomposition can only be applied to certain classes of square matrices.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra adapted from wikipedia: https://en.wikipedia.org/wiki/Singular-value_decomposition last accessed: 24/08/2018 svd(x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE) https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/svd numpy.linalg.svd(a, full_matrices=True, compute_uv=True) https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.svd.html singular value decomposition best linear unbiased estimator Philippe Rocca-Serra Henderson C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada. ftp://tech.obihiro.ac.jp/suzuki/Henderson.pdf BLUE best linear unbiased estimator of the fixed effect best linear unbiased estimator https://www.ncbi.nlm.nih.gov/pubmed/24033541 "An experiment was conducted to investigate the effect of a prebiotic on performance of partridge. The experiment was carried out with a total of eighty-day-old male Chukar partridge (Alectoris chukar chukar) chicks in a completely randomized design. The dietary treatments consisted of a control and an experimental treatment, and each treatment was replicated four times with 10 chicks per replicate." a completely randomized design is a type of design of experiment where the observation units receive treatments (independent variable levels) entirely at random. In other words, the observation units are randomly assigned to treatments. Completely randomized designs differ from randomized complete block designs and should not be confused with them: in the latter, a blocking variable is first used to assign experimental units to blocks, and only then are the members of each block randomly assigned to different treatment groups term request by Hanna Cwiek, https://github.com/ISA-tools/stato/issues/61 Philippe Rocca-Serra adapted from http://www.stat.yale.edu/Courses/1997-98/101/expdes.htm and from A Dictionary of Statistics (3 ed.)
, Graham Upton and Ian Cook, Publisher: Oxford University Press, Print Publication Date: 2014, Print ISBN-13: 9780199679188 http://animsci.agrenv.mcgill.ca/StatisticalMethodsII/R/crd/index.html completely randomized design the Wald statistic is a statistic used in a Wald test, a test of significance of the regression coefficient; it is based on the asymptotic normality property of maximum likelihood estimates, and is computed as: W = b * (1/Var(b)) * b. In this formula, b stands for the parameter estimates, and Var(b) stands for the asymptotic variance of the parameter estimates. The Wald statistic is tested against the Chi-square distribution in the Wald test. term request from Hanna Cwiek: https://github.com/ISA-tools/stato/issues/67 Philippe Rocca-Serra adapted from wikipedia and http://documentation.statsoft.com/STATISTICAHelp.aspx?path=glossary/GlossaryTwo/W/WaldStatistic Wald statistic degree of freedom calculation is a data transformation which is part of a statistical test and which aims to determine or estimate the number of degrees of freedom in a system. term request from Hanna Cwiek: https://github.com/ISA-tools/stato/issues/68 Philippe Rocca-Serra STATO degree of freedom calculation a restricted randomized design is a kind of study design which uses randomization to allocate observation units to treatments but where intuitively poor allocations of treatments to experimental units are avoided, while retaining the theoretical benefits of randomization. This is often the case when so-called 'hard to change' factors are used in an experimental design. Philippe Rocca-Serra adapted from wikipedia restricted randomized design https://www.nature.com/articles/srep35323/tables/1 the percentage of variance is an output of Principal Component Analysis which is obtained by forming the ratio of an eigenvalue to the sum of all eigenvalues. This produces a "percentage of variance" for each eigenvector.
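The percentage-of-variance computation just described can be sketched in Python (the data matrix X is a toy example of ours; scikit-learn's sklearn.decomposition.PCA, cited below, exposes the same quantity as explained_variance_ratio_):

```python
import numpy as np

# Toy data matrix (our own example): 100 observations of 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Eigenvalues of the covariance matrix, sorted largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Percentage of variance: each eigenvalue over the sum of all eigenvalues,
# one value per eigenvector (principal component).
pov = eigvals / eigvals.sum()
print(np.isclose(pov.sum(), 1.0))  # True: the ratios sum to one
```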
Philippe Rocca-Serra adapted from: https://stats.stackexchange.com/questions/31908/what-is-percentage-of-variance-in-pca PoV http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html percentage of variance the scaled identity covariance structure is a type of covariance structure which has constant variance. The assumption is that there is no correlation between any elements. Hanna Cwiek Philippe Rocca-Serra Tom Nichols adapted from https://www.ibm.com/support/knowledgecenter/en/SSLVMB_24.0.0/spss/advanced/covariance_structures.html scaled identity covariance structure material anatomical entity Anatomical entity that has mass. material anatomical entity anatomical cluster Anatomical group that has its parts adjacent to one another. anatomical cluster length unit A unit which is a standard measure of the distance between two points. length unit mass unit A unit which is a standard measure of the amount of matter/energy of a physical object. mass unit time unit A unit which is a standard measure of the dimension in which events occur in sequence. time unit PLS weight is an information content entity which is generated when performing a Partial Least Squares analysis term request from Ralf Weber and Gavin Lloyd, University of Birmingham Philippe Rocca-Serra PLS weight a dataset which is made up of pedigree information, that is, information presenting the ancestry or lineage of a set of individuals of an organism.
Philippe Rocca-Serra pedigree data set this is experimental, do not use for markup, no STATO ID assigned Philippe Rocca-Serra response variable explained by fixed effect of predictor variable hypothesis this is experimental, do not use for markup, no STATO ID assigned Philippe Rocca-Serra response variable explained by interaction effect of predictor variables hypothesis this is experimental, do not use for markup, no STATO ID assigned Philippe Rocca-Serra response variable explained by random effect of predictor variable hypothesis variance component estimate example to be eventually removed Class has all its metadata, but is either not guaranteed to be in its final location in the asserted IS_A hierarchy or refers to another class that is not complete. metadata complete term created to ease viewing/sorting of terms for development purposes, and will not be included in a release PERSON:Alan Ruttenberg organizational term Class has undergone final review, is ready for use, and will be included in the next release. Any class lacking "ready_for_release" should be considered likely to change place in hierarchy, have its definition refined, or be obsoleted in the next release. Those classes deemed "ready_for_release" will also be derived from a chain of ancestor classes that are also "ready_for_release." ready for release Class is being worked on; however, the metadata (including definition) are not complete or sufficiently clear to the branch editors. metadata incomplete Nothing done yet beyond assigning a unique class ID and proposing a preferred term. uncurated All definitions, placement in the asserted IS_A hierarchy and required minimal metadata are complete. The class is awaiting a final review by someone other than the term editor. pending final vetting Terms with this status should eventually be replaced with a term from another ontology.
Alan Ruttenberg group:OBI to be replaced with external ontology term A term that is metadata complete, has been reviewed, and problems have been identified that require discussion before release. Such a term requires editor note(s) to identify the outstanding issues. Alan Ruttenberg group:OBI requires discussion ## Elucidation This is used when the statement/axiom is assumed to hold true 'eternally' ## How to interpret (informal) First the "atemporal" FOL is derived from the OWL using the standard interpretation. This axiom is temporalized by embedding the axiom within a for-all-times quantified sentence. The t argument is added to all instantiation predicates and predicates that use this relation. ## Example Class: nucleus SubClassOf: part_of some cell forall t : forall n : instance_of(n,Nucleus,t) implies exists c : instance_of(c,Cell,t) part_of(n,c,t) ## Notes This interpretation is *not* the same as an at-all-times relation axiom holds for all times a false positive rate whose value is 5 per cent Following discussion with OBCS, deprecation of class STATO_0000043 and creation of instance Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO 5 % false positive rate a false positive rate whose value is 1 per cent Following discussion with OBCS, deprecation of class STATO_0000281 and creation of instance Alejandra Gonzalez-Beltran Orlaith Burke Philippe Rocca-Serra STATO 1 % false positive rate en Ontology for Biomedical Investigations Advisors for this project come from the IFOMIS group, Saarbruecken and from the Co-ODE group in Manchester Alan Ruttenberg Allyson Lister Barry Smith Bill Bug Bjoern Peters Carlo Torniai Chris Mungall Chris Stoeckert Chris Taylor Christian Bolling Cristian Cocos Daniel Rubin Daniel Schober Dawn Field Dirk Derom Elisabetta Manduchi Eric Deutsch Frank Gibson Gilberto Fragoso Helen C. Causton Helen Parkinson Holger Stenzhorn James A.
Overton James Malone Jay Greenbaum Jeffrey Grethe Jennifer Fostel Jessica Turner Jie Zheng Joe White John Westbrook Kevin Clancy Larisa Soldatova Lawrence Hunter Liju Fan Luisa Montecchi Matthew Brush Matthew Pocock Melanie Courtot Melissa Haendel Mervi Heiskanen Monnie McGee Norman Morrison Philip Lord Philippe Rocca-Serra Pierre Grenon Richard Bruskiewich Richard Scheuermann Robert Stevens Ryan R. Brinkman Stefan Wiemann Susanna-Assunta Sansone Tanya Gray Tina Hernandez-Boussard Trish Whetzel Yongqun He 2009-07-31 The Ontology for Biomedical Investigations (OBI) is built through a collaborative, international effort and will serve as a resource for annotating biomedical investigations, including the study design, protocols and instrumentation used, the data generated and the types of analysis performed on the data. This ontology arose from the Functional Genomics Investigation Ontology (FuGO) and will contain both terms that are common to all biomedical investigations, including functional genomics investigations, and terms that are more domain-specific. OWL-DL An ontology for the annotation of biomedical and functional genomics experiments. http://creativecommons.org/licenses/by/4.0/ Please cite the OBI consortium http://purl.obolibrary.org/obo/obi where traditional citation is called for. However, it is adequate that individual terms be attributed simply by use of the identifying PURL for the term, in projects that refer to them. 2018-05-23