"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df = pd.DataFrame.from_dict(assm_level, orient='index', columns=['count'])\n",
"display(df)\n",
"df.plot(kind='pie', y='count', figsize=(6,6), title='Assemblies by level',)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Assemblies grouped and counted by annotation release number\n",
"\n",
"All RefSeq assemblies are annotated and each annotation release is numbered, starting from 100. A quick way to check if the latest annotation is the first time an assembly for that organism was annotated is to check the annotation release number. Anything above 100 can be interpreted to have been through multiple annotations. \n",
"\n",
"For example, in the analysis shown below, the human assembly has an annotation release number 109 indicating that a human assembly was annotated multiple times. On the other hand, the silvery gibbon assembly has an annotation release number of 100 indicating that this is the first time an assembly from this organism was annotated. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({100: 9, 101: 8, 103: 3, 105: 3, 102: 3, 104: 2, 109: 1})\n"
]
}
],
"source": [
"## out of the 28 RefSeq assemblies, how many have been annotated more than once? \n",
"annot_counter = Counter()\n",
"for assembly in map(lambda d: d.assembly, genome_summary.assemblies):\n",
" if assembly.assembly_accession.startswith('GCF') and assembly.annotation_metadata and assembly.annotation_metadata.release_number:\n",
" rel = int(assembly.annotation_metadata.release_number.split('.')[0])\n",
" annot_counter[rel] += 1\n",
"pprint(annot_counter)"
]
},
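{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counter printed above can answer the question in the code comment directly. The following is a minimal sketch, assuming the counts are copied by hand from that printed output rather than fetched from the API; `first_time` and `reannotated` are illustrative names, not part of the Datasets client.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Counts copied from the output above: annotation release number -> assemblies\n",
"annot_counter = Counter({100: 9, 101: 8, 103: 3, 105: 3, 102: 3, 104: 2, 109: 1})\n",
"\n",
"# Release 100 marks a first annotation; anything higher is a re-annotation\n",
"first_time = annot_counter[100]\n",
"reannotated = sum(n for rel, n in annot_counter.items() if rel > 100)\n",
"\n",
"print(f'Annotated once: {first_time}')\n",
"print(f'Annotated more than once: {reannotated}')\n",
"```\n",
"\n",
"With these counts, 9 assemblies carry a first annotation and 20 have been annotated more than once."
]
},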
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhEAAAFuCAYAAAA/AkqbAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVnUlEQVR4nO3de2zVd/348VfpkXEL41KBFHahApKBwpDIMuMUd2KMEcbCJW5owthmBKeB4ObUzcTsVrMgZhFhIoIx0QXRgaIupqIujjguzoShMgk4QRQsXTGstFh6vn/stwZ+0E+7N6ec0/J4/LWenn14fV7r6DOfz+lpRaFQKAQAwFvUp9QDAAA9k4gAAJKICAAgiYgAAJKICAAgiYgAAJKICAAgSa67Dnz06NHuOvRbUlVVFfX19aUeo2zZTzb7yWY/HbObbPaTrdz2U11dfdHHXYkAAJKICAAgiYgAAJJ022siAKAnKhQK0dzcHG1tbVFRUVGSGY4dOxYtLS2X9c8sFArRp0+f6NevX5fPW0QAwDmam5vjbW97W+RypfsWmcvlorKy8rL/ua2trdHc3Bz9+/fv0vPdzgCAc7S1tZU0IEopl8tFW1tbl58vIgDgHKW6hVEu3sr5iwgAuIKsW7cuTp8+XZRjXZnXawCgi87eO7uox6tc99OiHu+t+s53vhNz587t8usesrgSAQBlZtOmTZHP5yOfz8dnP/vZOHz4cMyfPz/y+XwsWLAg/vnPf0ZExLJly2Lbtm3t/9748eMjImLHjh0xb968uPfee+OWW26J++67LwqFQqxfvz6OHTsW8+fPj3nz5l3ynK5EAEAZ2b9/f6xatSq2bt0aw4YNi9deey2WLVsW8+fPjwULFsQzzzwTDz/8cHz3u9/NPM7LL78c27dvj1GjRsVtt90Wu3btirvvvju+/e1vx49+9KMYNmzYJc/qSgQAlJEXXnghZs2a1f5NfujQobFnz564/fbbIyJi7ty5sXPnzk6PM3Xq1Kiuro4+ffrEpEmT4vDhw0WfVUQAQA917o9ktrW1xf/+97/2z/Xt27f9nysrK6O1tbX4f37Rj1hExXgxy7EizBFR+hfCAHBleN/73hf33HNP3HPPPe23M6ZPnx5bt26NefPmxU9+8pOYMWNGRESMGTMm9u7dG7Nnz45f/epX50VERwYNGhSnTp0qyu2Mso4IALjSvPOd74xly5bFvHnzok+fPjF58uR49NFHY/ny5bF27doYNmxYrFq1KiIiFi5cGHfddVfk8/mYOXNmDBgwoNPjL1y4MBYuXBgjR46MzZs3X9KsFYVCoXBJR+jA0aNHL/kYxf6xmkvRW69ElNvvrC839pPNfjpmN9nKeT9NTU1d+mbcnXK5XLfcfuiKi51/dXX1RZ/rNREAQBIRAQAkEREAQBIRAQDn6KaXCvYYb+X8RQQAnKNPnz4le1FjqbW2tkafPl1PAz/iCQDn6NevXzQ3N0dLS0vJfi34VVddFS0tLZf1zywUCtGnT5/o169fl/8dEQEA56ioqCjKb7i8FOX8I7DncjsDAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEiS68qTtm3bFtu3b4+Kioq45pprYunSpdG3b9/ung0AKGOdXoloaGiIX/7yl1FbWxsrV66Mtra22LFjx+WYDQAoY126ndHW1hZnzpyJs2fPxpkzZ2Lo0KHdPRcAUOY6vZ0xbNiwmDVrVixZsiT69u0bU6ZMiSlTplyO2QCAMtZpRJw6dSp27doVq1evjgEDBsTXv/71eP755+OWW24573l1dXVRV1cXERG1tbVRVVV1ycMdu+QjFE8xzqeYjt1+c3GOU5SjRIx8tnfe4srlcmX3376c2E/H7Cab/WTrKfvpNCL27t0bI0aMiMGDB0
dExIwZM+KVV165ICLy+Xzk8/n2j+vr64s8amn1tvMptt66n6qqql57bsVgPx2zm2z2k63c9lNdXX3Rxzt9TURVVVX87W9/i5aWligUCrF3794YPXp00QcEAHqWTq9EjB8/Pm666ab4whe+EJWVlXH99defd8UBALgydel9IhYsWBALFizo7lkAgB7EO1YCAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQREQAAElEBACQJNeVJ73++uuxdu3aOHz4cFRUVMSSJUtiwoQJ3T0bAFDGuhQRGzZsiKlTp8aKFSuitbU1WlpaunsuAKDMdXo7o6mpKf7yl7/Ehz70oYiIyOVyMXDgwG4fDAAob51eiTh+/HgMHjw4vvWtb8Wrr74aNTU1sWjRoujXr9/lmA8AKFOdRsTZs2fj0KFDsXjx4hg/fnxs2LAhtmzZEh//+MfPe15dXV3U1dVFRERtbW1UVVVd8nDHLvkIxVOM8ymmctpNRPntp1hyuVyvPbdisJ+O2U02+8nWU/bTaUQMHz48hg8fHuPHj4+IiJtuuim2bNlywfPy+Xzk8/n2j+vr64s3ZRnobedTbL11P1VVVb323IrBfjpmN9nsJ1u57ae6uvqij3f6moghQ4bE8OHD4+jRoxERsXfv3hgzZkxxpwMAepwu/XTG4sWL46mnnorW1tYYMWJELF26tLvnAgDKXJci4vrrr4/a2trungUA6EG8YyUAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJuhwRbW1t8cADD0RtbW13zgMA9BBdjohf/OIXMXr06O6cBQDoQboUESdOnIg//vGPceutt3b3PABAD9GliNi4cWN84hOfiIqKiu6eBwDoIXKdPWHPnj1x9dVXR01NTezbt6/D59XV1UVdXV1ERNTW1kZVVdUlD3fsko9QPMU4n2Iqp91ElN9+iiWXy/XacysG++mY3WSzn2w9ZT+dRsT+/ftj9+7d8dJLL8WZM2fi9OnT8dRTT8XnPve5856Xz+cjn8+3f1xfX1/8aUuot51PsfXW/VRVVfXacysG++mY3WSzn2zltp/q6uqLPt5pRNx5551x5513RkTEvn374mc/+9kFAQEAXHm8TwQAkKTTKxHnmjRpUkyaNKm7ZgEAehBXIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJCICAEgiIgCAJLlSDwDd4ey9s4tynGNFOUpE5bqfFulIxWE/pPK1w7lciQAAko
gIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACCJiAAAkogIACBJrrMn1NfXx+rVq6OxsTEqKioin8/HRz/60csxGwBQxjqNiMrKyvjkJz8ZNTU1cfr06XjwwQfj3e9+d4wZM+ZyzAcAlKlOb2cMHTo0ampqIiKif//+MXr06GhoaOj2wQCA8vaWXhNx/PjxOHToUIwbN6675gEAeohOb2e8qbm5OVauXBmLFi2KAQMGXPD5urq6qKuri4iI2traqKqquuThjl3yEYqnGOdTTOW0mwj76Yz9ZCun/Ry7/ebiHKcoR4kY+eyOIh2pOHztXB65XK5HnFuXIqK1tTVWrlwZ73//+2PGjBkXfU4+n498Pt/+cX19fXEmLBO97XyKzX6y2U82++mY3WTrrfupqqoqq3Orrq6+6OOd3s4oFAqxdu3aGD16dHzsYx8r+mAAQM/U6ZWI/fv3x/PPPx/XXntt3H///RERcccdd8S0adO6fTgAoHx1GhETJ06MTZs2XY5ZAIAexDtWAgBJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJRAQAkEREAABJcqUeAAB6i7P3zi7KcY4V5SgRlet+WqQjXZwrEQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACQREQBAEhEBACTJdeVJf/rTn2LDhg3R1tYWt956a8yZM6ebxwIAyl2nVyLa2tpi/fr18aUvfSlWrVoVL7zwQhw5cuRyzAYAlLFOI+LAgQMxatSoGDlyZORyubj55ptj165dl2M2AKCMdRoRDQ0NMXz48PaPhw8fHg0NDd06FABQ/rr0moiuqKuri7q6uoiIqK2tjerq6ks/6M93X/oxeiu7yWY/2eynY3aTzX6yXWH76fRKxLBhw+LEiRPtH584cSKGDRt2wfPy+XzU1tZGbW1tcSe8RA8++GCpRyhr9pPNfrLZT8fsJpv9ZOsp++k0It7xjnfEv/71rzh+/Hi0trbGjh07Yvr06ZdjNgCgjHV6O6OysjIWL14cjz32WLS1tcXMmTPjmmuuuRyzAQBlrEuviZg2bVpMmzatu2fpFvl8vtQjlDX7yWY/2eynY3aTzX6y9ZT9VBQKhUKphwAAeh5vew0AJBERAEASEQEAJBERAECSXhsRjY2NcfDgwTh48GA0NjaWepyy19zcXOoR6IFOnTpV6hHK2u7dV9a7F74VvnYu9N///jcOHToUr776ao/5O7lob3tdLv7+97/HunXroqmpqf2dNU+cOBEDBw6Mu+++O2pqako8YXlavnx5rFmzptRjlNQ//vGPePrpp6OhoSGmTp0aCxcujEGDBkVExBe/+MV44oknSjxhaf31r3+Np59+OioqKmLJkiXxzDPPtL8J3fLly2PChAmlHrGkXnzxxfM+LhQKsX79+jh79mxERMyYMaMUY5WFH//4xzF37tyIiDhy5Eg8+eST0draGhERy5Yti/Hjx5dyvJI7cuRIbNiwIY4fPx719fUxduzYOHnyZNxwww1x1113xYABA0o9Yod6XUSsXr06PvWpT13wRfnKK6/EmjVr4sknnyzRZKW3bdu2iz5eKBR6TPV2p3Xr1sX8+fNj/Pjx8etf/zq+8pWvxAMPPBCjRo1q/0
ZwJfve974Xy5cvj+bm5qitrY37778/Jk6cGAcPHowNGzbEI488UuoRS+ob3/hGTJkyJQYPHtz+WEtLS+zZsyciruyI2LlzZ3tEfP/7349FixbFjTfeGAcOHIiNGzfGo48+WuIJS2vNmjXxmc98Jqqrq+PAgQPx3HPPxeOPPx51dXWxZs2aWLFiRalH7FCvi4iWlpaLVu2ECROu+G+UP/zhD2PWrFlRWVl5wee8Xcgbt3SmTp0aERGzZ8+OmpqaePzxx+O+++6LioqK0g5XBs6ePRvXXnttREQMHjw4Jk6cGBERNTU1cebMmVKOVhYeeeSR+MEPfhDjxo2LD3/4wxERsW/fvli6dGmJJysvr732Wtx4440RETFu3DhfOxFx5syZ9l9aOW7cuDh8+HBEvPGGUz//+c9LOVqnel1ETJ06NZ544on4wAc+0P4rzE+cOBG/+93v2r9BXKnGjh0b733vey96S2f79u0lmKj8NDU1tV86nDx5cqxYsSJWrlzp/m2cH5p33HHHeZ9789L0lWzcuHHx0EMPxXPPPRdf/epXY+HCheLz/zl27Fh87Wtfi0KhECdOnIiWlpa46qqrIiJc5YuIkSNHxubNm2Py5Mmxc+fOuO666yLijf+v2traSjxdtl75jpUvvfRS7Nq1KxoaGiLijd9EOn369B771t3FcvTo0Rg0aNB5l1vf1NjYGEOGDLn8Q5WR3//+9zFixIgL7u3X19fH5s2b49Of/nSJJisPu3fvjne9613tf/m/6d///ne8+OKLcdttt5VosvLT0NAQGzdujIMHD8Y3v/nNUo9Tcn/+85/P+3js2LHRv3//aGxsjD/84Q/xkY98pESTlYfXX389nn322Thy5Ehcd911MWfOnOjfv380NTXFkSNHyvr1Rr0yIgCA7tfrbmc0NTXFs88+G7t3747GxsaoqKiIq6++OqZPnx5z5syJgQMHlnrEknlzN7t27YqTJ0/azf/HfrLZTzb76ZjdZOvJ37d63ZWIxx57LCZNmhQf/OAH2y/PNzY2xm9/+9t4+eWX46GHHirtgCVkN9nsJ5v9ZLOfjtlNtp68n173ZlPHjx+POXPmnHd/f8iQITFnzpz4z3/+U7rByoDdZLOfbPaTzX46ZjfZevJ+el1EvP3tb4+tW7ee9y6VjY2NsWXLlqiqqirdYGXAbrLZTzb7yWY/HbObbD15P73udsapU6diy5YtsXv37jh58mREvFF073nPe2LOnDnt70B4JbKbbPaTzX6y2U/H7CZbT95Pr4uILL/5zW9i5syZpR6jLNlNNvvJZj/Z7KdjdpOt3PfT625nZNm0aVOpRyhbdpPNfrLZTzb76ZjdZCv3/fS6H/H8/Oc/f9HHC4VC+2WiK5XdZLOfbPaTzX46ZjfZevJ+el1EnDx5Mr785S9f8HO1hUIhHn744RJNVR7sJpv9ZLOfbPbTMbvJ1pP30+siYtq0adHc3BzXX3/9BZ+74YYbLv9AZcRustlPNvvJZj8ds5tsPXk/V9QLKwGA4rmiXlgJABSPiAAAkogIACCJiAAAkogIACDJ/wEwAMBENTKIFAAAAABJRU5ErkJggg==\n",
"text/plain": [
"
"
],
"text/plain": [
" GCF_009731565.1 GCF_902806685.1 GCF_900239965.1 \\\n",
"assm_name Dplex_v4 iAphHyp1.1 Bicyclus_anynana_v1.2 \n",
"annot_rel_date Feb 24, 2020 Jun 05, 2020 Feb 16, 2018 \n",
"annot_rel_num 100 100 100 \n",
"assm_level Chromosome Chromosome Scaffold \n",
"num_chromosomes 31 30 None \n",
"contig_n50 108026 2012761 78697 \n",
"seq_length 248676414 408137179 475399557 \n",
"submission_date 2019-12-11 2020-02-22 2018-01-02 \n",
"\n",
" GCF_002938995.1 GCF_000836235.1 GCF_000836215.1 \\\n",
"assm_name ASM293899v1 Pxut_1.0 Ppol_1.0 \n",
"annot_rel_date Oct 03, 2018 Jul 31, 2015 Jul 30, 2015 \n",
"annot_rel_num 100 100 100 \n",
"assm_level Scaffold Scaffold Scaffold \n",
"num_chromosomes None None None \n",
"contig_n50 254123 128246 47768 \n",
"seq_length 357124929 243890167 227005758 \n",
"submission_date 2018-02-23 2015-02-02 2015-02-02 \n",
"\n",
" GCF_001298355.1 GCF_001856805.1 \n",
"assm_name Pap_ma_1.0 P_rapae_3842_assembly_v2 \n",
"annot_rel_date Oct 28, 2015 Aug 08, 2017 \n",
"annot_rel_num 100 100 \n",
"assm_level Scaffold Scaffold \n",
"num_chromosomes None None \n",
"contig_n50 92238 54957 \n",
"seq_length 278421261 245871251 \n",
"submission_date 2015-09-28 2016-10-16 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"genome_table = {}\n",
"for assembly in map(lambda d: d.assembly, genome_summary.assemblies):\n",
" if not assembly.annotation_metadata:\n",
" continue\n",
" n_chr = len(assembly.chromosomes) if assembly.assembly_level == 'Chromosome' else None\n",
" genome_table[assembly.assembly_accession] = {\n",
" 'assm_name': assembly.display_name,\n",
" 'annot_rel_date': assembly.annotation_metadata.release_date,\n",
" 'annot_rel_num': assembly.annotation_metadata.release_number,\n",
" 'assm_level': assembly.assembly_level,\n",
" 'num_chromosomes': n_chr,\n",
" 'contig_n50': assembly.contig_n50,\n",
" 'seq_length': assembly.seq_length,\n",
" 'submission_date': assembly.submission_date }\n",
"df = pd.DataFrame.from_dict(genome_table, orient='columns')\n",
"display(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Genome assembly downloads\n",
"So far, we have looked at interacting with genome summaries, which describe the essential metadata for genome assemblies. In addition to metadata, the Datasets API can be used to download a genome dataset consisting of genome, transcript, and protein sequences in FASTA format, as well as annotation data in gff3, gtf, and GenBank flat file formats.\n",
"\n",
"To illustrate, let's start by downloading a genome dataset including mitochondrial genome sequence and all protein sequences for the latest human genome assembly, GRCh38."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"assembly_accessions = ['GCF_000001405.39']\n",
"chromosomes = ['MT']\n",
"exclude_sequence = False\n",
"include_annotation_type = ['PROT_FASTA']\n",
"\n",
"api_response = api_instance.download_assembly_package(\n",
" assembly_accessions,\n",
" chromosomes=chromosomes,\n",
" exclude_sequence=exclude_sequence,\n",
" include_annotation_type=include_annotation_type,\n",
" # Because we are streaming back the results to disk, \n",
" # we should defer reading/decoding the response\n",
" _preload_content=False\n",
")\n",
"\n",
"with open('human_assembly.zip', 'wb') as f:\n",
" f.write(api_response.data)\n"
]
},
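{
"cell_type": "markdown",
"metadata": {},
"source": [
"For large archives it can help to copy the response body to disk in chunks instead of reading it into memory in one piece. The following is a minimal sketch, assuming the object returned with `_preload_content=False` behaves like a readable binary file (as a `urllib3` response does); `save_stream` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the real response so that the example runs without network access.\n",
"\n",
"```python\n",
"import io\n",
"import os\n",
"import shutil\n",
"import tempfile\n",
"\n",
"def save_stream(response, path, chunk_size=64 * 1024):\n",
"    '''Copy a file-like response body to path in chunks of chunk_size bytes.'''\n",
"    with open(path, 'wb') as f:\n",
"        shutil.copyfileobj(response, f, chunk_size)\n",
"\n",
"# Stand-in for the HTTP response (assumption: the real object supports .read(n))\n",
"fake_response = io.BytesIO(b'x' * 200000)\n",
"out_path = os.path.join(tempfile.mkdtemp(), 'example_download.zip')\n",
"save_stream(fake_response, out_path)\n",
"print(os.path.getsize(out_path))\n",
"```\n",
"\n",
"With the real client, `fake_response` would be replaced by `api_response`, provided that assumption about the response object holds."
]
},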
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll unzip the downloaded zip archive. All data is contained in ncbi_dataset/data. Data that is specific to the human reference genome, GRCh38, is contained within a subdirectory named with that assembly accession, GCF_000001405.39. \n",
"The data directory contains five files:\n",
" 1. The assembly data report contains assembly information like sequence names, NCBI accessions, UCSC-style chromosome names, and annotation statistics (gene counts). Note that this file is directly under the data directory and not in the subdirectory named with the assembly accession. When the genome dataset contains data for multiple assemblies, genome assembly metadata for all of these assemblies is contained in the `assembly_data_report.jsonl` file\n",
" 2. The sequence report (`sequence_report.jsonl`) contains a list of the sequences that comprise the GRCh38 assembly\n",
" 3. The nucleotide sequence in FASTA (nucleotide) format for the one \"chromosome\" we requested: `chrMT.fna`\n",
" 4. All protein sequences in FASTA (amino acid) format: `protein.faa`\n",
" 5. And finally, a dataset catalog file (`dataset_catalog.json`) that describes the contents of the archive, to aid in programmatic access.\n",
"Read more about the contents in the [download assembly command\n",
" section of the documentation](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-assembly/). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: human_assembly.zip\r\n",
" Length Method Size Cmpr Date Time CRC-32 Name\r\n",
"-------- ------ ------- ---- ---------- ----- -------- ----\r\n",
" 661 Defl:N 384 42% 11-16-2020 13:15 bc3c97af README.md\r\n",
" 1044 Defl:N 573 45% 11-16-2020 13:15 875a7b5b ncbi_dataset/data/assembly_data_report.jsonl\r\n",
" 16834 Defl:N 5379 68% 11-16-2020 13:15 932c3ae8 ncbi_dataset/data/GCF_000001405.39/chrMT.fna\r\n",
"85815507 Defl:N 26280452 69% 11-16-2020 13:15 04376109 ncbi_dataset/data/GCF_000001405.39/protein.faa\r\n",
" 211 Defl:N 166 21% 11-16-2020 13:15 28e03896 ncbi_dataset/data/GCF_000001405.39/sequence_report.jsonl\r\n",
" 499 Defl:N 211 58% 11-16-2020 13:15 2cc86c6a ncbi_dataset/data/dataset_catalog.json\r\n",
"-------- ------- --- -------\r\n",
"85834756 26287165 69% 6 files\r\n"
]
}
],
"source": [
"!unzip -v human_assembly.zip"
]
},
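{
"cell_type": "markdown",
"metadata": {},
"source": [
"The archive can also be inspected from Python without extracting it. The following is a minimal sketch using only the standard library; `read_jsonl` is a hypothetical helper, and the archive is a tiny in-memory stand-in whose single record is made up for illustration (the real `assembly_data_report.jsonl` has different fields).\n",
"\n",
"```python\n",
"import io\n",
"import json\n",
"import zipfile\n",
"\n",
"def read_jsonl(archive, member):\n",
"    '''Return one parsed JSON object per non-empty line of a .jsonl member.'''\n",
"    with zipfile.ZipFile(archive) as zf:\n",
"        with zf.open(member) as fh:\n",
"            return [json.loads(line) for line in fh if line.strip()]\n",
"\n",
"# Build a stand-in archive that mimics the layout described above\n",
"member = 'ncbi_dataset/data/assembly_data_report.jsonl'\n",
"buf = io.BytesIO()\n",
"with zipfile.ZipFile(buf, 'w') as zf:\n",
"    zf.writestr(member, json.dumps({'example_field': 'example_value'}) + '\\n')\n",
"buf.seek(0)\n",
"\n",
"records = read_jsonl(buf, member)\n",
"print(records)\n",
"```\n",
"\n",
"Passing the path `'human_assembly.zip'` instead of `buf` would read the report from the downloaded archive."
]
},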
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using genome summary data to request genome datasets\n",
"\n",
"When you need to download a genome dataset for a particular taxonomic group, you'll need to first get the list of genome assembly accessions, then you can query by accession to download the data that you're interested in.\n",
"\n",
"In this example, we'll download a genome dataset for a list of bird RefSeq genomes annotated in 2020.\n",
"\n",
"1. Fetch a list of RefSeq assembly accessions for all rodent genomes using `assembly_descriptors_by_taxid` \n",
"2. Filter assemblies that were annotated in 2020\n",
"3. Download data, but in this case, retrieve a dehydrated zip archive that can be rehydrated later to obtain the sequence data itself."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of assemblies: 30\n"
]
}
],
"source": [
"genome_summary = api_instance.assembly_descriptors_by_taxon(\n",
" taxon=9989, ## Rodents taxid\n",
" limit='all',\n",
" filters_refseq_only=True)\n",
"\n",
"print(f'Number of assemblies: {genome_summary.total_count}')"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assemblies grouped by year of annotation\n",
"[(2015, 3), (2016, 4), (2017, 5), (2018, 4), (2019, 5), (2020, 9)]\n"
]
}
],
"source": [
"annots_by_year = Counter()\n",
"for assembly in map(lambda d: d.assembly, genome_summary.assemblies):\n",
" annot_year = int(assembly.annotation_metadata.release_date.split(' ')[-1])\n",
" annots_by_year[annot_year] += 1\n",
" \n",
"print(f'Assemblies grouped by year of annotation')\n",
"pprint(sorted(annots_by_year.items()))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rodent assemblies that were annotated in 2020:\n",
"GCF_003676075.2, GCF_012274545.1, GCF_000223135.1, GCF_003668045.3, GCF_903995425.1, GCF_004664715.2, GCF_011762505.1, GCF_000001635.27, GCF_011064425.1\n"
]
}
],
"source": [
"rodents_annotated_in_2020_accs = []\n",
"for assembly in map(lambda d: d.assembly, genome_summary.assemblies):\n",
" annot_year = int(assembly.annotation_metadata.release_date.split(' ')[-1])\n",
" if annot_year == 2020:\n",
" rodents_annotated_in_2020_accs.append(assembly.assembly_accession)\n",
" \n",
"print('Rodent assemblies that were annotated in 2020:')\n",
"print(f'{\", \".join(rodents_annotated_in_2020_accs)}')"
]
},
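{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above take the year as the last space-separated token of the release date. An alternative that checks the whole date string is to parse it with `datetime.strptime`. The following is a minimal sketch, assuming release dates always look like `Feb 24, 2020` (the form shown in the tables in this notebook) and that the month abbreviations are parsed under an English locale; `annotation_year` is a hypothetical helper.\n",
"\n",
"```python\n",
"from datetime import datetime\n",
"\n",
"def annotation_year(release_date):\n",
"    '''Parse a date such as 'Feb 24, 2020' and return the year as an int.'''\n",
"    return datetime.strptime(release_date, '%b %d, %Y').year\n",
"\n",
"# Dates copied from the annot_rel_date rows shown in this notebook\n",
"years = [annotation_year(d) for d in ('Feb 24, 2020', 'Jul 31, 2015')]\n",
"print(years)\n",
"```\n",
"\n",
"A string that does not match the expected format raises `ValueError`, which makes unexpected date formats easier to notice."
]
},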
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
GCF_004785775.1
\n",
"
GCF_900095145.1
\n",
"
GCF_900094665.1
\n",
"
GCF_008632895.1
\n",
"
GCF_000622305.1
\n",
"
\n",
" \n",
" \n",
"
\n",
"
assm_name
\n",
"
NIH_TR_1.0
\n",
"
PAHARI_EIJ_v1.1
\n",
"
CAROLI_EIJ_v1.1
\n",
"
UCSF_Mcou_1
\n",
"
S.galili_v1.0
\n",
"
\n",
"
\n",
"
org_name
\n",
"
Grammomys surdaster
\n",
"
shrew mouse
\n",
"
Ryukyu mouse
\n",
"
southern multimammate mouse
\n",
"
Upper Galilee mountains blind mole rat
\n",
"
\n",
"
\n",
"
sci_name
\n",
"
Grammomys surdaster
\n",
"
Mus pahari
\n",
"
Mus caroli
\n",
"
Mastomys coucha
\n",
"
Nannospalax galili
\n",
"
\n",
"
\n",
"
annot_rel_date
\n",
"
Apr 18, 2019
\n",
"
Jun 14, 2019
\n",
"
Jun 07, 2019
\n",
"
Oct 18, 2019
\n",
"
Jun 05, 2019
\n",
"
\n",
"
\n",
"
annot_rel_num
\n",
"
100
\n",
"
101
\n",
"
101
\n",
"
100
\n",
"
102
\n",
"
\n",
"
\n",
"
assm_level
\n",
"
Scaffold
\n",
"
Chromosome
\n",
"
Chromosome
\n",
"
Chromosome
\n",
"
Scaffold
\n",
"
\n",
"
\n",
"
num_chromosomes
\n",
"
None
\n",
"
25
\n",
"
22
\n",
"
4
\n",
"
None
\n",
"
\n",
"
\n",
"
contig_n50
\n",
"
51731
\n",
"
29465
\n",
"
30917
\n",
"
30483
\n",
"
30353
\n",
"
\n",
"
\n",
"
seq_length
\n",
"
2412664998
\n",
"
2475012951
\n",
"
2553112587
\n",
"
2507168619
\n",
"
3061408210
\n",
"
\n",
"
\n",
"
submission_date
\n",
"
2019-04-12
\n",
"
2017-04-28
\n",
"
2017-04-28
\n",
"
2019-09-24
\n",
"
2014-06-05
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" GCF_004785775.1 GCF_900095145.1 GCF_900094665.1 \\\n",
"assm_name NIH_TR_1.0 PAHARI_EIJ_v1.1 CAROLI_EIJ_v1.1 \n",
"org_name Grammomys surdaster shrew mouse Ryukyu mouse \n",
"sci_name Grammomys surdaster Mus pahari Mus caroli \n",
"annot_rel_date Apr 18, 2019 Jun 14, 2019 Jun 07, 2019 \n",
"annot_rel_num 100 101 101 \n",
"assm_level Scaffold Chromosome Chromosome \n",
"num_chromosomes None 25 22 \n",
"contig_n50 51731 29465 30917 \n",
"seq_length 2412664998 2475012951 2553112587 \n",
"submission_date 2019-04-12 2017-04-28 2017-04-28 \n",
"\n",
" GCF_008632895.1 \\\n",
"assm_name UCSF_Mcou_1 \n",
"org_name southern multimammate mouse \n",
"sci_name Mastomys coucha \n",
"annot_rel_date Oct 18, 2019 \n",
"annot_rel_num 100 \n",
"assm_level Chromosome \n",
"num_chromosomes 4 \n",
"contig_n50 30483 \n",
"seq_length 2507168619 \n",
"submission_date 2019-09-24 \n",
"\n",
" GCF_000622305.1 \n",
"assm_name S.galili_v1.0 \n",
"org_name Upper Galilee mountains blind mole rat \n",
"sci_name Nannospalax galili \n",
"annot_rel_date Jun 05, 2019 \n",
"annot_rel_num 102 \n",
"assm_level Scaffold \n",
"num_chromosomes None \n",
"contig_n50 30353 \n",
"seq_length 3061408210 \n",
"submission_date 2014-06-05 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"assm_table = {}\n",
"for assembly in map(lambda d: d.assembly, genome_summary.assemblies):\n",
" annot_year = int(assembly.annotation_metadata.release_date.split(' ')[-1])\n",
" if annot_year == 2019:\n",
" n_chr = len(assembly.chromosomes) if assembly.assembly_level == 'Chromosome' else None\n",
" assm_table[assembly.assembly_accession] = {\n",
" 'assm_name': assembly.display_name,\n",
" 'org_name': assembly.org.title,\n",
" 'sci_name': assembly.org.sci_name,\n",
" 'annot_rel_date': assembly.annotation_metadata.release_date,\n",
" 'annot_rel_num': assembly.annotation_metadata.release_number,\n",
" 'assm_level': assembly.assembly_level,\n",
" 'num_chromosomes': n_chr,\n",
" 'contig_n50': assembly.contig_n50,\n",
" 'seq_length': assembly.seq_length,\n",
" 'submission_date': assembly.submission_date }\n",
"df = pd.DataFrame.from_dict(assm_table, orient='columns')\n",
"display(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Download package for selected assemblies\n",
"\n",
"For the assemblies collected above, download a dehydrated data package (hydrated=DATA_REPORT_ONLY). This will only contain the data report, and defer collection of nucleotide and protein sequence data until rehydration."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download a dehydrated package for ['GCF_003676075.2', 'GCF_012274545.1', 'GCF_000223135.1', 'GCF_003668045.3', 'GCF_903995425.1', 'GCF_004664715.2', 'GCF_011762505.1', 'GCF_000001635.27', 'GCF_011064425.1'], with the ability to rehydrate with the CLI later on.\n",
"Download complete\n",
"CPU times: user 12.9 ms, sys: 6.34 ms, total: 19.3 ms\n",
"Wall time: 84.2 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"print(f'Download a dehydrated package for {rodents_annotated_in_2020_accs}, with the ability to rehydrate with the CLI later on.')\n",
"api_response = api_instance.download_assembly_package(\n",
" rodents_annotated_in_2020_accs,\n",
" exclude_sequence=True,\n",
" hydrated='DATA_REPORT_ONLY',\n",
" _preload_content=False )\n",
"\n",
"zipfile_name = 'rodent_genomes.zip'\n",
"with open(zipfile_name, 'wb') as f:\n",
" f.write(api_response.data)\n",
"\n",
"print('Download complete')"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"caution: not extracting; -d ignored\r\n",
"Archive: rodent_genomes.zip\r\n",
" Length Method Size Cmpr Date Time CRC-32 Name\r\n",
"-------- ------ ------- ---- ---------- ----- -------- ----\r\n",
" 661 Defl:N 384 42% 11-16-2020 16:38 bc3c97af README.md\r\n",
" 13280 Defl:N 3383 75% 11-16-2020 16:38 76d1b5cf ncbi_dataset/data/assembly_data_report.jsonl\r\n",
" 1592 Defl:N 262 84% 11-16-2020 16:38 c7576a15 ncbi_dataset/data/dataset_catalog.json\r\n",
" 1645 Defl:N 366 78% 11-16-2020 16:38 1057cdf0 ncbi_dataset/fetch.txt\r\n",
"-------- ------- --- -------\r\n",
" 17178 4395 74% 4 files\r\n"
]
}
],
"source": [
"!unzip -v {zipfile_name}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Rehydrate data package\n",
"To rehydrate, use the [NCBI Datasets command-line application](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/). For example, the following commands illustrate the process for Linux \n",
"```\n",
"curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets' \n",
"chmod +x datasets\n",
"# specify the directory that contains the extracted zip archive after the directory flag\n",
"./datasets rehydrate --directory rodent_genomes/\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}