{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "For details see https://skeptric.com/schema-jobposting" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import gzip\n", "import rdflib\n", "from urllib.request import urlretrieve\n", "from pathlib import Path\n", "\n", "from tqdm.notebook import tqdm" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sys.path.insert(0, '../src')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from lib.rdftool import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data from http://webdatacommons.org/structureddata/2019-12/stats/schema_org_subsets.html\n", "\n", "Download both the microdata (1.9 GB) and the JSON-LD (700 MB)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "DEST_DIR = Path('..') / 'data' / 'webcommons'\n", "DEST_DIR.mkdir(parents=True, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# tqdm subclass whose update_to matches the reporthook signature of urlretrieve\n", "class TqdmUpTo(tqdm):\n", "    def update_to(self, b=1, bsize=1, tsize=None):\n", "        if tsize is not None:\n", "            self.total = tsize\n", "        self.update(b * bsize - self.n)  # will also set self.n = b * bsize\n", "\n", "# Download url to filename, skipping files that already exist unless overwrite is set\n", "def download(url, filename, overwrite=False):\n", "    filename = Path(filename)\n", "    if (not filename.exists()) or overwrite:\n", "        with TqdmUpTo(unit='B', unit_scale=True, unit_divisor=1024, miniters=1, desc=filename.name) as t:\n", "            urlretrieve(url, filename=filename, reporthook=t.update_to)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "JOBS_JSON_2019 = DEST_DIR / '2019-12_json_JobPosting.gz'" ] }, { "cell_type": "code", "execution_count": 31, 
"metadata": {}, "outputs": [], "source": [ "JOBS_MD_2019 = DEST_DIR / '2019-12_md_JobPosting.gz'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "download('http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/classspecific/json/schema_JobPosting.gz',\n", "         JOBS_JSON_2019)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "download('http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/classspecific/md/schema_JobPosting.gz',\n", "         JOBS_MD_2019)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[N-Quads](https://www.w3.org/TR/n-quads): each line is a statement of the form Subject Predicate Object Graph.\n", "\n", "First few lines:\n", "```\n", "(node with id) (has schema type) (Job posting) (from URL)\n", "(same node) (has identifier) (another node) (from same URL)\n", "(same node) (has title) \"Category Manager - Prof. Audio Visual Solutions\" (from same URL)\n", "(same node) (has description) (doubly encoded HTML job description) (from same URL)\n", "(same node) (has hiring organisation) (hirer node) (from same URL)\n", "...\n", "(hirer node) (has schema type) (Organization) (from same URL)\n", "(hirer node) (has name) \"Anixter International\" (from same URL)\n", "...\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 _:genid2d8020c9b7d2294a778072a41d6d59640a2db2 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 \"Category Manager - Prof. 
Audio Visual Solutions\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 \"<p><strong>Category Manager - Professional Audio Visual Solutions<br /><br />Company Information<br /><br /></strong>Anixter is a Fortune 500 company and a leading global supplier of communication and security products and electrical and electronic wire and cable. Our high-performing team works closely with customers and the community to better understand their business challenges specify cost-saving solutions and make informed purchasing decisions around technologies, applications and relevant standards. Please view our video for more information <a href="https://www.anixter.com/en_us/about-us.html">about Anixter</a>.<br /><br />Anixter offers competitive salary and a bonus program to reward your results. We are known for our exceptional training and on-going development programs to support your career growth including a tuition reimbursement. We provide our employees excellent benefits including medical, dental, 401(k) with employer match, and additional company provided retirement benefits. <strong><br /><br /><br /></strong><strong>Position Purpose:</strong> </p><p>This Category Manager position will be primarily focused on managing a portfolio of professional A/V product solutions for a variety of enterprise and commercial environments. Prior experience with suppliers and technologies within the professional A/V market is strongly desired. </p><p>In the role you will be responsible for managing multiple supplier relationships/programs and creating marketing plans to promote growth and profitability. 
You will become the supplier’s primary point of contact and a valuable strategic resource for the sales team.</p><p><strong>Responsibilities include:</strong></p><ul><li>Developing profitable growth strategies with key suppliers supported by executable initiatives that deliver results in line with short and long term company goals</li><li>Build and maintain outstanding supplier relationships.</li><li>Articulate supplier’s value and differentiating features & benefits to internal sales team.</li><li>Articulate Anixter’s value, capabilities and differentiating features to our supplier partners and customers.</li><li>Monitor inventory levels and product performance. Work with Inventory Management team to develop inventory models and replenishment strategies</li><li>Lead and implement business performance reviews, developmental plans, and supplier negotiations</li><li>Develop and maintain sales tools.</li><li>Maintain appropriate product information databases and internal website.</li><li>Understand key business drivers for product categories to support sales growth. </li></ul><p><strong>Requirements:</strong></p><ul><li>Minimum 5 years' experience in Sales/Marketing </li><li>Professional A/V market experience preferred</li><li>Post-secondary education in related field or equivalent related work experience.</li><li>Ability to exceed expectations through relentless execution of a plan.</li><li>Strong communication and presentation skills. </li><li>Possess the ability to work independently, as well as a strong team player.</li><li>Ability to thrive in a fast-paced environment where continuous learning is required in order to grow personally and professionally.</li><li>Computer skills; MS Office (Word, Excel, Access, Power Point)</li></ul><p><strong>Work Environment</strong> <br />Our founders developed the Blue Book more than 40 years ago to present the beliefs and ethos that define our business style. 
While we have grown and changed dramatically since we were established in 1957, one thing has remained constant: our commitment to the values presented in the Blue Book. You can review <a href="http://goo.gl/ZabyOl">The Blue Book here</a>.</p><p><br /><strong><em>Anixter is an Equal Opportunity and Affirmative Action Employer; Minority / Female / Disabled / Veteran. We require all of our employees to perform work in an ethical manner and uphold a culture of honesty and ethics at all times.</em></strong></p><p><br /><br /><a href="http://jobs.anixter.com/apply-us?JOBSHAREMRE3MDLFIYWRAN5IFRBW7A2GFVEXN6PGXQ2HWIV2XSJDB367INWSJGJREZAR7VYKUY">Click here to apply online</a></p><p><br />EB-2618554352</p>\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 _:genid2d8020c9b7d2294a778072a41d6d59640a2db1 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 \"2019-11-11\"^^ .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 _:genid2d8020c9b7d2294a778072a41d6d59640a2db3 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 \"2019-08-01 17:48:55\"^^ .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 \"FULL_TIME\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db1 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db1 \"Anixter International\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db2 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db2 \"Anixter International\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db2 \"inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db3 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db3 _:genid2d8020c9b7d2294a778072a41d6d59640a2db4 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db4 .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db4 \"Glenview\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db4 \"IL\" .\r\n", "_:genid2d8020c9b7d2294a778072a41d6d59640a2db4 \"United States\" .\r\n", "\r\n", 
"gzip: stdout: Broken pipe\r\n" ] } ], "source": [ "!zcat {JOBS_JSON_2019} | head -n 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# JSON" ] }, { "cell_type": "code", "execution_count": 356, "metadata": {}, "outputs": [], "source": [ "json_f = gzip.open(JOBS_JSON_2019, 'rt')" ] }, { "cell_type": "code", "execution_count": 357, "metadata": {}, "outputs": [], "source": [ "json_all_graphs = parse_nquads(json_f)" ] }, { "cell_type": "code", "execution_count": 358, "metadata": {}, "outputs": [], "source": [ "json_seen_domains = set()\n", "json_graphs = []" ] }, { "cell_type": "code", "execution_count": 359, "metadata": {}, "outputs": [], "source": [ "json_skipped = []" ] }, { "cell_type": "code", "execution_count": 360, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "32ae637d63c74918842dc5697d62d7b5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=100000), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "skype:raloffice?call|skype:raloffice?chat does not look like a valid URI, trying to serialize this will break.\n", "skype:raloffice?call|skype:raloffice?chat does not look like a valid URI, trying to serialize this will break.\n", "skype:raloffice?call|skype:raloffice?chat does not look like a valid URI, trying to serialize this will break.\n", "£50k OTE £100K + Full Benefits\\n\\nWe're looking for a Senior Recruiter to join our Tech team. 
The Technology Recruitment team at one of our most successful team and enjoys excellent market presence and success across a number of technologies.\\n\\nThe relationships and success we've forged in the markets have led to us recently expanding in London\\n\\nNow we'd like you to lead this growth further.\\n\\nYou will be a recognised thought leader in your niche field of tech and actively seek to develop the Brand and your personal presence within your market.\\n\\nYou will manage the full 360 recruitment process, developing new business and building your own pipeline of clients with a number working exclusively and/or on a retained basis.\\n\\nYou will have a dedicated Resourcer who will support you with candidate generation and account management of your desk.\\n\\nYou will play an active role in the development of the Business Plan for your niche field.\\n\\nYou will play a key role in identifying and managing commercial growth, opportunities and threats, developing effective strategies to ensure consistent delivery of revenue targets.\\n\\nSome of the benefits:\\n\\n* Guaranteed pay review every 6 months with dedicated personal development plan\\n\\n* Commission structure tailored for mid/high achieves, paying up to 40%\\n\\n* Dedicated Resourcer\\n\\n* Additional support & investment – Personal training courses, LI Recruiter, All Major Job Boards, Odro - The Intelligent Selection Process & Exhibiting & Sponsoring specialist UK & International Events & Talks.\\n\\n* Laptop / Phone / Healthcare / Gym / Pension Contribution\\n\\n* Flexible working hours | Remote working\\n\\n* Extended Lunch | Early Friday finish\\n\\n* Up to 35+ Holidays including duvet days\\n\\n* 5 Year Sabbatical\\n\\n* Quarterly Incentives – Fine Dinning & Team Events\\n\\n* Bi-annual Incentives - Dubai, Marbella, Barbados, and more\\n\\nBut we think the real sell is how we'll support you to source more candidates, bring on more clients, work smarter, develop and earn more money. 
You're going to enjoy it a lot more too.\\n\\nInterested?\\n\\nLet's have a confidential, informal chat, and we can tell you more about us, the role, and any other questions you might have.\\n\\nCall or Email\" \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
     type                                               n     pct
2    http://www.w3.org/1999/02/22-rdf-syntax-ns#type    2803  0.993972
5    http://schema.org/JobPosting/title                 2387  0.846454
6    http://schema.org/JobPosting/description           2153  0.763475
1    http://schema.org/JobPosting/datePosted            1826  0.647518
3    http://schema.org/JobPosting/jobLocation           1765  0.625887
...  ...                                                ...   ...
86   http://schema.org/JobPosting/country               1     0.000355
88   http://schema.org/JobPosting/disambiguatingDes...  1     0.000355
90   http://schema.org/JobPosting/expirienceRequire...  1     0.000355
91   http://schema.org/JobPosting/Responsibilities      1     0.000355
121  http://schema.org/JobPosting/startDate             1     0.000355
\n", "

122 rows × 3 columns

\n", "" ], "text/plain": [ " type n pct\n", "2 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 2803 0.993972\n", "5 http://schema.org/JobPosting/title 2387 0.846454\n", "6 http://schema.org/JobPosting/description 2153 0.763475\n", "1 http://schema.org/JobPosting/datePosted 1826 0.647518\n", "3 http://schema.org/JobPosting/jobLocation 1765 0.625887\n", ".. ... ... ...\n", "86 http://schema.org/JobPosting/country 1 0.000355\n", "88 http://schema.org/JobPosting/disambiguatingDes... 1 0.000355\n", "90 http://schema.org/JobPosting/expirienceRequire... 1 0.000355\n", "91 http://schema.org/JobPosting/Responsibilities 1 0.000355\n", "121 http://schema.org/JobPosting/startDate 1 0.000355\n", "\n", "[122 rows x 3 columns]" ] }, "execution_count": 267, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(c.items(), columns=['type', 'n']).assign(pct=lambda df: df['n'] / len(seen_domains)).sort_values('n', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis" ] }, { "cell_type": "code", "execution_count": 501, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1843, 2820)" ] }, "execution_count": 501, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(json_graphs), len(graphs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How often each predicate appears in the JSON-LD graphs (fraction of graphs in which it is present at least once):" ] }, { "cell_type": "code", "execution_count": 474, "metadata": {}, "outputs": [], "source": [ "j_counts = pd.DataFrame([Counter(p for p, o in graph.predicate_objects(s)) for graph, s in json_graphs])" ] }, { "cell_type": "code", "execution_count": 477, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
   http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://schema.org/datePosted  http://schema.org/title  http://schema.org/description  http://schema.org/hiringOrganization  http://schema.org/jobLocation  http://schema.org/employmentType  http://schema.org/validThrough  http://schema.org/baseSalary  http://schema.org/identifier  http://schema.org/industry  http://schema.org/url  http://schema.org/salaryCurrency  http://schema.org/educationRequirements  http://schema.org/occupationalCategory  http://schema.org/experienceRequirements  http://schema.org/workHours  http://schema.org/jobBenefits  http://schema.org/skills  http://schema.org/qualifications  http://schema.org/responsibilities  http://schema.org/image  http://schema.org/jobLocationType  http://schema.org/incentiveCompensation  http://schema.org/name  http://schema.org/mainEntityOfPage  http://schema.org/specialCommitments  http://schema.org/applicantLocationRequirements  http://schema.org/estimatedSalary  http://schema.org/sameAs  http://schema.org/disambiguatingDescription  http://schema.org/industries  http://schema.org/URL  http://schema.org/jobStartDate  http://schema.org/logo  http://schema.org/potentialAction  http://schema.org/HiringOrganization  http://schema.org/postalCode  http://schema.org/warningbaseSalary  http://schema.org/  http://schema.org/geo  http://schema.org/gvalidThrough
0  1.0  0.996744  0.994031  0.991319  0.98318  0.975583  0.816603  0.604992  0.468801  0.414542  0.391753  0.227347  0.150298  0.10255  0.090071  0.088985  0.078676  0.077048  0.076506  0.071622  0.059143  0.057515  0.028215  0.025502  0.015193  0.011394  0.005969  0.004341  0.003798  0.001628  0.001628  0.001085  0.001085  0.000543  0.000543  0.000543  0.000543  0.000543  0.000543  0.000543  0.000543  0.000543
\n", "
" ], "text/plain": [ " http://www.w3.org/1999/02/22-rdf-syntax-ns#type \\\n", "0 1.0 \n", "\n", " http://schema.org/datePosted http://schema.org/title \\\n", "0 0.996744 0.994031 \n", "\n", " http://schema.org/description http://schema.org/hiringOrganization \\\n", "0 0.991319 0.98318 \n", "\n", " http://schema.org/jobLocation http://schema.org/employmentType \\\n", "0 0.975583 0.816603 \n", "\n", " http://schema.org/validThrough http://schema.org/baseSalary \\\n", "0 0.604992 0.468801 \n", "\n", " http://schema.org/identifier http://schema.org/industry \\\n", "0 0.414542 0.391753 \n", "\n", " http://schema.org/url http://schema.org/salaryCurrency \\\n", "0 0.227347 0.150298 \n", "\n", " http://schema.org/educationRequirements \\\n", "0 0.10255 \n", "\n", " http://schema.org/occupationalCategory \\\n", "0 0.090071 \n", "\n", " http://schema.org/experienceRequirements http://schema.org/workHours \\\n", "0 0.088985 0.078676 \n", "\n", " http://schema.org/jobBenefits http://schema.org/skills \\\n", "0 0.077048 0.076506 \n", "\n", " http://schema.org/qualifications http://schema.org/responsibilities \\\n", "0 0.071622 0.059143 \n", "\n", " http://schema.org/image http://schema.org/jobLocationType \\\n", "0 0.057515 0.028215 \n", "\n", " http://schema.org/incentiveCompensation http://schema.org/name \\\n", "0 0.025502 0.015193 \n", "\n", " http://schema.org/mainEntityOfPage http://schema.org/specialCommitments \\\n", "0 0.011394 0.005969 \n", "\n", " http://schema.org/applicantLocationRequirements \\\n", "0 0.004341 \n", "\n", " http://schema.org/estimatedSalary http://schema.org/sameAs \\\n", "0 0.003798 0.001628 \n", "\n", " http://schema.org/disambiguatingDescription http://schema.org/industries \\\n", "0 0.001628 0.001085 \n", "\n", " http://schema.org/URL http://schema.org/jobStartDate \\\n", "0 0.001085 0.000543 \n", "\n", " http://schema.org/logo http://schema.org/potentialAction \\\n", "0 0.000543 0.000543 \n", "\n", " http://schema.org/HiringOrganization 
http://schema.org/postalCode \\\n", "0 0.000543 0.000543 \n", "\n", " http://schema.org/warningbaseSalary http://schema.org/ \\\n", "0 0.000543 0.000543 \n", "\n", " http://schema.org/geo http://schema.org/gvalidThrough \n", "0 0.000543 0.000543 " ] }, "execution_count": 477, "metadata": {}, "output_type": "execute_result" } ], "source": [ "j_missing = j_counts.isna().mean().sort_values()\n", "(1 - j_missing).to_frame().T" ] }, { "cell_type": "code", "execution_count": 487, "metadata": {}, "outputs": [], "source": [ "m_counts = pd.DataFrame([Counter(p for p, o in graph.predicate_objects(s)) for graph, s in graphs])" ] }, { "cell_type": "code", "execution_count": 493, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
http://www.w3.org/1999/02/22-rdf-syntax-ns#typehttp://schema.org/JobPosting/titlehttp://schema.org/JobPosting/descriptionhttp://schema.org/JobPosting/datePostedhttp://schema.org/JobPosting/jobLocationhttp://schema.org/JobPosting/hiringOrganizationhttp://schema.org/JobPosting/employmentTypehttp://schema.org/JobPosting/validThroughhttp://schema.org/JobPosting/baseSalaryhttp://schema.org/JobPosting/industryhttp://schema.org/JobPosting/urlhttp://schema.org/JobPosting/workHourshttp://schema.org/JobPosting/experienceRequirementshttp://schema.org/JobPosting/occupationalCategoryhttp://schema.org/JobPosting/namehttp://schema.org/JobPosting/imagehttp://schema.org/JobPosting/identifierhttp://schema.org/JobPosting/educationRequirementshttp://schema.org/JobPosting/qualificationshttp://schema.org/JobPosting/responsibilitieshttp://schema.org/JobPosting/salaryCurrencyhttp://schema.org/JobPosting/addresshttp://schema.org/JobPosting/skillshttp://schema.org/JobPosting/specialCommitmentshttp://schema.org/JobPosting/abouthttp://schema.org/JobPosting/jobBenefitshttp://schema.org/JobPosting/benefitshttp://schema.org/JobPosting/telephonehttp://schema.org/JobPosting/incentiveshttp://schema.org/JobPosting/addressLocalityhttp://schema.org/JobPosting/col-md-12http://schema.org/JobPosting/logohttp://schema.org/JobPosting/currencyhttp://schema.org/JobPosting/valuehttp://schema.org/JobPosting/addressRegionhttp://schema.org/JobPosting/incentiveCompensationhttp://schema.org/JobPosting/unitTexthttp://schema.org/JobPosting/postalCodehttp://schema.org/JobPosting/addressCountryhttp://schema.org/JobPosting/texthttp://schema.org/JobPosting/jobLocationTypehttp://schema.org/JobPosting/estimatedSalaryhttp://schema.org/JobPosting/facilityhttp://schema.org/JobPosting/customfield2http://schema.org/JobPosting/sameAshttp://schema.org/JobPosting/datehttp://schema.org/JobPosting/customfield1http://schema.org/JobPosting/departmenthttp://schema.org/JobPosting/mainEntityOfPagehttp://schema.org/JobPosting/shifttypehtt
p://schema.org/JobPosting/contacthttp://schema.org/JobPosting/customfield3http://schema.org/JobPosting/potentialActionhttp://schema.org/JobPosting/datePublishedhttp://schema.org/JobPosting/streetAddresshttp://schema.org/JobPosting/hiringOrganisationhttp://schema.org/JobPosting/depthttp://schema.org/JobPosting/headlinehttp://schema.org/JobPosting/cityhttp://schema.org/JobPosting/minValuehttp://schema.org/JobPosting/responsabilitieshttp://schema.org/JobPosting/maxValuehttp://schema.org/JobPosting/customfield4http://schema.org/JobPosting/jobTitlehttp://schema.org/JobPosting/emailhttp://schema.org/JobPosting/authorhttp://schema.org/JobPosting/employmenttypehttp://schema.org/JobPosting/reviewhttp://schema.org/JobPosting/additionalTypehttp://schema.org/JobPosting/jobLocation.addresshttp://schema.org/JobPosting/businessunithttp://schema.org/JobPosting/jobSalaryhttp://schema.org/JobPosting/salaryhttp://schema.org/JobPosting/validTroughhttp://schema.org/JobPosting/significantLinkhttp://schema.org/JobPosting/employmentUnithttp://schema.org/JobPosting/joblocationhttp://schema.org/JobPosting/jobStartDatehttp://schema.org/JobPosting/jobCategoryhttp://schema.org/JobPosting/EventDatehttp://schema.org/JobPosting/publisherhttp://schema.org/JobPosting/dateModifiedhttp://schema.org/JobPosting/memberhttp://schema.org/JobPosting/contentUrlhttp://schema.org/JobPosting/blogPosthttp://schema.org/JobPosting/jobCityhttp://schema.org/JobPosting/thumbnailUrlhttp://schema.org/JobPosting/locationhttp://schema.org/JobPosting/photohttp://schema.org/JobPosting/jobExpireshttp://schema.org/JobPosting/alternateNamehttp://schema.org/JobPosting/datepostedhttp://schema.org/JobPosting/jobLocationAddresshttp://schema.org/JobPosting/jobReferencehttp://schema.org/JobPosting/urllinkhttp://schema.org/JobPosting/agenthttp://schema.org/JobPosting/dateCreatedhttp://schema.org/JobPosting/RequirementsDescriptionhttp://schema.org/JobPosting/keywordshttp://schema.org/JobPosting/jobExperiencehttp://schema.org/JobPosti
ng/jobstartdatehttp://schema.org/JobPosting/dateExpireshttps://schema.org/experienceRequirementshttp://schema.org/JobPosting/adcodehttp://schema.org/JobPosting/customfield5http://schema.org/JobPosting/funderhttp://schema.org/JobPosting/ziphttp://schema.org/JobPosting/countryhttp://schema.org/JobPosting/disambiguatingDescriptionhttp://schema.org/JobPosting/relatedLinkhttp://schema.org/JobPosting/expirienceRequirementshttp://schema.org/JobPosting/Responsibilitieshttp://schema.org/JobPosting/startTimehttp://schema.org/JobPosting/jobcategoryhttp://schema.org/JobPosting/txt_inlinehttp://schema.org/JobPosting/skillRequirementshttp://schema.org/JobPosting/genrehttp://schema.org/JobPosting/commenthttp://schema.org/JobPosting/startDate
00.9939720.8446810.7620570.6464540.6251770.5929080.3854610.2294330.2117020.2081560.2039010.1007090.0893620.0812060.0804960.0744680.0698580.0673760.0613480.0542550.0436170.0425530.0421990.0265960.0216310.0180850.0180850.013830.0106380.0081560.0074470.0070920.0056740.0053190.0053190.0049650.0042550.0039010.0039010.0031910.0031910.0031910.0028370.0024820.0024820.0024820.0021280.0021280.0021280.0021280.0017730.0017730.0017730.0017730.0014180.0014180.0014180.0014180.0014180.0010640.0010640.0010640.0010640.0010640.0010640.0010640.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0007090.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.0003550.000355
\n", "
" ], "text/plain": [ " http://www.w3.org/1999/02/22-rdf-syntax-ns#type \\\n", "0 0.993972 \n", "\n", " http://schema.org/JobPosting/title \\\n", "0 0.844681 \n", "\n", " http://schema.org/JobPosting/description \\\n", "0 0.762057 \n", "\n", " http://schema.org/JobPosting/datePosted \\\n", "0 0.646454 \n", "\n", " http://schema.org/JobPosting/jobLocation \\\n", "0 0.625177 \n", "\n", " http://schema.org/JobPosting/hiringOrganization \\\n", "0 0.592908 \n", "\n", " http://schema.org/JobPosting/employmentType \\\n", "0 0.385461 \n", "\n", " http://schema.org/JobPosting/validThrough \\\n", "0 0.229433 \n", "\n", " http://schema.org/JobPosting/baseSalary \\\n", "0 0.211702 \n", "\n", " http://schema.org/JobPosting/industry http://schema.org/JobPosting/url \\\n", "0 0.208156 0.203901 \n", "\n", " http://schema.org/JobPosting/workHours \\\n", "0 0.100709 \n", "\n", " http://schema.org/JobPosting/experienceRequirements \\\n", "0 0.089362 \n", "\n", " http://schema.org/JobPosting/occupationalCategory \\\n", "0 0.081206 \n", "\n", " http://schema.org/JobPosting/name http://schema.org/JobPosting/image \\\n", "0 0.080496 0.074468 \n", "\n", " http://schema.org/JobPosting/identifier \\\n", "0 0.069858 \n", "\n", " http://schema.org/JobPosting/educationRequirements \\\n", "0 0.067376 \n", "\n", " http://schema.org/JobPosting/qualifications \\\n", "0 0.061348 \n", "\n", " http://schema.org/JobPosting/responsibilities \\\n", "0 0.054255 \n", "\n", " http://schema.org/JobPosting/salaryCurrency \\\n", "0 0.043617 \n", "\n", " http://schema.org/JobPosting/address http://schema.org/JobPosting/skills \\\n", "0 0.042553 0.042199 \n", "\n", " http://schema.org/JobPosting/specialCommitments \\\n", "0 0.026596 \n", "\n", " http://schema.org/JobPosting/about \\\n", "0 0.021631 \n", "\n", " http://schema.org/JobPosting/jobBenefits \\\n", "0 0.018085 \n", "\n", " http://schema.org/JobPosting/benefits \\\n", "0 0.018085 \n", "\n", " http://schema.org/JobPosting/telephone \\\n", "0 0.01383 \n", 
"\n", " http://schema.org/JobPosting/incentives \\\n", "0 0.010638 \n", "\n", " http://schema.org/JobPosting/addressLocality \\\n", "0 0.008156 \n", "\n", " http://schema.org/JobPosting/col-md-12 http://schema.org/JobPosting/logo \\\n", "0 0.007447 0.007092 \n", "\n", " http://schema.org/JobPosting/currency http://schema.org/JobPosting/value \\\n", "0 0.005674 0.005319 \n", "\n", " http://schema.org/JobPosting/addressRegion \\\n", "0 0.005319 \n", "\n", " http://schema.org/JobPosting/incentiveCompensation \\\n", "0 0.004965 \n", "\n", " http://schema.org/JobPosting/unitText \\\n", "0 0.004255 \n", "\n", " http://schema.org/JobPosting/postalCode \\\n", "0 0.003901 \n", "\n", " http://schema.org/JobPosting/addressCountry \\\n", "0 0.003901 \n", "\n", " http://schema.org/JobPosting/text \\\n", "0 0.003191 \n", "\n", " http://schema.org/JobPosting/jobLocationType \\\n", "0 0.003191 \n", "\n", " http://schema.org/JobPosting/estimatedSalary \\\n", "0 0.003191 \n", "\n", " http://schema.org/JobPosting/facility \\\n", "0 0.002837 \n", "\n", " http://schema.org/JobPosting/customfield2 \\\n", "0 0.002482 \n", "\n", " http://schema.org/JobPosting/sameAs http://schema.org/JobPosting/date \\\n", "0 0.002482 0.002482 \n", "\n", " http://schema.org/JobPosting/customfield1 \\\n", "0 0.002128 \n", "\n", " http://schema.org/JobPosting/department \\\n", "0 0.002128 \n", "\n", " http://schema.org/JobPosting/mainEntityOfPage \\\n", "0 0.002128 \n", "\n", " http://schema.org/JobPosting/shifttype \\\n", "0 0.002128 \n", "\n", " http://schema.org/JobPosting/contact \\\n", "0 0.001773 \n", "\n", " http://schema.org/JobPosting/customfield3 \\\n", "0 0.001773 \n", "\n", " http://schema.org/JobPosting/potentialAction \\\n", "0 0.001773 \n", "\n", " http://schema.org/JobPosting/datePublished \\\n", "0 0.001773 \n", "\n", " http://schema.org/JobPosting/streetAddress \\\n", "0 0.001418 \n", "\n", " http://schema.org/JobPosting/hiringOrganisation \\\n", "0 0.001418 \n", "\n", " 
http://schema.org/JobPosting/dept http://schema.org/JobPosting/headline \\\n", "0 0.001418 0.001418 \n", "\n", " http://schema.org/JobPosting/city http://schema.org/JobPosting/minValue \\\n", "0 0.001418 0.001064 \n", "\n", " http://schema.org/JobPosting/responsabilities \\\n", "0 0.001064 \n", "\n", " http://schema.org/JobPosting/maxValue \\\n", "0 0.001064 \n", "\n", " http://schema.org/JobPosting/customfield4 \\\n", "0 0.001064 \n", "\n", " http://schema.org/JobPosting/jobTitle http://schema.org/JobPosting/email \\\n", "0 0.001064 0.001064 \n", "\n", " http://schema.org/JobPosting/author \\\n", "0 0.001064 \n", "\n", " http://schema.org/JobPosting/employmenttype \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/review \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/additionalType \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/jobLocation.address \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/businessunit \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/jobSalary \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/salary \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/validTrough \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/significantLink \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/employmentUnit \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/joblocation \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/jobStartDate \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/jobCategory \\\n", "0 0.000709 \n", "\n", " http://schema.org/JobPosting/EventDate \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/publisher \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/dateModified \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/member \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/contentUrl \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/blogPost \\\n", "0 
0.000355 \n", "\n", " http://schema.org/JobPosting/jobCity \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/thumbnailUrl \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/location http://schema.org/JobPosting/photo \\\n", "0 0.000355 0.000355 \n", "\n", " http://schema.org/JobPosting/jobExpires \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/alternateName \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/dateposted \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/jobLocationAddress \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/jobReference \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/urllink http://schema.org/JobPosting/agent \\\n", "0 0.000355 0.000355 \n", "\n", " http://schema.org/JobPosting/dateCreated \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/RequirementsDescription \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/keywords \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/jobExperience \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/jobstartdate \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/dateExpires \\\n", "0 0.000355 \n", "\n", " https://schema.org/experienceRequirements \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/adcode \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/customfield5 \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/funder http://schema.org/JobPosting/zip \\\n", "0 0.000355 0.000355 \n", "\n", " http://schema.org/JobPosting/country \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/disambiguatingDescription \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/relatedLink \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/expirienceRequirements \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/Responsibilities \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/startTime \\\n", "0 0.000355 \n", 
"\n", " http://schema.org/JobPosting/jobcategory \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/txt_inline \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/skillRequirements \\\n", "0 0.000355 \n", "\n", " http://schema.org/JobPosting/genre http://schema.org/JobPosting/comment \\\n", "0 0.000355 0.000355 \n", "\n", " http://schema.org/JobPosting/startDate \n", "0 0.000355 " ] }, "execution_count": 493, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m_missing = m_counts.isna().mean().sort_values()\n", "(1 - m_missing).to_frame().T" ] }, { "cell_type": "code", "execution_count": 485, "metadata": {}, "outputs": [], "source": [ "def prop_more_than_1(x):\n", " return (x > 1).mean()" ] }, { "cell_type": "code", "execution_count": 492, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " http://www.w3.org/1999/02/22-rdf-syntax-ns#type \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/datePosted http://schema.org/title \\\n", "min 1.0 1.000000 \n", "mean 1.0 1.001638 \n", "max 1.0 4.000000 \n", "prop_more_than_1 0.0 0.000543 \n", "\n", " http://schema.org/description \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/hiringOrganization \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/jobLocation \\\n", "min 1.000000 \n", "mean 1.036151 \n", "max 23.000000 \n", "prop_more_than_1 0.008681 \n", "\n", " http://schema.org/employmentType \\\n", "min 1.000000 \n", "mean 1.047841 \n", "max 6.000000 \n", "prop_more_than_1 0.029300 \n", "\n", " http://schema.org/validThrough \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/baseSalary http://schema.org/identifier \\\n", "min 1.000000 1.0 \n", "mean 1.003472 1.0 \n", "max 3.000000 1.0 \n", "prop_more_than_1 0.001085 0.0 \n", "\n", " http://schema.org/industry http://schema.org/url \\\n", "min 1.000000 1.0 \n", "mean 1.047091 1.0 \n", "max 8.000000 1.0 \n", "prop_more_than_1 0.008681 0.0 \n", "\n", " http://schema.org/salaryCurrency \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/educationRequirements \\\n", "min 1.000000 \n", "mean 1.010582 \n", "max 3.000000 \n", "prop_more_than_1 0.000543 \n", "\n", " http://schema.org/occupationalCategory \\\n", "min 1.000000 \n", "mean 1.427711 \n", "max 9.000000 \n", "prop_more_than_1 0.017363 \n", "\n", " http://schema.org/experienceRequirements \\\n", "min 1.000000 \n", "mean 1.079268 \n", "max 6.000000 \n", "prop_more_than_1 0.002170 \n", "\n", " http://schema.org/workHours http://schema.org/jobBenefits \\\n", "min 1.000000 1.000000 \n", "mean 1.006897 1.098592 \n", "max 2.000000 
9.000000 \n", "prop_more_than_1 0.000543 0.001085 \n", "\n", " http://schema.org/skills http://schema.org/qualifications \\\n", "min 1.000000 1.0 \n", "mean 1.042553 1.0 \n", "max 5.000000 1.0 \n", "prop_more_than_1 0.001085 0.0 \n", "\n", " http://schema.org/responsibilities http://schema.org/image \\\n", "min 1.000000 1.000000 \n", "mean 1.229358 1.009434 \n", "max 16.000000 2.000000 \n", "prop_more_than_1 0.001628 0.000543 \n", "\n", " http://schema.org/jobLocationType \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/incentiveCompensation \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/name http://schema.org/mainEntityOfPage \\\n", "min 1.0 1.0 \n", "mean 1.0 1.0 \n", "max 1.0 1.0 \n", "prop_more_than_1 0.0 0.0 \n", "\n", " http://schema.org/specialCommitments \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/applicantLocationRequirements \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/estimatedSalary http://schema.org/sameAs \\\n", "min 1.0 1.0 \n", "mean 1.0 1.0 \n", "max 1.0 1.0 \n", "prop_more_than_1 0.0 0.0 \n", "\n", " http://schema.org/disambiguatingDescription \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/industries http://schema.org/URL \\\n", "min 1.0 1.0 \n", "mean 1.0 1.0 \n", "max 1.0 1.0 \n", "prop_more_than_1 0.0 0.0 \n", "\n", " http://schema.org/jobStartDate http://schema.org/logo \\\n", "min 1.0 1.0 \n", "mean 1.0 1.0 \n", "max 1.0 1.0 \n", "prop_more_than_1 0.0 0.0 \n", "\n", " http://schema.org/potentialAction \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/HiringOrganization \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/postalCode \\\n", "min 1.0 \n", "mean 1.0 
\n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/warningbaseSalary http://schema.org/ \\\n", "min 1.0 1.0 \n", "mean 1.0 1.0 \n", "max 1.0 1.0 \n", "prop_more_than_1 0.0 0.0 \n", "\n", " http://schema.org/geo http://schema.org/gvalidThrough \n", "min 1.0 1.0 \n", "mean 1.0 1.0 \n", "max 1.0 1.0 \n", "prop_more_than_1 0.0 0.0 " ] }, "execution_count": 492, "metadata": {}, "output_type": "execute_result" } ], "source": [ "j_counts.agg(['min', 'mean', 'max', prop_more_than_1])[j_missing.index]" ] }, { "cell_type": "code", "execution_count": 494, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " http://www.w3.org/1999/02/22-rdf-syntax-ns#type \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/title \\\n", "min 1.000000 \n", "mean 1.025609 \n", "max 30.000000 \n", "prop_more_than_1 0.005674 \n", "\n", " http://schema.org/JobPosting/description \\\n", "min 1.000000 \n", "mean 1.031643 \n", "max 11.000000 \n", "prop_more_than_1 0.012411 \n", "\n", " http://schema.org/JobPosting/datePosted \\\n", "min 1.000000 \n", "mean 1.027427 \n", "max 22.000000 \n", "prop_more_than_1 0.006383 \n", "\n", " http://schema.org/JobPosting/jobLocation \\\n", "min 1.000000 \n", "mean 1.099830 \n", "max 20.000000 \n", "prop_more_than_1 0.024113 \n", "\n", " http://schema.org/JobPosting/hiringOrganization \\\n", "min 1.000000 \n", "mean 1.040670 \n", "max 20.000000 \n", "prop_more_than_1 0.012411 \n", "\n", " http://schema.org/JobPosting/employmentType \\\n", "min 1.000000 \n", "mean 1.048758 \n", "max 6.000000 \n", "prop_more_than_1 0.015957 \n", "\n", " http://schema.org/JobPosting/validThrough \\\n", "min 1.000000 \n", "mean 1.003091 \n", "max 2.000000 \n", "prop_more_than_1 0.000709 \n", "\n", " http://schema.org/JobPosting/baseSalary \\\n", "min 1.000000 \n", "mean 1.033501 \n", "max 6.000000 \n", "prop_more_than_1 0.003901 \n", "\n", " http://schema.org/JobPosting/industry \\\n", "min 1.000000 \n", "mean 1.323680 \n", "max 24.000000 \n", "prop_more_than_1 0.026596 \n", "\n", " http://schema.org/JobPosting/url \\\n", "min 1.000000 \n", "mean 1.135652 \n", "max 30.000000 \n", "prop_more_than_1 0.003191 \n", "\n", " http://schema.org/JobPosting/workHours \\\n", "min 1.000000 \n", "mean 1.010563 \n", "max 3.000000 \n", "prop_more_than_1 0.000709 \n", "\n", " http://schema.org/JobPosting/experienceRequirements \\\n", "min 1.000000 \n", "mean 1.023810 \n", "max 3.000000 \n", "prop_more_than_1 0.001773 \n", "\n", " http://schema.org/JobPosting/occupationalCategory \\\n", "min 1.000000 \n", "mean 
1.209607 \n", "max 5.000000 \n", "prop_more_than_1 0.009220 \n", "\n", " http://schema.org/JobPosting/name \\\n", "min 1.000000 \n", "mean 1.052863 \n", "max 4.000000 \n", "prop_more_than_1 0.003191 \n", "\n", " http://schema.org/JobPosting/image \\\n", "min 1.000000 \n", "mean 1.200000 \n", "max 31.000000 \n", "prop_more_than_1 0.002837 \n", "\n", " http://schema.org/JobPosting/identifier \\\n", "min 1.000000 \n", "mean 1.005076 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/educationRequirements \\\n", "min 1.000000 \n", "mean 1.036842 \n", "max 3.000000 \n", "prop_more_than_1 0.001773 \n", "\n", " http://schema.org/JobPosting/qualifications \\\n", "min 1.000000 \n", "mean 1.248555 \n", "max 19.000000 \n", "prop_more_than_1 0.003191 \n", "\n", " http://schema.org/JobPosting/responsibilities \\\n", "min 1.000000 \n", "mean 1.039216 \n", "max 2.000000 \n", "prop_more_than_1 0.002128 \n", "\n", " http://schema.org/JobPosting/salaryCurrency \\\n", "min 1.000000 \n", "mean 1.008130 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/address \\\n", "min 1.000000 \n", "mean 1.008333 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/skills \\\n", "min 1.000000 \n", "mean 1.176471 \n", "max 14.000000 \n", "prop_more_than_1 0.002128 \n", "\n", " http://schema.org/JobPosting/specialCommitments \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/about \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobBenefits \\\n", "min 1.000000 \n", "mean 1.039216 \n", "max 2.000000 \n", "prop_more_than_1 0.000709 \n", "\n", " http://schema.org/JobPosting/benefits \\\n", "min 1.000000 \n", "mean 1.019608 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/telephone \\\n", "min 1.000000 \n", 
"mean 1.128205 \n", "max 2.000000 \n", "prop_more_than_1 0.001773 \n", "\n", " http://schema.org/JobPosting/incentives \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/addressLocality \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/col-md-12 \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/logo \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/currency \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/value \\\n", "min 1.000000 \n", "mean 1.666667 \n", "max 2.000000 \n", "prop_more_than_1 0.003546 \n", "\n", " http://schema.org/JobPosting/addressRegion \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/incentiveCompensation \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/unitText \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/postalCode \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/addressCountry \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/text \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobLocationType \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/estimatedSalary \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/facility \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " 
http://schema.org/JobPosting/customfield2 \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/sameAs \\\n", "min 1.000000 \n", "mean 1.142857 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/date \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/customfield1 \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/department \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/mainEntityOfPage \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/shifttype \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/contact \\\n", "min 1.000000 \n", "mean 1.400000 \n", "max 3.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/customfield3 \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/potentialAction \\\n", "min 1.000000 \n", "mean 1.600000 \n", "max 2.000000 \n", "prop_more_than_1 0.001064 \n", "\n", " http://schema.org/JobPosting/datePublished \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/streetAddress \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/hiringOrganisation \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/dept \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/headline \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/city \\\n", "min 1.0 \n", 
"mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/minValue \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/responsabilities \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/maxValue \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/customfield4 \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobTitle \\\n", "min 1.000000 \n", "mean 1.333333 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/email \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/author \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/employmenttype \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/review \\\n", "min 1.000000 \n", "mean 5.000000 \n", "max 9.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/additionalType \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobLocation.address \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/businessunit \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobSalary \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/salary \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/validTrough \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " 
http://schema.org/JobPosting/significantLink \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/employmentUnit \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/joblocation \\\n", "min 1.000000 \n", "mean 2.000000 \n", "max 3.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/jobStartDate \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobCategory \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/EventDate \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/publisher \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/dateModified \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/member \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/contentUrl \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/blogPost \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobCity \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/thumbnailUrl \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/location \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/photo \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobExpires \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", 
"prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/alternateName \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/dateposted \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobLocationAddress \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobReference \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/urllink \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/agent \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/dateCreated \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/RequirementsDescription \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/keywords \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobExperience \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobstartdate \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/dateExpires \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " https://schema.org/experienceRequirements \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/adcode \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/customfield5 \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/funder \\\n", "min 1.0 \n", "mean 1.0 
\n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/zip \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/country \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/disambiguatingDescription \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/relatedLink \\\n", "min 3.000000 \n", "mean 3.000000 \n", "max 3.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/expirienceRequirements \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/Responsibilities \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/startTime \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/jobcategory \\\n", "min 2.000000 \n", "mean 2.000000 \n", "max 2.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/txt_inline \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/skillRequirements \\\n", "min 3.000000 \n", "mean 3.000000 \n", "max 3.000000 \n", "prop_more_than_1 0.000355 \n", "\n", " http://schema.org/JobPosting/genre \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/comment \\\n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 \n", "\n", " http://schema.org/JobPosting/startDate \n", "min 1.0 \n", "mean 1.0 \n", "max 1.0 \n", "prop_more_than_1 0.0 " ] }, "execution_count": 494, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m_counts.agg(['min', 'mean', 'max', prop_more_than_1])[m_missing.index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deeper analysis" ] 
}, { "cell_type": "code", "execution_count": 510, "metadata": {}, "outputs": [], "source": [ "SDO = rdflib.namespace.Namespace('http://schema.org/')\n", "def extract_property(graphs, sdo_type):\n", "    predicate = SDO[sdo_type]\n", "    for items in ([graph_to_dict(graph, o) if isinstance(o, rdflib.term.BNode) else o.toPython() for o in graph.objects(s, predicate)] for graph, s in graphs):\n", "        if items:\n", "            yield items" ] }, { "cell_type": "code", "execution_count": 606, "metadata": {}, "outputs": [], "source": [ "SDO = rdflib.namespace.Namespace('http://schema.org/')\n", "def extract_types(graphs, sdo_type):\n", "    predicate = SDO[sdo_type]\n", "    for graph, s in graphs:\n", "        items = list(graph.objects(s, predicate))\n", "        if items:\n", "            item = items[0]\n", "            if isinstance(item, rdflib.term.BNode):\n", "                try:\n", "                    dtype = list(graph.objects(item, rdflib.namespace.RDF.type))\n", "                    yield dtype[0].toPython()\n", "                except Exception:\n", "                    yield 'Unknown Object'\n", "            elif isinstance(item, rdflib.term.Literal):\n", "                dtype = type(item.toPython())\n", "                if dtype == rdflib.term.Literal:\n", "                    yield item.datatype.toPython()\n", "                else:\n", "                    yield dtype\n", "            elif isinstance(item, rdflib.term.URIRef):\n", "                yield 'URI'\n", "            else:\n", "                yield 'Unknown'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Title" ] }, { "cell_type": "code", "execution_count": 607, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 1832}), Counter({str: 2272, 'URI': 110}))" ] }, "execution_count": 607, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'title')), Counter(extract_types(graphs, 'JobPosting/title'))" ] }, { "cell_type": "code", "execution_count": 666, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Category Manager - Prof. 
Audio Visual Solutions'],\n", " ['Stage Commerciële Economie'],\n", " ['Poster Distributor Wanted'],\n", " ['Montréal - Machiniste - Anglais - Français'],\n", " ['PT Faculty Pool - Apprenticeship/Electrical IID']]" ] }, "execution_count": 666, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'title'))[:5]" ] }, { "cell_type": "code", "execution_count": 667, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['ADDETTO ALLA PIANIFICAZIONE DELLA PRODUZIONE JUNIOR'],\n", " ['Psychiatric Nurse Practitioner'],\n", " ['Visual Merchandiser ZARA Men Arnhem (fulltime)'],\n", " ['\\n\\t\\t\\t\\t\\tبحاجة الى العمل دكتور صيدلي\\t\\t\\t\\t\\t26 مشاهدة\\t\\t\\t\\t'],\n", " ['Philadelphia-Housekeepers']]" ] }, "execution_count": 667, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/title'))[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Description" ] }, { "cell_type": "code", "execution_count": 608, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 1827}), Counter({str: 2149}))" ] }, "execution_count": 608, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'description')), Counter(extract_types(graphs, 'JobPosting/description'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### JobLocation" ] }, { "cell_type": "code", "execution_count": 609, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'http://schema.org/Place': 1760,\n", " 'Unknown Object': 27,\n", " 'http://schema.org/place': 8,\n", " str: 1,\n", " 'http://schema.org/Country': 2}),\n", " Counter({'http://schema.org/Place': 1347,\n", " str: 361,\n", " 'URI': 24,\n", " 'https://schema.org/Place': 11,\n", " 'http:/schema.orgPlace': 17,\n", " 'http://schema.org/PostalAddress': 1,\n", " 'Unknown Object': 1,\n", " 'http://schema.org/City': 1}))" ] }, "execution_count": 609, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'jobLocation')), Counter(extract_types(graphs, 'JobPosting/jobLocation'))" ] }, { "cell_type": "code", "execution_count": 670, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " 'http://schema.org/address': [{'http://schema.org/addressCountry': ['United States'],\n", " 'http://schema.org/addressLocality': ['Glenview'],\n", " 'http://schema.org/addressRegion': ['IL'],\n", " 'http://schema.org/postalCode': ['60026'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress']}],\n", " '_label': ['http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us']}],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " 'http://schema.org/address': [{'http://schema.org/addressCountry': ['NL'],\n", " 'http://schema.org/postalCode': ['5223 MA'],\n", " 'http://schema.org/addressLocality': ['Den Bosch'],\n", " 'http://schema.org/addressRegion': ['NB'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress']}],\n", " '_label': ['http://stage.socialdeal.nl/o/stage-commerciele-economie-2']}],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " 'http://schema.org/address': [{'http://schema.org/postalCode': ['SL6 8ND'],\n", " 'http://schema.org/addressLocality': ['Maidenhead'],\n", " 'http://schema.org/addressCountry': ['GB'],\n", " 'http://schema.org/streetAddress': ['21 Lassell Gardens'],\n", " 'http://schema.org/addressRegion': ['Berkshire'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress']}],\n", " '_label': ['http://www.poster-campaign.com/poster-distributors/']}]]" ] }, "execution_count": 670, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'jobLocation'))[:3]" ] }, { "cell_type": "code", "execution_count": 671, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'http://schema.org/Place/address': [{'http://schema.org/PostalAddress/addressLocality': ['Reggio Emilia provincia'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress'],\n", " 'http://schema.org/PostalAddress/addressRegion': ['Regione Emilia Romagna']}],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " '_label': ['http://cambiolavoro.com/clav/bacheca.nsf/AnnunciDiLavoroNew/ADDETTO_ALLA_PIANIFICAZIONE_DELLA_PRODUZIONE_JUNIOR_REGIONE_EMILIA_ROMAGNA_REGGIO_EMILIA_2F4B6DB7F4B2420DC1258486004FDCCA?OpenDocument']}],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " 'http://schema.org/Place/address': [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress'],\n", " 'http://schema.org/PostalAddress/addressRegion': ['NV'],\n", " 'http://schema.org/PostalAddress/addressLocality': ['Pahrump']}],\n", " '_label': ['http://careers.cnsjobmarket.psychiatrist.com/jobs/psychiatric-nurse-practitioner-pahrump-nv-108424726-d']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " 'http://schema.org/Place/address': [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress'],\n", " 'http://schema.org/PostalAddress/addressRegion': ['NV'],\n", " 'http://schema.org/PostalAddress/addressLocality': ['Pahrump']}],\n", " '_label': ['http://careers.cnsjobmarket.psychiatrist.com/jobs/psychiatric-nurse-practitioner-pahrump-nv-108424726-d']}],\n", " [{'http://schema.org/Place/address': ['Arnhem'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],\n", " '_label': 
['http://emploi.lalibre.be/fr/emploi/37819/visual-merchandiser-zara-men-arnhem-fulltime']}]]" ] }, "execution_count": 671, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/jobLocation'))[:3]" ] }, { "cell_type": "code", "execution_count": 734, "metadata": {}, "outputs": [], "source": [ "def extract_subtype(rdf_type, subtype, json=True):\n", " if json:\n", " data_graphs = json_graphs\n", " else:\n", " data_graphs = graphs\n", " rdf_type = 'JobPosting/' + rdf_type\n", " return [loc[0] for loc in extract_property(data_graphs, rdf_type) if loc and isinstance(loc[0], dict) and loc[0].get('http://www.w3.org/1999/02/22-rdf-syntax-ns#type') == ['http://schema.org/' + subtype]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Totals (1843, 2820)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Common attributes for jobLocation" ] }, { "cell_type": "code", "execution_count": 749, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1760,\n", " 'http://schema.org/address': 1739,\n", " '_label': 1760,\n", " 'http://schema.org/geo': 100,\n", " 'http://schema.org/name': 55,\n", " 'http://schema.org/country': 4,\n", " 'http://schema.org/url': 4,\n", " 'http://schema.org/description': 1,\n", " 'http://schema.org/additionalProperty': 2,\n", " 'http://schema.org/image': 1,\n", " 'http://schema.org/Address': 1})" ] }, "execution_count": 749, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(y for x in extract_subtype('jobLocation', 'Place') for y in x)" ] }, { "cell_type": "code", "execution_count": 750, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/Place/address': 1236,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1347,\n", " '_label': 1347,\n", " 'http://schema.org/Place/name': 38,\n", " 'http://schema.org/Place/addressLocality': 20,\n", " 
'http://schema.org/Place/geo': 17,\n", " 'http://schema.org/Place/datePosted': 10,\n", " 'http://schema.org/Place/telephone': 9,\n", " 'http://schema.org/Place/addressRegion': 12,\n", " 'http://schema.org/Place/Address': 1,\n", " 'http://schema.org/Place/hasMap': 2,\n", " 'http://schema.org/Place/postalCode': 3,\n", " 'http://schema.org/Place/streetAddress': 4,\n", " 'http://schema.org/Place/url': 1,\n", " 'http://schema.org/Place/telepohone': 1,\n", " 'http://schema.org/Place/jobLocation': 1})" ] }, "execution_count": 750, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(y for x in extract_subtype('jobLocation', 'Place', False) for y in x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Job Location - Address" ] }, { "cell_type": "code", "execution_count": 790, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address:\n", " if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:\n", " c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])\n", " else:\n", " c.update([type(address[0])])" ] }, { "cell_type": "code", "execution_count": 791, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/PostalAddress': 1559,\n", " dict: 148,\n", " 'http://schema.org/postalAddress': 6,\n", " str: 26})" ] }, "execution_count": 791, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "code", "execution_count": 795, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if address:\n", " if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:\n", " 
c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])\n", " else:\n", " c.update([type(address[0])])" ] }, { "cell_type": "code", "execution_count": 796, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/PostalAddress': 1143,\n", " str: 60,\n", " 'http://schema.org/Postaladdress': 19,\n", " 'http:/schema.orgPostalAddress': 13,\n", " 'http://schema.org/Address': 1})" ] }, "execution_count": 796, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "code", "execution_count": 786, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " c.update(address[0].keys())" ] }, { "cell_type": "code", "execution_count": 821, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "UK\n", "N1H 3A1\n", "Amsterdam \n", "台灣台北市中正區襄陽路一號\n", "Industriveien 6,
 2020 Skedsmokorset
\n", "\n", "Chaponnay, Rhône-Alpes, Rhône, France\n", "東京都 千葉県 神奈川県 埼玉県を中心とした取引先企業\r\n", "※勤務地はご希望に応じます。\r\n", "※関東圏内での転勤の可能性あり\n", "China\n", "�ソス�ソス�ソス鼬ァ�ソスF�ソス�ソス�ソスs�ソス�ソスc�ソスR�ソス�ソス966�ソスF�ソス�ソス�ソス�ソス�ソス�ソス�ソス�ソス�ソスン地\n", "Symonds Yat East, Wye Valley\n" ] } ], "source": [ "i = 0\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], str):\n", " print(address[0])\n", " i+=1\n", " if i > 10: break" ] }, { "cell_type": "code", "execution_count": 787, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/addressCountry': 1423,\n", " 'http://schema.org/addressLocality': 1643,\n", " 'http://schema.org/addressRegion': 1509,\n", " 'http://schema.org/postalCode': 994,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1565,\n", " 'http://schema.org/streetAddress': 628,\n", " 'http://schema.org/url': 1,\n", " 'http://schema.org/name': 20,\n", " 'http://schema.org/postalcode': 1,\n", " 'http://schema.org/streetaddress': 1})" ] }, "execution_count": 787, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "md" ] }, { "cell_type": "code", "execution_count": 797, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if address and isinstance(address[0], dict):\n", " c.update(address[0].keys())" ] }, { "cell_type": "code", "execution_count": 798, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/PostalAddress/addressLocality': 962,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1176,\n", " 'http://schema.org/PostalAddress/addressRegion': 857,\n", " 'http://schema.org/PostalAddress/postalCode': 354,\n", " 'http://schema.org/PostalAddress/addressCountry': 447,\n", " 
'http://schema.org/PostalAddress/streetAddress': 206,\n", " 'http://schema.org/Postaladdress/addressLocality': 19,\n", " 'http://schema.org/PostalAddress/url': 2,\n", " 'http://schema.org/Postaladdress/addressRegion': 5,\n", " 'http://schema.org/PostalAddress/addresscountry': 1,\n", " 'http://schema.org/PostalAddress/name': 5,\n", " 'http://schema.org/PostalAddress/telephone': 10,\n", " 'http://schema.org/Address/addressLocality': 1,\n", " 'http://schema.org/PostalAddress/geo': 1})" ] }, "execution_count": 798, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### name" ] }, { "cell_type": "code", "execution_count": 822, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 55})" ] }, "execution_count": 822, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/name')\n", " if address:\n", " if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:\n", " c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])\n", " else:\n", " c.update([type(address[0])])\n", "c" ] }, { "cell_type": "code", "execution_count": 823, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 38})" ] }, "execution_count": 823, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/name')\n", " if address:\n", " if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:\n", " c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])\n", " else:\n", " c.update([type(address[0])])\n", "c" ] }, { "cell_type": "code", "execution_count": 824, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Southwark\n", "Birmingham\n", "Johnson & Johnson\n", "Wick, Caithness\n", "Cumbria\n" ] } ], "source": [ "i = 0\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/name')\n", " if address:\n", " print(address[0])\n", " i += 1\n", " if i>=5: break" ] }, { "cell_type": "code", "execution_count": 825, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Benin\n", "Amberg\n", "Челябинск\n", "Arlon\n", "Город:\n" ] } ], "source": [ "i = 0\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/name')\n", " if address:\n", " print(address[0])\n", " i += 1\n", " if i>=5: break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### geo" ] }, { "cell_type": "code", "execution_count": 799, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/geo')\n", " if address:\n", " if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:\n", " c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])\n", " else:\n", " c.update([type(address[0])])" ] }, { "cell_type": "code", "execution_count": 800, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/GeoCoordinates': 100})" ] }, "execution_count": 800, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "code", "execution_count": 801, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/geo')\n", " if address:\n", " if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:\n", " c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])\n", " else:\n", " 
c.update([type(address[0])])" ] }, { "cell_type": "code", "execution_count": 802, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/GeoCoordinates': 12, str: 5})" ] }, "execution_count": 802, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "code", "execution_count": 803, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/geo')\n", " if address and isinstance(address[0], dict):\n", " c.update(list(address[0]))" ] }, { "cell_type": "code", "execution_count": 804, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/latitude': 99,\n", " 'http://schema.org/longitude': 99,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 100,\n", " 'http://schema.org/address': 1})" ] }, "execution_count": 804, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "code", "execution_count": 815, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "54.727615356462,55.955778063477\n", "55.980490257187,37.299160243061\n", "47.76697408393,39.942479411045\n", "60.002559475351,30.268780856466\n", "58.003400123447,55.663826612107\n" ] } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/geo')\n", " if address and isinstance(address[0], str):\n", " print(address[0])" ] }, { "cell_type": "code", "execution_count": 813, "metadata": {}, "outputs": [], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/geo')\n", " if address and isinstance(address[0], dict):\n", " c.update(list(address[0]))" ] }, { "cell_type": "code", "execution_count": 814, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"Counter({'http://schema.org/GeoCoordinates/longitude': 12,\n", "         'http://schema.org/GeoCoordinates/latitude': 12,\n", "         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 12})" ] }, "execution_count": 814, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c" ] }, { "cell_type": "code", "execution_count": 806, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 30, float: 69})" ] }, "execution_count": 806, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", "    address = x.get('http://schema.org/geo')\n", "    if address and isinstance(address[0], dict):\n", "        address = address[0]\n", "        if 'http://schema.org/latitude' in address:\n", "            c.update([type(address['http://schema.org/latitude'][0])])\n", "c" ] }, { "cell_type": "code", "execution_count": 807, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 30, float: 69})" ] }, "execution_count": 807, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", "    address = x.get('http://schema.org/geo')\n", "    if address and isinstance(address[0], dict):\n", "        address = address[0]\n", "        if 'http://schema.org/longitude' in address:\n", "            c.update([type(address['http://schema.org/longitude'][0])])\n", "c" ] }, { "cell_type": "code", "execution_count": 817, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 12})" ] }, "execution_count": 817, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", "    address = x.get('http://schema.org/Place/geo')\n", "    if address and isinstance(address[0], dict):\n", "        address = address[0]\n", "        if 'http://schema.org/GeoCoordinates/latitude' in address:\n", "            c.update([type(address['http://schema.org/GeoCoordinates/latitude'][0])])\n", "c" ] }, { "cell_type": "code", 
"execution_count": 818, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18.424055299999964 -33.9248685\n", "-100.76 46.8\n", "0.000000 0.000000\n", "-3.43597299999999 55.378051\n", "-83.8261 33.5757\n", "-77.700485 39.633438\n", "-0.462222222 46.325\n", "10.6478 53.8672\n", "-75.694206 41.371868\n", "-92.017937 30.218462\n", "0.000000 0.000000\n", "8.045 52.84754\n" ] }, { "data": { "text/plain": [ "Counter()" ] }, "execution_count": 818, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/geo')\n", " if address and isinstance(address[0], dict):\n", " address = address[0]\n", " if 'http://schema.org/GeoCoordinates/longitude' in address:\n", " print(address['http://schema.org/GeoCoordinates/longitude'][0], address['http://schema.org/GeoCoordinates/latitude'][0])\n", "c" ] }, { "cell_type": "code", "execution_count": 809, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['45.6685554'] ['13.1040857']\n", "[52.48142] [-1.89983]\n", "[35.35423] [139.320407]\n", "[53.131759] [8.706955]\n", "['50.7426'] ['7.1339']\n", "[55.378051] [-3.43597299999999]\n", "[33.8870126] [130.8499488]\n", "[48.7] [9.6667]\n", "[58.43333] [-3.08333]\n", "['46.7956'] ['7.1538']\n", "[52.6386] [-1.13169]\n" ] } ], "source": [ "i = 0\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/geo')\n", " if address and isinstance(address[0], dict):\n", " address = address[0]\n", " if 'http://schema.org/latitude' in address:\n", " i+=1\n", " print(address['http://schema.org/latitude'], address['http://schema.org/longitude'])\n", " if i > 10:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Postal Address" ] }, { "cell_type": "code", "execution_count": 827, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"Counter({'http://schema.org/addressCountry': 1423,\n", " 'http://schema.org/addressLocality': 1643,\n", " 'http://schema.org/addressRegion': 1509,\n", " 'http://schema.org/postalCode': 994,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1565,\n", " 'http://schema.org/streetAddress': 628,\n", " 'http://schema.org/url': 1,\n", " 'http://schema.org/name': 20,\n", " 'http://schema.org/postalcode': 1,\n", " 'http://schema.org/streetaddress': 1})" ] }, "execution_count": 827, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " c.update(address[0].keys())\n", "c" ] }, { "cell_type": "code", "execution_count": 828, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/PostalAddress/addressLocality': 962,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1176,\n", " 'http://schema.org/PostalAddress/addressRegion': 857,\n", " 'http://schema.org/PostalAddress/postalCode': 354,\n", " 'http://schema.org/PostalAddress/addressCountry': 447,\n", " 'http://schema.org/PostalAddress/streetAddress': 206,\n", " 'http://schema.org/Postaladdress/addressLocality': 19,\n", " 'http://schema.org/PostalAddress/url': 2,\n", " 'http://schema.org/Postaladdress/addressRegion': 5,\n", " 'http://schema.org/PostalAddress/addresscountry': 1,\n", " 'http://schema.org/PostalAddress/name': 5,\n", " 'http://schema.org/PostalAddress/telephone': 10,\n", " 'http://schema.org/Address/addressLocality': 1,\n", " 'http://schema.org/PostalAddress/geo': 1})" ] }, "execution_count": 828, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = Counter()\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if address and isinstance(address[0], dict):\n", " c.update(address[0].keys())\n", "c" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "#### addressCountry" ] }, { "cell_type": "code", "execution_count": 837, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['United States', 'NL', 'GB', 'CA', 'United States']" ] }, "execution_count": 837, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/addressCountry')\n", " if a:\n", " c.append(a[0])\n", "c[:5]" ] }, { "cell_type": "code", "execution_count": 834, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 1346, dict: 77})" ] }, "execution_count": 834, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 840, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 76,\n", " 'http://schema.org/name': 76})" ] }, "execution_count": 840, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(k for a in c for k in a if isinstance(a, dict))" ] }, { "cell_type": "code", "execution_count": 841, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/Country': 152})" ] }, "execution_count": 841, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(a['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0] for a in c for k in a if isinstance(a, dict))" ] }, { "cell_type": "code", "execution_count": 853, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Schweiz', 'Czech Republic', 'Belgium', 'United States', 'US']" ] }, "execution_count": 853, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if 
address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/PostalAddress/addressCountry')\n", " if a:\n", " c.append(a[0])\n", "c[:5]" ] }, { "cell_type": "code", "execution_count": 854, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({str: 437, dict: 10})" ] }, "execution_count": 854, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the microdata, the 10 dict values for addressCountry are all empty:" ] }, { "cell_type": "code", "execution_count": 857, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]" ] }, "execution_count": 857, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[a for a in c if isinstance(a, dict)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the JSON-LD, when addressCountry is a dict, the country name is stored under its name property:" ] }, { "cell_type": "code", "execution_count": 865, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Italia', 'IN', 'PL', 'US', 'UA'], 76, Counter({str: 76}))" ] }, "execution_count": 865, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/addressCountry')\n", " if a and isinstance(a[0], dict):\n", " name = a[0].get('http://schema.org/name')\n", " if name:\n", " c.append(name[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### addressLocality" ] }, { "cell_type": "code", "execution_count": 870, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Glenview',\n", " 'Den Bosch',\n", " 'Maidenhead',\n", " 'Saint-Jean-sur-Richelieu',\n", " 'Imperial'],\n", " 1643,\n", " Counter({str: 1643}))" ] }, "execution_count": 870, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in 
extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/addressLocality')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 872, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Reggio Emilia provincia',\n", " 'Pahrump',\n", " 'Philadelphia',\n", " 'Norcross',\n", " 'Hillsboro'],\n", " 962,\n", " Counter({str: 952, dict: 10}))" ] }, "execution_count": 872, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/PostalAddress/addressLocality')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 874, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]" ] }, "execution_count": 874, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[a for a in c if isinstance(a, dict)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### addressRegion" ] }, { "cell_type": "code", "execution_count": 877, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['IL', 'NB', 'Berkshire', 'QC', 'California'], 1509, Counter({str: 1509}))" ] }, "execution_count": 877, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/addressRegion')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 879, "metadata": {}, "outputs": [ 
{ "data": { "text/plain": [ "(['Reggio Emilia provincia',\n", " 'Pahrump',\n", " 'Philadelphia',\n", " 'Norcross',\n", " 'Hillsboro'],\n", " 962,\n", " Counter({str: 952, dict: 10}))" ] }, "execution_count": 879, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/PostalAddress/addressLocality')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 880, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]" ] }, "execution_count": 880, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[a for a in c if isinstance(a, dict)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### postalCode" ] }, { "cell_type": "code", "execution_count": 881, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['60026', '5223 MA', 'SL6 8ND', 'J3A1B6', '92251'],\n", " 994,\n", " Counter({str: 972, int: 22}))" ] }, "execution_count": 881, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/postalCode')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 882, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['19113', '30071', '97124', '45804', '95841'],\n", " 354,\n", " Counter({str: 344, dict: 10}))" ] }, "execution_count": 882, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = 
x.get('http://schema.org/Place/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/PostalAddress/postalCode')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### streetAddress" ] }, { "cell_type": "code", "execution_count": 884, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['21 Lassell Gardens',\n", " '-',\n", " 'East Aten Road 380',\n", " '11101 South Parker Rd',\n", " '古城町4丁目53'],\n", " 628,\n", " Counter({str: 628}))" ] }, "execution_count": 884, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place'):\n", " address = x.get('http://schema.org/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/streetAddress')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 885, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['UNKNOWN',\n", " '8, Rue du Pont',\n", " 'Luxembourg',\n", " '-',\n", " 'Nr Mulki \\nBappanadu tempale'],\n", " 206,\n", " Counter({str: 205, dict: 1}))" ] }, "execution_count": 885, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('jobLocation', 'Place', False):\n", " address = x.get('http://schema.org/Place/address')\n", " if address and isinstance(address[0], dict):\n", " a = address[0].get('http://schema.org/PostalAddress/streetAddress')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Base Salary" ] }, { "cell_type": "code", "execution_count": 887, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'http://schema.org/MonetaryAmount': 847,\n", " 'Unknown Object': 5,\n", " str: 12}),\n", " Counter({'http://schema.org/MonetaryAmount': 
320,\n", " str: 234,\n", " 'https://schema.org/MonetaryAmount': 34,\n", " 'http://schema.org/PriceSpecification': 4,\n", " 'https://schema.org/PriceSpecification': 1,\n", " 'http:/schema.orgMonetaryAmount': 4}))" ] }, "execution_count": 887, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'baseSalary')), Counter(extract_types(graphs, 'JobPosting/baseSalary'))" ] }, { "cell_type": "code", "execution_count": 893, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/MonetaryAmount': 847})" ] }, "execution_count": 893, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " dtype = x.get('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')\n", " if dtype:\n", " c.append(dtype[0])\n", "Counter(c)" ] }, { "cell_type": "code", "execution_count": 894, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/MonetaryAmount': 320})" ] }, "execution_count": 894, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " dtype = x.get('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')\n", " if dtype:\n", " c.append(dtype[0])\n", "Counter(c)" ] }, { "cell_type": "code", "execution_count": 896, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/currency': 692,\n", " 'http://schema.org/value': 814,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 847,\n", " '_label': 847,\n", " 'http://schema.org/minValue': 28,\n", " 'http://schema.org/maxValue': 28,\n", " 'http://schema.org/unitText': 11,\n", " 'http://schema.org/validFrom': 1,\n", " 'http://schema.org/validThrough': 1,\n", " 'http://schema.org/name': 1,\n", " 'http://schema.org/description': 1})" ] }, "execution_count": 896, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", 
"for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " c += list(x)\n", "Counter(c)" ] }, { "cell_type": "code", "execution_count": 897, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/MonetaryAmount/value': 244,\n", " 'http://schema.org/MonetaryAmount/currency': 311,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 320,\n", " '_label': 320,\n", " 'http://schema.org/MonetaryAmount/maxValue': 49,\n", " 'http://schema.org/MonetaryAmount/minValue': 67,\n", " 'http://schema.org/MonetaryAmount/unitText': 9,\n", " 'http://schema.org/MonetaryAmount/baseSalary': 1})" ] }, "execution_count": 897, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " c += list(x)\n", "Counter(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### currency" ] }, { "cell_type": "code", "execution_count": 900, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['GBP', 'USD', 'USD', 'USD', 'JPY'], 692, Counter({str: 692}))" ] }, "execution_count": 900, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/currency')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 901, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['USD', 'USD', 'USD', 'RUB', 'EUR'], 311, Counter({str: 311}))" ] }, "execution_count": 901, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/currency')\n", " if a:\n", " c.append(a[0])\n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### value" ] }, { "cell_type": "code", "execution_count": 911, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "(['0.00',\n", " 'nach Vereinbarung',\n", " '12500-28500/-',\n", " '-',\n", " '',\n", " '25000',\n", " 'A convenir',\n", " 'Hourly',\n", " '25000',\n", " '$10,500'],\n", " 814,\n", " Counter({dict: 785, str: 28, bool: 1}))" ] }, "execution_count": 911, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/value')\n", " if a:\n", " c.append(a[0])\n", "[_ for _ in c if type(_) == str][:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 916, "metadata": {}, "outputs": [], "source": [ "rdftype = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'" ] }, { "cell_type": "code", "execution_count": 918, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/QuantitativeValue': 780,\n", " 'http://schema.org/PropertyValue': 2,\n", " 'http://schema.org/MonetaryAmount': 1})" ] }, "execution_count": 918, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter([_[rdftype][0] for _ in c if isinstance(_, dict) and rdftype in _])" ] }, { "cell_type": "code", "execution_count": 919, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['\\n \\n \\n \\n 9\\n 10\\n \\n HOUR\\n \\n ',\n", " '\\n \\n \\n \\n 120000\\n \\n YEAR\\n \\n ',\n", " '\\n \\n \\n \\n 12.00\\n \\n HOUR\\n \\n ',\n", " '\\n \\n \\n \\n 150000.00\\n \\n YEAR\\n \\n ',\n", " '\\n \\n \\n \\n 47500\\n \\n YEAR\\n \\n ',\n", " 'As Per Rules',\n", " '\\n A convenir\\n Year\\n ',\n", " '\\n \\n \\n \\n 195000\\n \\n YEAR\\n \\n ',\n", " 'Null',\n", " '\\n \\n \\n \\n 140000\\n \\n YEAR\\n \\n '],\n", " 244,\n", " Counter({str: 109, dict: 135}))" ] }, "execution_count": 919, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/value')\n", " if 
a:\n", " c.append(a[0])\n", "[_ for _ in c if type(_) == str][:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 920, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'https://schema.org/QuantitativeValue': 2,\n", " 'http://schema.org/QuantitativeValue': 133})" ] }, "execution_count": 920, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter([_[rdftype][0] for _ in c if isinstance(_, dict) and rdftype in _])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quantitative Value" ] }, { "cell_type": "code", "execution_count": 930, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/unitText': 643,\n", " 'http://schema.org/minValue': 307,\n", " 'http://schema.org/value': 532,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 780,\n", " 'http://schema.org/maxValue': 298,\n", " 'http://schema.org/Value': 3,\n", " 'http://schema.org/maxvalue': 2,\n", " 'http://schema.org/description': 1})" ] }, "execution_count": 930, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " c += a[0]\n", "Counter(c)" ] }, { "cell_type": "code", "execution_count": 931, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/QuantitativeValue/minValue': 115,\n", " 'http://schema.org/QuantitativeValue/unitText': 132,\n", " 'http://schema.org/QuantitativeValue/value': 43,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 133,\n", " 'http://schema.org/QuantitativeValue/maxValue': 81})" ] }, "execution_count": 931, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = 
x.get('http://schema.org/MonetaryAmount/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " c += a[0]\n", "Counter(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### unitText" ] }, { "cell_type": "code", "execution_count": 935, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['WEEK', 'HOUR', 'YEAR', 'p.a.', 'HOUR'], 643, Counter({str: 643}))" ] }, "execution_count": 935, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/unitText')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 937, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('YEAR', 262),\n", " ('MONTH', 149),\n", " ('HOUR', 118),\n", " ('', 33),\n", " ('DAY', 21),\n", " ('ANNUM', 20),\n", " ('year', 7),\n", " ('WEEK', 5),\n", " ('Month', 3),\n", " ('-', 3)]" ] }, "execution_count": 937, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(Counter(c).items(), key=lambda x:x[1], reverse=True)[:10]" ] }, { "cell_type": "code", "execution_count": 939, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Null', 'MONTH', 'MONTH', 'MONTH', 'MONTH'], 132, Counter({str: 132}))" ] }, "execution_count": 939, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/QuantitativeValue/unitText')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, 
c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### minValue" ] }, { "cell_type": "code", "execution_count": 940, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['400', '69.00', 0, '850', 0], 307, Counter({str: 111, int: 159, float: 37}))" ] }, "execution_count": 940, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/minValue')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 941, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['0.0', '6000', '50000', '35000', '76000'], 115, Counter({str: 115}))" ] }, "execution_count": 941, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/QuantitativeValue/minValue')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### maxValue" ] }, { "cell_type": "code", "execution_count": 945, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['550', '69.00', 0, '1000', 0], 298, Counter({str: 107, int: 154, float: 37}))" ] }, "execution_count": 945, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/maxValue')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] 
}, { "cell_type": "code", "execution_count": 946, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['0.0', '10000', '150000', '112000', '65000000'], 81, Counter({str: 81}))" ] }, "execution_count": 946, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/QuantitativeValue/maxValue')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### value" ] }, { "cell_type": "code", "execution_count": 947, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['400', 0, '', 0, '£30000.00 - £35000.00 per annum'],\n", " 532,\n", " Counter({str: 420, int: 71, float: 40, dict: 1}))" ] }, "execution_count": 947, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/value')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 948, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Null', '6000', '65000000', '80000', '27000'], 43, Counter({str: 43}))" ] }, "execution_count": 948, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/value')\n", " if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:\n", " v = a[0].get('http://schema.org/QuantitativeValue/value')\n", " if v:\n", " c.append(v[0]) \n", "c[:5], len(c), Counter(map(type, c))" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Monetary Amount minvalue" ] }, { "cell_type": "code", "execution_count": 950, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([25000,\n", " 30000000,\n", " '1000',\n", " '40000',\n", " '20,000/-',\n", " '',\n", " 10000000,\n", " 10000000,\n", " '40000',\n", " '40000'],\n", " 28,\n", " Counter({int: 10, str: 17, float: 1}))" ] }, "execution_count": 950, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/minValue')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 954, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['30000',\n", " '35000',\n", " '43000',\n", " '40000',\n", " '9.00',\n", " '41900',\n", " '9.94',\n", " '52000',\n", " '58000',\n", " '60000'],\n", " 67,\n", " Counter({str: 67}))" ] }, "execution_count": 954, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/minValue')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 953, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([30000,\n", " 0,\n", " '21000',\n", " '50000',\n", " '42,000/-',\n", " '',\n", " 12000000,\n", " 12000000,\n", " '50000',\n", " '50000'],\n", " 28,\n", " Counter({int: 10, str: 17, float: 1}))" ] }, "execution_count": 953, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount'):\n", " a = x.get('http://schema.org/maxValue')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 955, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"(['40000',\n", " '60000',\n", " '78000',\n", " '9.00',\n", " '76000',\n", " '9.96',\n", " '52000',\n", " '75000',\n", " '15000',\n", " '13,500,000'],\n", " 49,\n", " Counter({str: 49}))" ] }, "execution_count": 955, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('baseSalary', 'MonetaryAmount', False):\n", " a = x.get('http://schema.org/MonetaryAmount/maxValue')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Date Posted" ] }, { "cell_type": "code", "execution_count": 610, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'http://schema.org/Date': 1835, str: 2}),\n", " Counter({str: 1617, datetime.date: 206}))" ] }, "execution_count": 610, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'datePosted')), Counter(extract_types(graphs, 'JobPosting/datePosted'))" ] }, { "cell_type": "code", "execution_count": 672, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[rdflib.term.Literal('2019-08-01 17:48:55', datatype=rdflib.term.URIRef('http://schema.org/Date'))],\n", " [rdflib.term.Literal('2019-07-09', datatype=rdflib.term.URIRef('http://schema.org/Date'))],\n", " [rdflib.term.Literal('2014-12-13T00:43:45', datatype=rdflib.term.URIRef('http://schema.org/Date'))]]" ] }, "execution_count": 672, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'datePosted'))[:3]" ] }, { "cell_type": "code", "execution_count": 673, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['11/20/2019 08:55:24 AM'], ['2019-10-30'], ['\\nSeptember 24, 2017\\n']]" ] }, "execution_count": 673, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/datePosted'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hiring Organization" ] }, { 
"cell_type": "code", "execution_count": 612, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'http://schema.org/Organization': 1731,\n", " str: 45,\n", " 'http://schema.org/EmploymentAgency': 2,\n", " 'URI': 18,\n", " 'Unknown Object': 16}),\n", " Counter({'http://schema.org/Organization': 923,\n", " 'URI': 172,\n", " str: 499,\n", " 'http:/schema.orgOrganization': 12,\n", " 'https://schema.org/Organization': 59,\n", " 'http://schema.org/LocalBusiness': 2,\n", " 'https:/schema.orgOrganization': 1,\n", " 'http://schema.org/Healthclub': 1,\n", " 'http://schema.org/EmploymentAgency': 1,\n", " 'http://schema.org/Corporation': 2}))" ] }, "execution_count": 612, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'hiringOrganization')), Counter(extract_types(graphs, 'JobPosting/hiringOrganization'))" ] }, { "cell_type": "code", "execution_count": 674, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],\n", " 'http://schema.org/name': ['Anixter International'],\n", " '_label': ['http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us']}],\n", " [{'http://schema.org/logo': ['https://dgivdslhqe3qo.cloudfront.net/careers/photos/41241/thumb_photo_1504517641.png'],\n", " 'http://schema.org/name': ['Stage lopen bij Social Deal'],\n", " 'http://schema.org/sameAs': ['https://www.socialdeal.nl',\n", " 'https://twitter.com/SocialDeal_NL',\n", " 'https://www.instagram.com/social.deal/',\n", " 'https://www.facebook.com/SocialDealNL/?fref=ts',\n", " 'https://www.linkedin.com/company/social-deal?trk=nav_account_sub_nav_company_admin'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],\n", " '_label': ['http://stage.socialdeal.nl/o/stage-commerciele-economie-2']}],\n", " 
[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],\n", " 'http://schema.org/name': ['A-Z Poster Distribution'],\n", " 'http://schema.org/sameAs': ['http://www.poster-campaign.com/poster-distributors/'],\n", " '_label': ['http://www.poster-campaign.com/poster-distributors/']}]]" ] }, "execution_count": 674, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'hiringOrganization'))[:3]" ] }, { "cell_type": "code", "execution_count": 675, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'http://schema.org/Organization/name': ['Manpower S.r.l.'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],\n", " '_label': ['http://cambiolavoro.com/clav/bacheca.nsf/AnnunciDiLavoroNew/ADDETTO_ALLA_PIANIFICAZIONE_DELLA_PRODUZIONE_JUNIOR_REGIONE_EMILIA_ROMAGNA_REGGIO_EMILIA_2F4B6DB7F4B2420DC1258486004FDCCA?OpenDocument']}],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],\n", " 'http://schema.org/Organization/name': ['FCS'],\n", " '_label': ['http://careers.cnsjobmarket.psychiatrist.com/jobs/psychiatric-nurse-practitioner-pahrump-nv-108424726-d']}],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],\n", " 'http://schema.org/Organization/name': ['Zara'],\n", " '_label': ['http://emploi.lalibre.be/fr/emploi/37819/visual-merchandiser-zara-men-arnhem-fulltime']}]]" ] }, "execution_count": 675, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/hiringOrganization'))[:3]" ] }, { "cell_type": "code", "execution_count": 959, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1731,\n", " 'http://schema.org/name': 1724,\n", " '_label': 1731,\n", " 'http://schema.org/logo': 987,\n", " 'http://schema.org/sameAs': 1110,\n", " 
'http://schema.org/url': 86,\n", " 'http://schema.org/department': 1,\n", " 'http://schema.org/address': 8,\n", " 'http://schema.org/email': 5,\n", " 'http://schema.org/employee': 1,\n", " 'http://schema.org/image': 21,\n", " 'http://schema.org/description': 15,\n", " 'http://schema.org/aggregateRating': 1,\n", " 'http://schema.org/telephone': 5,\n", " 'http://schema.org/contactPoint': 14,\n", " 'http://schema.org/legalName': 4,\n", " 'http://schema.org/knowsAbout': 1,\n", " 'http://schema.org/brand': 1,\n", " 'http://schema.org/location': 1})" ] }, "execution_count": 959, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization'):\n", " c += x\n", "Counter(c)" ] }, { "cell_type": "code", "execution_count": 960, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/Organization/name': 898,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 923,\n", " '_label': 923,\n", " 'http://schema.org/Organization/sameAs': 169,\n", " 'http://schema.org/Organization/logo': 285,\n", " 'http://schema.org/Organization/url': 214,\n", " 'http://schema.org/Organization/employmentType': 22,\n", " 'http://schema.org/Organization/jobLocation': 22,\n", " 'http://schema.org/Organization/description': 31,\n", " 'http://schema.org/Organization/legalName': 34,\n", " 'http://schema.org/Organization/telephone': 24,\n", " 'http://schema.org/Organization/address': 29,\n", " 'http://schema.org/Organization/brand': 1,\n", " 'http://schema.org/Organization/department': 2,\n", " 'http://schema.org/Organization/sameAS': 1,\n", " 'http://schema.org/Organization/image': 6,\n", " 'http://schema.org/Organization/email': 9,\n", " 'http://schema.org/Organization/employee': 3,\n", " 'http://schema.org/Organization/faxNumber': 4,\n", " 'http://schema.org/Organization/title': 1,\n", " 'http://schema.org/Organization/aggregateRating': 1,\n", " 'http://schema.org/Organization/foundingDate': 
2,\n", " 'http://schema.org/Organization/member': 2,\n", " 'http://schema.org/Organization/baseSalary': 1,\n", " 'http://schema.org/Organization/contactPoint': 3,\n", " 'http://schema.org/Organization/datePosted': 1,\n", " 'http://schema.org/Organization/location': 1,\n", " 'http://schema.org/Organization/legalname': 1,\n", " 'http://schema.org/Organization/occupationalCategory': 1})" ] }, "execution_count": 960, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization', False):\n", " c += x\n", "Counter(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### name" ] }, { "cell_type": "code", "execution_count": 961, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Anixter International',\n", " 'Stage lopen bij Social Deal',\n", " 'A-Z Poster Distribution',\n", " 'Division Industrielle',\n", " 'Imperial Valley College',\n", " 'FedEx',\n", " '辛麺屋 桝元',\n", " 'Africa Jobs | CA Global Headhunters',\n", " 'Bonnier News',\n", " 'Bold'],\n", " 1724,\n", " Counter({str: 1724}))" ] }, "execution_count": 961, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization'):\n", " a = x.get('http://schema.org/name')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 962, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Manpower S.r.l.',\n", " 'FCS',\n", " 'Zara',\n", " 'LGC Associates, LLC',\n", " 'Corporate & Technical Recruiters, Inc.',\n", " 'Integrated Talent Strategies',\n", " 'TempStar',\n", " 'http://vieclam.hufi.edu.vn/viec-lam-cong-ty-cong-ty-tnhh-sieu-nhat-thanh-e3909-vi',\n", " 'Vertrouwelijk',\n", " 'Lelie zorggroep\\xa0'],\n", " 898,\n", " Counter({str: 898}))" ] }, "execution_count": 962, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in 
extract_subtype('hiringOrganization', 'Organization', False):\n", " a = x.get('http://schema.org/Organization/name')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### sameAs" ] }, { "cell_type": "code", "execution_count": 963, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['https://www.socialdeal.nl',\n", " 'http://www.poster-campaign.com/poster-distributors/',\n", " 'https://dev.prim-web.com/integration',\n", " 'https://www.imperial.edu',\n", " 'https://careers.fedex.com',\n", " 'https://caglobal.catsone.com/careers/35041-General/jobs/12462383-Afreximbank-Associate-Intra-African-Trade-Initiative-Junior-Professional-Programme-Cairo-Egypt?host=caglobal.catsone.com&portalID=37801',\n", " 'https://www.bonniernews.se/bonnier-news-tech/',\n", " 'https://www.linkedin.com/company/boldteam',\n", " 'https://jobs.marriott.com',\n", " 'https://employeebenefitsjobs.com/m/job.cgi?n=H151599'],\n", " 1110,\n", " Counter({str: 1110}))" ] }, "execution_count": 963, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization'):\n", " a = x.get('http://schema.org/sameAs')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 964, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['http://www.lgcassociates.com',\n", " 'http://www.ctrecruiters.com',\n", " 'http://www.wehirepeople.com',\n", " 'http://www.tempstarstaffing.com',\n", " 'https://www.realstreet.com',\n", " 'https://www.geckohospitality.com',\n", " 'http://www.ktemedicaljobs.com',\n", " 'http://www.anodyne-services.com',\n", " 'http://www.reply.com/',\n", " 'https://wuzzuf.net/jobs/careers/Ain-Shams-University-Egypt-17109'],\n", " 169,\n", " Counter({str: 169}))" ] }, "execution_count": 964, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", 
"for x in extract_subtype('hiringOrganization', 'Organization', False):\n", " a = x.get('http://schema.org/Organization/sameAs')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### logo" ] }, { "cell_type": "code", "execution_count": 987, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['https://dgivdslhqe3qo.cloudfront.net/careers/photos/41241/thumb_photo_1504517641.png',\n", " 'https://dev.prim-web.com/logo.png',\n", " 'https://academiccareers.com/files/pictures/Imperial_Valley_College.jpg',\n", " 'https://arbeit.nifty.com/img/renewal/gfj/arbeit_icon.png',\n", " 'https://media-eu.jobylon.com/CACHE/companies/company-logo/bonnier-news/bonniernews_logga.cf400009/50543f78631f256ad1ec83aa48286362.jpg',\n", " 'https://dgivdslhqe3qo.cloudfront.net/careers/photos/138241/thumb_photo_1571770896.png',\n", " 'https://assets.jibecdn.com/prod/marriott/0.0.102/assets/brands/gaylord_hotels.jpg',\n", " 'https://d3jh33bzyw1wep.cloudfront.net/s3/W1siZiIsIjIwMTgvMDMvMjYvMDkvMzEvMzYvNTUyL2hheXMgbmV3LmpwZyJdXQ',\n", " 'https://kaigoworker.jp/img/gfjimg_kaigo.png',\n", " 'https://s3.amazonaws.com/resumator/customer_20170727203532_LH43VKY3ZSIPLHSC/logos/20170816150621_Image-PNG-Transparent-Exact-Large.png'],\n", " 987,\n", " Counter({str: 969, dict: 18}))" ] }, "execution_count": 987, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization'):\n", " a = x.get('http://schema.org/logo')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 988, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/imageObject': 1,\n", " 'http://schema.org/ImageObject': 16})" ] }, "execution_count": 988, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter([a[rdftype][0] for a in c if isinstance(a, dict) 
and rdftype in a])" ] }, { "cell_type": "code", "execution_count": 989, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 17,\n", " 'http://schema.org/url': 17,\n", " 'http://schema.org/name': 5,\n", " 'http://schema.org/height': 9,\n", " 'http://schema.org/width': 9,\n", " 'http://schema.org/alternateName': 1})" ] }, "execution_count": 989, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter([k for a in c if isinstance(a, dict) and rdftype in a for k in a])" ] }, { "cell_type": "code", "execution_count": 990, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/imageObject'],\n", " 'http://schema.org/url': ['https://www.hiq.se/globalassets/bilder/hiq_bg_bild_some.jpg']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/url': ['public://styles/logo/public/sub-organisations/L&CDUNDEE.png']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/url': ['https://teltonika-iot-group.com/img/teltonika-logo-blue.png']},\n", " {'http://schema.org/name': ['TRN Logo with Website'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/url': ['https://i1.wp.com/www.ohioworksnow.com/wp-content/uploads/company_logos/2019/10/TRN-Logo-with-Website-23.jpg?fit=1800%2C1043'],\n", " 'http://schema.org/height': [1043],\n", " 'http://schema.org/width': [1800]},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/url': ['https://dbx9jsyriv02l.cloudfront.net/website/company-profile/3121/volkswagen_financial_services_vwfsuk_profile_200x200_1509098984.png'],\n", " 'http://schema.org/alternateName': ['company logo']},\n", " 
{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/url': ['https://teltonika-gps.com/img/teltonika-logo-blue.png']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/width': [600],\n", " 'http://schema.org/height': [60],\n", " 'http://schema.org/url': ['https://s3-eu-west-1.amazonaws.com/park-je/uploads/public/580/49a/c9b/58049ac9b38db372654069.png']},\n", " {'http://schema.org/width': [150],\n", " 'http://schema.org/height': [75],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/name': ['Company Logo for User #2 (jobsphnet)'],\n", " 'http://schema.org/url': ['https://pn9uz32ejav3o9drn23skfub-wpengine.netdna-ssl.com/wp-content/uploads/company_logos/2018/08/tpcunitednewlogosmall40-150x75-1_company_logo.png']},\n", " {'http://schema.org/height': [78],\n", " 'http://schema.org/url': ['https://technicaljobs.ie/wp-content/uploads/company_logos/2014/08/peglobal-logo-v21.png'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/width': [244],\n", " 'http://schema.org/name': ['peglobal-logo-v2']},\n", " {'http://schema.org/url': ['https://www.sdim.nl/wp-content/uploads/2019/08/logo-sdi.png'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']}]" ] }, "execution_count": 990, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[a for a in c if isinstance(a, dict) and rdftype in a][:10]" ] }, { "cell_type": "code", "execution_count": 991, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['Null',\n", " 'http://cdn.haleymarketing.com/templates/61968/logos/ctrecruiters-socialmedia.png',\n", " 'Null',\n", " 'http://cdn.haleymarketing.com/templates/62095/logos/tempstarstaffing-hml.png',\n", " {'http://schema.org/ImageObject/contentUrl': 
['https://bancadati.corrierelavoro.ch/custom_corrieredelticino/media/logo/logo_2545887.jpg'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},\n", " 'https://slb3.adicio.com/files/ys-c-02/2014-05/30/09/27/web_5388b18fb225388b18f3b3f7.jpg',\n", " 'https://cdn.nationalevacaturebank.nl/vacature/logo/8945397/152x54',\n", " 'https://slb3.adicio.com/files/ys-c-02/2019-03/19/08/31/5c910b4bb645.png',\n", " 'Null',\n", " 'https://slb4.adicio.com/files/ys-c-01/2019-06/25/12/47/5d127a5d93f7.png'],\n", " 285,\n", " Counter({str: 254, dict: 31}))" ] }, "execution_count": 991, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization', False):\n", " a = x.get('http://schema.org/Organization/logo')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 992, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'http://schema.org/ImageObject/contentUrl': 30,\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 31,\n", " 'https://schema.org/ImageObject/url': 1,\n", " 'https://schema.org/ImageObject/height': 1,\n", " 'https://schema.org/ImageObject/width': 1})" ] }, "execution_count": 992, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter([k for a in c if isinstance(a, dict) and rdftype in a for k in a])" ] }, { "cell_type": "code", "execution_count": 993, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'http://schema.org/ImageObject/contentUrl': ['https://bancadati.corrierelavoro.ch/custom_corrieredelticino/media/logo/logo_2545887.jpg'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},\n", " {'http://schema.org/ImageObject/contentUrl': ['https://media.rabota.ru/processor/logo/small/2019/10/15/servis-zakaza-taksi-maksim3-e7a6a43b5602774de1f8a4384618689c.png'],\n", " 
'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/ImageObject/contentUrl': ['https://media.rabota.ru/processor/logo/small/2010/04/08/silajjn.gif']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/ImageObject/contentUrl': ['https://www.robots.jobs/jobs/robotics-research-engineer-in-pittsburgh-allegheny-county-pennsylvania-us///www.robots.jobs/app/jobs/company/5db300cd521982480f81198b/logo?ts=1572012491']},\n", " {'http://schema.org/ImageObject/contentUrl': ['https://kirov.rabota.ru/vacancy/42847623//static/images/company-no-logo.svg'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/ImageObject/contentUrl': ['https://yahroma.rabota.ru/vacancy/42901767//static/images/company-no-logo.svg']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/ImageObject/contentUrl': ['https://golitsyno.rabota.ru/vacancy/42532559//static/images/company-no-logo.svg']},\n", " {'http://schema.org/ImageObject/contentUrl': ['https://klimovsk.rabota.ru/vacancy/41865376//static/images/company-no-logo.svg'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},\n", " {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/ImageObject/contentUrl': ['https://careers.alispa.it/job/viewAd.php?job_id=11520&jobdescription=INFORMATICO_in-MASSA&language=it/Null']},\n", " {'http://schema.org/ImageObject/contentUrl': ['https://media.rabota.ru/processor/logo/small/2015/09/03/ooofiksprajjs.gif'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 
['http://schema.org/ImageObject']}]" ] }, "execution_count": 993, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[a for a in c if isinstance(a, dict) and rdftype in a][:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### url" ] }, { "cell_type": "code", "execution_count": 975, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['https://lavoro.informazione.it/offerte-di-lavoro-di-Iqm%20Selezione%20S.R.L.',\n", " 'https://venturefizz.com/jobs/boston/mid-market-sales-representative-boston-at-crimson-hexagon-boston-ma-0',\n", " 'https://www.adeccousa.com',\n", " 'https://jobs.merck.com/us/en/job/CLI008609/Senior-Clinical-Research-Associate-Oncology-San-Francisco',\n", " 'http://www.alibdaapalestine.com/',\n", " 'https://careers.oceaneering.com/global/en/job/15823/Designer',\n", " 'https://www.alphajump.de/unternehmen/ATLANTIC-Bonn',\n", " 'https://www.hiq.se/fi/',\n", " 'https://www.jobscout24.ch/de/job/charpentier-%C3%A8re/5117126/',\n", " 'https://job-like.com/company/375268/'],\n", " 86,\n", " Counter({str: 86}))" ] }, "execution_count": 975, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization'):\n", " a = x.get('http://schema.org/url')\n", " if a:\n", " c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "code", "execution_count": 976, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['http://www.lgcassociates.com',\n", " 'http://www.ctrecruiters.com',\n", " 'http://www.wehirepeople.com',\n", " 'http://www.tempstarstaffing.com',\n", " 'https://bancadati.corrierelavoro.ch/job/viewAd.php?job_id=6225073&jobdescription=FINANCIAL%20SYSTEMS%20CONSULTANT%20(6%20months%20fixed-term%20Contract)_in-Lugano&language=de//employer/viewCompany.php?id=2545887&companyName=sidler-sa',\n", " 
'https://diversity.careercast.com/jobs/network-build-provision-engineer-tysons-vienna-va-22180-115329630-d?contextType=browse//jobs/at-t-353757-cd',\n", " 'https://disability.careercast.com/jobs/system-business-analyst-migrations-6006846007152019-rotterdam-zuid-holland-3012-114980762-d//jobs/adp-1204821-cd',\n", " 'https://jobs.mashable.com/jobs/lead-cybersecurity-analyst-hunt-red-team-incident-response-platform-engineer-50640-riverwoods-il-60015-115001464-d//jobs/discover-1788278-cd',\n", " 'https://www.realstreet.com',\n", " 'https://medivacature.nl/vacatures/vakantiemedewerkers/raamwerk/showvac/272391//exit/www.hetraamwerk.nl'],\n", " 214,\n", " Counter({str: 214}))" ] }, "execution_count": 976, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = []\n", "for x in extract_subtype('hiringOrganization', 'Organization', False):\n", "    a = x.get('http://schema.org/Organization/url')\n", "    if a:\n", "        c.append(a[0])\n", "c[:10], len(c), Counter(map(type, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### validThrough" ] }, { "cell_type": "code", "execution_count": 613, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'http://schema.org/DateTime': 939,\n", " 'http://schema.org/Date': 174,\n", " str: 2}),\n", " Counter({str: 544, datetime.date: 103}))" ] }, "execution_count": 613, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'validThrough')), Counter(extract_types(graphs, 'JobPosting/validThrough'))" ] }, { "cell_type": "code", "execution_count": 676, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[rdflib.term.Literal('2019-11-11', datatype=rdflib.term.URIRef('http://schema.org/DateTime'))],\n", " [rdflib.term.Literal('1970-01-01T00:00:00', datatype=rdflib.term.URIRef('http://schema.org/Date'))],\n", " [rdflib.term.Literal('2019-12-11', datatype=rdflib.term.URIRef('http://schema.org/DateTime'))]]" ] }, "execution_count": 676, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'validThrough'))[:3]" ] }, { "cell_type": "code", "execution_count": 678, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['2019-11-28'], ['2019-11-29'], ['2019-12-22']]" ] }, "execution_count": 678, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/validThrough'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### url" ] }, { "cell_type": "code", "execution_count": 614, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'URI': 418, str: 1}),\n", " Counter({'URI': 325, str: 249, 'http://schema.org/URL': 1}))" ] }, "execution_count": 614, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'url')), Counter(extract_types(graphs, 'JobPosting/url'))" ] }, { "cell_type": "code", "execution_count": 679, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['https://academiccareers.com/job/4595/pt-faculty-pool-apprenticeship-electrical-iid/'],\n", " ['https://careers.fedex.com/office/jobs/26086-392004?lang=en-US'],\n", " ['https://arbeit.nifty.com/miyazaki/nobeoka-station/froma_Y002SEC1/']]" ] }, "execution_count": 679, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'url'))[:3]" ] }, { "cell_type": "code", "execution_count": 680, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['http://business.colbychamber.com/jobs/info/non-profit-and-social-services-abc-home-visitor-remote-location-170'],\n", " ['https://ad.searchwidget.nationalevacaturebank.nl/vacature/bladeren/Barneveld/Zinzia%20medisch%20verpleegkundige%20zorggroep/2//vacature/57f2a946-3b08-45be-8e48-d51e1c805d37/verpleegkundige'],\n", " ['https://buscadordetrabajo.cl/administrativo-contable//administracion-empresas/metropolitana/58207/alumno-practica-administrativo-contable']]" ] }, "execution_count": 680, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/url'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### industry" ] }, { "cell_type": "code", "execution_count": 615, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 722}), Counter({str: 580, 'URI': 7}))" ] }, "execution_count": 615, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'industry')), Counter(extract_types(graphs, 'JobPosting/industry'))" ] }, { "cell_type": "code", "execution_count": 616, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UNAVAILABLE Engineering Technology Education Sales \\\n", "0 20.000000 20.000000 12.000000 8.000000 7.000000 7.000000 \n", "pct 0.026455 0.026455 0.015873 0.010582 0.009259 0.009259 \n", "\n", " Information Technology Gesundheitswesen/Medizin/Soziales \\\n", "0 7.000000 7.000000 \n", "pct 0.009259 0.009259 \n", "\n", " Einzel- und Großhandel Healthcare Accounting Marketing \\\n", "0 6.000000 6.000000 5.000000 5.000000 \n", "pct 0.007937 0.007937 0.006614 0.006614 \n", "\n", " Banking & Financial Services Sales & Marketing Accountancy & Finance \\\n", "0 4.000000 4.000000 4.000000 \n", "pct 0.005291 0.005291 0.005291 \n", "\n", " Accounting & Finance Hospitality Finance Software Development \\\n", "0 4.000000 4.000000 4.000000 4.000000 \n", "pct 0.005291 0.005291 0.005291 0.005291 \n", "\n", " Maschinen-, Anlagen u. Fahrzeugbau \n", "0 4.000000 \n", "pct 0.005291 " ] }, "execution_count": 616, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(industry for industries in extract_property(json_graphs, 'industry') for industry in industries).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T" ] }, { "cell_type": "code", "execution_count": 617, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Продажи</th><th>Информационные технологии, интернет, телеком</th><th>Начало карьеры, студенты</th><th>Транспорт, логистика</th><th>Бухгалтерия, управленческий учет, финансы предприятия</th><th>Медицина, фармацевтика</th><th>Строительство, недвижимость</th><th>Null</th><th>Engineering</th><th>Производство</th><th>Наука, образование</th><th>Construction</th><th>Административный персонал</th><th>Electrical</th><th>Безопасность</th><th>Туризм, гостиницы, рестораны</th><th>Safety</th><th>Маркетинг, реклама, PR</th><th>Рабочий персонал</th><th>Manufacturing</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>41.000000</td><td>20.00000</td><td>16.000000</td><td>14.000000</td><td>12.000000</td><td>11.000000</td><td>10.00000</td><td>10.00000</td><td>9.000000</td><td>9.000000</td><td>9.000000</td><td>9.000000</td><td>9.000000</td><td>8.000000</td><td>7.000000</td><td>7.000000</td><td>7.000000</td><td>6.000000</td><td>6.000000</td><td>5.000000</td></tr>\n", "    <tr><th>pct</th><td>0.052767</td><td>0.02574</td><td>0.020592</td><td>0.018018</td><td>0.015444</td><td>0.014157</td><td>0.01287</td><td>0.01287</td><td>0.011583</td><td>0.011583</td><td>0.011583</td><td>0.011583</td><td>0.011583</td><td>0.010296</td><td>0.009009</td><td>0.009009</td><td>0.009009</td><td>0.007722</td><td>0.007722</td><td>0.006435</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Продажи Информационные технологии, интернет, телеком \\\n", "0 41.000000 20.00000 \n", "pct 0.052767 0.02574 \n", "\n", " Начало карьеры, студенты Транспорт, логистика \\\n", "0 16.000000 14.000000 \n", "pct 0.020592 0.018018 \n", "\n", " Бухгалтерия, управленческий учет, финансы предприятия \\\n", "0 12.000000 \n", "pct 0.015444 \n", "\n", " Медицина, фармацевтика Строительство, недвижимость Null \\\n", "0 11.000000 10.00000 10.00000 \n", "pct 0.014157 0.01287 0.01287 \n", "\n", " Engineering Производство Наука, образование Construction \\\n", "0 9.000000 9.000000 9.000000 9.000000 \n", "pct 0.011583 0.011583 0.011583 0.011583 \n", "\n", " Административный персонал Electrical Безопасность \\\n", "0 9.000000 8.000000 7.000000 \n", "pct 0.011583 0.010296 0.009009 \n", "\n", " Туризм, гостиницы, рестораны Safety Маркетинг, реклама, PR \\\n", "0 7.000000 7.000000 6.000000 \n", "pct 0.009009 0.009009 0.007722 \n", "\n", " Рабочий персонал Manufacturing \n", "0 6.000000 5.000000 \n", "pct 0.007722 0.006435 " ] }, "execution_count": 617, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(industry for industries in extract_property(graphs, 'JobPosting/industry') for industry in industries).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### educationRequirements" ] }, { "cell_type": "code", "execution_count": 681, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 187, 'http://schema.org/EducationalOccupationalCredential': 2}),\n", " Counter({str: 190}))" ] }, "execution_count": 681, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'educationRequirements')), Counter(extract_types(graphs, 'JobPosting/educationRequirements'))" ] }, { "cell_type": "code", "execution_count": 682, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UNAVAILABLE Abgeschlossene Berufsausbildung / Lehrabschluss \\\n", "0 21.000000 16.000000 13.000000 \n", "pct 0.111111 0.084656 0.068783 \n", "\n", " Not Specified Abschluss Hochschule / Berufsakademie / Duales Studium \\\n", "0 7.000000 4.000000 \n", "pct 0.037037 0.021164 \n", "\n", " MBO Không yêu cầu Sonstiges Not Applicable Mittlere Reife \\\n", "0 4.000000 4.000000 4.000000 4.000000 4.000000 \n", "pct 0.021164 0.021164 0.021164 0.021164 0.021164 \n", "\n", " Berufslehre Trung cấp Abitur \\\n", "0 4.000000 2.000000 2.000000 \n", "pct 0.021164 0.010582 0.010582 \n", "\n", " <p style="text-align: justify;">Diplu00f4mu00e9(e) du2019un Bac ou du Bac+2, vous justifiez de plusieurs annu00e9es du2019expu00e9rience en secru00e9tariat ou sur un poste u00e9quivalent.<br>Les outils bureautiques nu2019ont pas de secret pour vous. Vous u00eates capable de tenir une conversation, ru00e9diger, lire et comprendre un document relatif u00e0 votre activitu00e9 en anglais.</p> \\\n", "0 2.000000 \n", "pct 0.010582 \n", "\n", " HBO Degree None 学歴不問 Abitur / Fachabitur Vmbo \n", "0 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 \n", "pct 0.010582 0.010582 0.010582 0.010582 0.010582 0.010582 " ] }, "execution_count": 682, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'educationRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T" ] }, { "cell_type": "code", "execution_count": 683, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Null</th><th>\\n любое\\n</th><th>MBO</th><th>\\n не имеет значения</th><th>HBO</th><th>High School or Equivalent</th><th>Bachelor's Degree</th><th>не важно</th><th>\\n среднее\\n</th><th>Среднее</th><th>пїЅпїЅ пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅпїЅпїЅ</th><th>Не имеет значения</th><th>�� ����� ��������</th><th>WO</th><th>Degree</th><th>Не важно</th><th>BS</th><th>Overig</th><th>High School Diploma</th><th>\\n высшее</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>21.000000</td><td>20.000000</td><td>8.000000</td><td>7.000000</td><td>6.000000</td><td>6.000000</td><td>5.000000</td><td>5.000000</td><td>4.000000</td><td>3.000000</td><td>3.000000</td><td>3.000000</td><td>3.000000</td><td>3.000000</td><td>2.000000</td><td>2.000000</td><td>2.000000</td><td>2.000000</td><td>2.000000</td><td>2.000000</td></tr>\n", "    <tr><th>pct</th><td>0.106599</td><td>0.101523</td><td>0.040609</td><td>0.035533</td><td>0.030457</td><td>0.030457</td><td>0.025381</td><td>0.025381</td><td>0.020305</td><td>0.015228</td><td>0.015228</td><td>0.015228</td><td>0.015228</td><td>0.015228</td><td>0.010152</td><td>0.010152</td><td>0.010152</td><td>0.010152</td><td>0.010152</td><td>0.010152</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Null \\n любое\\n MBO \\\n", "0 21.000000 20.000000 8.000000 \n", "pct 0.106599 0.101523 0.040609 \n", "\n", " \\n не имеет значения HBO \\\n", "0 7.000000 6.000000 \n", "pct 0.035533 0.030457 \n", "\n", " High School or Equivalent Bachelor's Degree не важно \\\n", "0 6.000000 5.000000 5.000000 \n", "pct 0.030457 0.025381 0.025381 \n", "\n", " \\n среднее\\n Среднее \\\n", "0 4.000000 3.000000 \n", "pct 0.020305 0.015228 \n", "\n", " пїЅпїЅ пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅпїЅпїЅ Не имеет значения \\\n", "0 3.000000 3.000000 \n", "pct 0.015228 0.015228 \n", "\n", " �� ����� �������� WO Degree Не важно BS Overig \\\n", "0 3.000000 3.000000 2.000000 2.000000 2.000000 2.000000 \n", "pct 0.015228 0.015228 0.010152 0.010152 0.010152 0.010152 \n", "\n", " High School Diploma \\n высшее \n", "0 2.000000 2.000000 \n", "pct 0.010152 0.010152 " ] }, "execution_count": 683, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/educationRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### workHours" ] }, { "cell_type": "code", "execution_count": 627, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 145}), Counter({str: 284}))" ] }, "execution_count": 627, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'workHours')), Counter(extract_types(graphs, 'JobPosting/workHours'))" ] }, { "cell_type": "code", "execution_count": 686, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['differs from day to day'],\n", " ['UNAVAILABLE'],\n", " [''],\n", " ['11:00~24:00 週2日'],\n", " ['UNAVAILABLE'],\n", " ['10:00~19:00'],\n", " ['nach Vereinbarung'],\n", " ['32 hours per week'],\n", " ['a combinar'],\n", " ['A combinar.']]" ] }, "execution_count": 686, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'workHours'))[:10]" ] }, { "cell_type": "code", "execution_count": 687, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['16 - 24 uur'],\n", " ['32 - 40 uur'],\n", " ['40 uur'],\n", " ['40 hours per week'],\n", " ['\\n свободный график\\n '],\n", " ['\\n полный рабочий день\\n '],\n", " ['Arbeider '],\n", " ['Null'],\n", " ['полный рабочий день'],\n", " ['9:30 am - 6:30pm | Monday to Saturday']]" ] }, "execution_count": 687, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/workHours'))[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### experienceRequirements" ] }, { "cell_type": "code", "execution_count": 688, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 161, int: 3}), Counter({str: 252}))" ] }, "execution_count": 688, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'experienceRequirements')), Counter(extract_types(graphs, 'JobPosting/experienceRequirements'))" ] }, { "cell_type": "code", "execution_count": 691, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Mid Level Entry Level Experienced Không yêu cầu \\\n", "0 12.000000 11.000000 9.000000 8.000000 4.000000 \n", "pct 0.067797 0.062147 0.050847 0.045198 0.022599 \n", "\n", " Not Applicable \n", "0 3.000000 \n", "pct 0.016949 " ] }, "execution_count": 691, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'experienceRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(6).T" ] }, { "cell_type": "code", "execution_count": 692, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Null \\n не имеет значения\\n \\n от 1 года\\n \\\n", "0 23.000000 12.000000 6.000000 \n", "pct 0.089147 0.046512 0.023256 \n", "\n", " не требуется от 1 года от 3 лет \n", "0 5.00000 4.000000 3.000000 \n", "pct 0.01938 0.015504 0.011628 " ] }, "execution_count": 692, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/experienceRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(6).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### occupationalCategory" ] }, { "cell_type": "code", "execution_count": 634, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 166}), Counter({str: 226, 'URI': 3}))" ] }, "execution_count": 634, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'occupationalCategory')), Counter(extract_types(graphs, 'JobPosting/occupationalCategory'))" ] }, { "cell_type": "code", "execution_count": 636, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Information Technology Other Transportation Engineering \\\n", "0 4.000000 4.000000 4.000000 4.000000 \n", "pct 0.016878 0.016878 0.016878 0.016878 \n", "\n", " Hospitality Customer Service Education Retail Accounting \\\n", "0 3.000000 3.000000 3.000000 2.000000 2.000000 2.000000 \n", "pct 0.012658 0.012658 0.012658 0.008439 0.008439 0.008439 \n", "\n", " General Labor IT Skilled Labour Entry Level Corporate \\\n", "0 2.000000 2.000000 2.000000 2.000000 2.000000 \n", "pct 0.008439 0.008439 0.008439 0.008439 0.008439 \n", "\n", " Management Finance Admin-Clerical Event Planning Recreation \n", "0 2.000000 2.000000 2.000000 1.000000 1.000000 \n", "pct 0.008439 0.008439 0.008439 0.004219 0.004219 " ] }, "execution_count": 636, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'occupationalCategory') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T" ] }, { "cell_type": "code", "execution_count": 637, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Null Healthcare Engineering Analista \\\n", "0 25.000000 3.00000 3.00000 3.00000 \n", "pct 0.090253 0.01083 0.01083 0.01083 \n", "\n", " Sales / Business Development AP Mechanic Estagiário \\\n", "0 2.00000 2.00000 2.00000 \n", "pct 0.00722 0.00722 0.00722 \n", "\n", " \\n\\t\\t Political or Public Affairs\\n \\t Management \\\n", "0 2.00000 2.00000 \n", "pct 0.00722 0.00722 \n", "\n", " Retail / Wholesale \\\n", "0 2.00000 \n", "pct 0.00722 \n", "\n", " \\n Service Manager IT-Dienstleistungen ecommerce SaaS ITSM Design Manager Ausschreibung\\n \\n \\\n", "0 1.00000 \n", "pct 0.00361 \n", "\n", " \\n\\n Commercie / Verkoop\\n \\\n", "0 1.00000 \n", "pct 0.00361 \n", "\n", " Labor and Help \\\n", "0 1.00000 \n", "pct 0.00361 \n", "\n", " Weingarten / Au�enbetrieb \\\n", "0 1.00000 \n", "pct 0.00361 \n", "\n", " Pharmaceuticals,Medical Sales Representative Doktersassistent \\\n", "0 1.00000 1.00000 \n", "pct 0.00361 0.00361 \n", "\n", " \\n\\n Educational\\n Secretary / Front Office External Accountancy \\\n", "0 1.00000 1.00000 1.00000 \n", "pct 0.00361 0.00361 0.00361 \n", "\n", " Education Instruction \n", "0 1.00000 \n", "pct 0.00361 " ] }, "execution_count": 637, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/occupationalCategory') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### qualifications" ] }, { "cell_type": "code", "execution_count": 640, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 132}), Counter({str: 172, 'URI': 1}))" ] }, "execution_count": 640, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'qualifications')), Counter(extract_types(graphs, 'JobPosting/qualifications'))" ] }, { "cell_type": "code", "execution_count": 643, "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UNAVAILABLE Sie müssen Personaler eines Unternehmens sein \\\n", "0 19.000000 12.000000 9.000000 \n", "pct 0.143939 0.090909 0.068182 \n", "\n", " Ability to work in a team environment with members of varying skill levels. Highly motivated. Learns quickly. \n", "0 2.000000 \n", "pct 0.015152 " ] }, "execution_count": 643, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'qualifications') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(4).T" ] }, { "cell_type": "code", "execution_count": 646, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Null Semi Senior You must hold a BS degree \\\n", "0 23.000000 3.000000 2.000000 \n", "pct 0.106481 0.013889 0.009259 \n", "\n", " \\n Qualifications\\n Sigma Six\\n \n", "0 1.00000 \n", "pct 0.00463 " ] }, "execution_count": 646, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/qualifications') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(4).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### identifier" ] }, { "cell_type": "code", "execution_count": 650, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'http://schema.org/PropertyValue': 676,\n", " str: 71,\n", " int: 9,\n", " 'Unknown Object': 8}),\n", " Counter({str: 47,\n", " 'http://schema.org/PropertyValue': 149,\n", " 'Unknown Object': 1}))" ] }, "execution_count": 650, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'identifier')), Counter(extract_types(graphs, 'JobPosting/identifier'))" ] }, { "cell_type": "code", "execution_count": 693, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'http://schema.org/value': ['inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719'],\n", " 'http://schema.org/name': ['Anixter International'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],\n", " '_label': ['http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us']}],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],\n", " 'http://schema.org/name': ['Stage lopen bij Social Deal'],\n", " 'http://schema.org/value': [331661],\n", " '_label': ['http://stage.socialdeal.nl/o/stage-commerciele-economie-2']}],\n", " [{'http://schema.org/value': ['1262'],\n", 
" 'http://schema.org/name': ['Division Industrielle'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],\n", " '_label': ['https://dev.prim-web.com/jobs/view/montreal-machiniste-anglais-francais/xy6ml/po/kd1qx/fr']}]]" ] }, "execution_count": 693, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'identifier'))[:3]" ] }, { "cell_type": "code", "execution_count": 694, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['40165587'],\n", " ['39576074'],\n", " [{'http://schema.org/PropertyValue/name': ['Byrd'],\n", " 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],\n", " 'http://schema.org/PropertyValue/value': ['3841'],\n", " '_label': ['https://www.jobfluent.com/jobs/senior-fullstack-developer-berlin-21de6d?result=14']}]]" ] }, "execution_count": 694, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/identifier'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### salaryCurrency" ] }, { "cell_type": "code", "execution_count": 651, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 277}), Counter({str: 123}))" ] }, "execution_count": 651, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'salaryCurrency')), Counter(extract_types(graphs, 'JobPosting/salaryCurrency'))" ] }, { "cell_type": "code", "execution_count": 654, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " GBP USD AUD EUR € JPY \\\n", "0 117.000000 34.000000 33.000000 24.000000 12.000000 11.000000 \n", "pct 0.422383 0.122744 0.119134 0.086643 0.043321 0.039711 \n", "\n", " INR SGD THB HKD \n", "0 6.000000 4.00000 4.00000 4.00000 \n", "pct 0.021661 0.01444 0.01444 0.01444 " ] }, "execution_count": 654, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'salaryCurrency') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T" ] }, { "cell_type": "code", "execution_count": 695, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " CZK AUD GBP USD RUB RUR \\\n", "0 22.000000 17.000000 16.000000 13.000000 12.000000 8.000000 \n", "pct 0.177419 0.137097 0.129032 0.104839 0.096774 0.064516 \n", "\n", " руб. EUR Null USD \n", "0 4.000000 3.000000 3.000000 2.000000 \n", "pct 0.032258 0.024194 0.024194 0.016129 " ] }, "execution_count": 695, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/salaryCurrency') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### employmentType" ] }, { "cell_type": "code", "execution_count": 698, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 1505}), Counter({str: 1085, 'URI': 2}))" ] }, "execution_count": 698, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'employmentType')), Counter(extract_types(graphs, 'JobPosting/employmentType'))" ] }, { "cell_type": "code", "execution_count": 696, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " FULL_TIME Permanent PART_TIME OTHER CONTRACTOR Contract \\\n", "0 661.00000 214.000000 83.000000 66.000000 57.000000 50.000000 \n", "pct 0.41915 0.135701 0.052632 0.041852 0.036145 0.031706 \n", "\n", " Full Time TEMPORARY INTERN \n", "0 41.000000 32.000000 29.000000 28.000000 \n", "pct 0.025999 0.020292 0.018389 0.017755 " ] }, "execution_count": 696, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'employmentType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T" ] }, { "cell_type": "code", "execution_count": 697, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " FULL_TIME Paid Work Full Time Full-time Null Permanent \\\n", "0 176.000000 173.000000 92.000000 44.000000 37.000000 37.000000 \n", "pct 0.154386 0.151754 0.080702 0.038596 0.032456 0.032456 \n", "\n", " Contract CDI Temporary Vollzeit \n", "0 23.000000 20.000000 13.000000 12.000000 \n", "pct 0.020175 0.017544 0.011404 0.010526 " ] }, "execution_count": 697, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/employmentType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When it's multiple it's normally a listing" ] }, { "cell_type": "code", "execution_count": 584, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['アルバイト', '正社員'],\n", " ['PART_TIME', 'FULL_TIME'],\n", " ['OTHER', 'FULL_TIME'],\n", " ['CONTRACTOR', 'FULL_TIME'],\n", " ['PART_TIME', 'INTERN', 'OTHER'],\n", " ['FULL_TIME', 'OTHER'],\n", " ['PART_TIME', 'FULL_TIME'],\n", " ['CONTRACTOR', 'FULL_TIME', 'TEMPORARY'],\n", " ['PART_TIME', 'PERMANENT'],\n", " ['PART_TIME', 'FULL_TIME'],\n", " ['CONTRACTOR', 'FULL_TIME'],\n", " ['PART_TIME', 'FULL_TIME'],\n", " ['CONTRACTOR', 'FULL_TIME'],\n", " ['TEMPORARY', 'FULL_TIME'],\n", " ['CONTRACTOR', 'FULL_TIME'],\n", " ['PART_TIME', 'INTERNSHIP'],\n", " ['PART_TIME', 'FULL_TIME'],\n", " ['CONTRACTOR', 'PER_DIEM', 'FULL_TIME', 'PART_TIME'],\n", " ['INTERN', 'FULL_TIME'],\n", " ['CONTRACTOR', 'PART_TIME', 'FULL_TIME', 'TEMPORARY']]" ] }, "execution_count": 584, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(x for x in extract_property(json_graphs, 'employmentType') if len(x) > 1)[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### jobBenefits" ] }, { "cell_type": "code", "execution_count": 656, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 142}), Counter({str: 51}))" ] }, 
"execution_count": 656, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'jobBenefits')), Counter(extract_types(graphs, 'JobPosting/jobBenefits'))" ] }, { "cell_type": "code", "execution_count": 700, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['UNAVAILABLE'],\n", " ['待遇<br>◆車・バイク通勤OK\\u3000◆制服あり\\u3000◆昇給(規定有)\\u3000◆研修2~3ヶ月([P]900円、[A]大学生850円、高校生800円)'],\n", " ['VISION, SICK_DAYS, DOMESTIC_PARTNER, VACATION, DENTAL, LIFE_INSURANCE, PARENTAL_LEAVE, RETIREMENT_PLAN, MEDICAL'],\n", " [' < インセンティブ > \\n 業績連動賞与年3回(8月、12月、4月)\\n\\n < 諸手当 >\\n ・通勤交通費支給\\r\\n・自転車通勤補助金\\n\\n < 保険 >\\n社会保険制度あり\\n'],\n", " ['+bonus '],\n", " [''],\n", " ['Job Security, HRA, TA, DA'],\n", " ['Vale-transporte'],\n", " ['DWS Available'],\n", " ['Car or Car Allowance, Pension']]" ] }, "execution_count": 700, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'jobBenefits'))[:10]" ] }, { "cell_type": "code", "execution_count": 702, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['\\n Accent Jobs est parfaitement conscient que le marché du travail est constitué de différents groupes cibles chacun ayant ses propres souhaits et exigences.Nous gérons cette diversité en l?abordant à travers différents départements spécialisés.Ainsi nous pouvons aider chaque personne en connaissance de cause.Lors du processus de candidature nous jouons le rôle du coach pour vous apporter aide et conseil. Notre objectif? Vous aider à dénicher le job de vos rêves!\\n '],\n", " ['\\n All your information will be kept confidential according to EEO guidelines. '],\n", " ['Het startsalaris is €9,94 bruto per uur, exclusief vakantietoeslag en reiskostenvergoeding;Wil je graag veel werken, dat kan! 
Hier krijg je de mogelijkheid voor voorman of -vrouw of teamleider;Reiskostenvergoeding vanaf 10 km;Werken in een duurzaam bedrijf met de mooiste bloemen;Jij maakt deel uit van een gezellig en hardwerkend team.Kan jij niet wachten om aan de slag te gaan? Solliciteer dan vandaag nog! Wij nemen op werkdagen binnen 24 uur contact met je op om de sollicitatie met je te bespreken. Zijn we een match? Dan nodigen we je uit voor een gesprek op kantoor in Barendrecht. ']]" ] }, "execution_count": 702, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/jobBenefits'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### skills" ] }, { "cell_type": "code", "execution_count": 703, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 141}), Counter({str: 118, 'URI': 1}))" ] }, "execution_count": 703, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'skills')), Counter(extract_types(graphs, 'JobPosting/skills'))" ] }, { "cell_type": "code", "execution_count": 704, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Must be reasonably fit and good at talking to people'],\n", " ['UNAVAILABLE'],\n", " ['Branch Coordinator'],\n", " ['UNAVAILABLE'],\n", " ['以下すべてのご経験をお持ちの方からのご応募をおまちしています!\\n・何らかのシステム開発経験\\u3000実務3年以上\\n・PHP 実務3年以上\\n'],\n", " [''],\n", " ['JavaScript, Apple iOS, Android'],\n", " ['Klantvriendelijk, Representatief, Leergierig'],\n", " ['Computer Literacy_old, Agreeableness, Information gathering & synthesis, English comprehension, Customer Service Situation Handling'],\n", " ['scala', 'akka', 'node.js', 'functional-programming', 'java']]" ] }, "execution_count": 704, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'skills'))[:10]" ] }, { "cell_type": "code", "execution_count": 705, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[' ASP.net, Crystal 
reports, mobile app, MsSql Server, mvc '],\n", " ['Null'],\n", " ['VUE.js, ReactJS, Python, English, APIs, AngularJS, Agile']]" ] }, "execution_count": 705, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/skills'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### image" ] }, { "cell_type": "code", "execution_count": 663, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({'URI': 59,\n", " str: 1,\n", " 'http://schema.org/ImageObject': 45,\n", " 'Unknown Object': 1}),\n", " Counter({'URI': 167,\n", " 'http://schema.org/ImageObject': 5,\n", " str: 37,\n", " 'https://schema.org/ImageObject': 1}))" ] }, "execution_count": 663, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'image')), Counter(extract_types(graphs, 'JobPosting/image'))" ] }, { "cell_type": "code", "execution_count": 708, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['https://arbeit.nifty.com/arbeit_images/froma/05457077.jpg'],\n", " ['https://s3-ap-northeast-1.amazonaws.com/paiza-webapp/job_offers/photo1s/000/007/660/medium/img_uniaim_01.jpg?1564365756'],\n", " ['https://cfs.pokepara.jp/Pokepara/Images/shopc/shop6922/photo/q_420_300_man_search.jpg']]" ] }, "execution_count": 708, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'image'))[:3]" ] }, { "cell_type": "code", "execution_count": 707, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['https://chambermaster.blob.core.windows.net/images/customers/3079/members/641/jobs/170/JOB_MAIN/LiveWell_Logo.jpg'],\n", " [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],\n", " 'http://schema.org/ImageObject/url': ['https://weinjobs.de/index.php?mod=details&id=2459/thumbnails/67057945.jpg'],\n", " 'http://schema.org/ImageObject/width': ['200'],\n", " 'http://schema.org/ImageObject/height': ['250'],\n", 
" '_label': ['https://weinjobs.de/index.php?mod=details&id=2459']}],\n", " ['Null']]" ] }, "execution_count": 707, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/image'))[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### jobLocationType" ] }, { "cell_type": "code", "execution_count": 709, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 52}), Counter({str: 9}))" ] }, "execution_count": 709, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'jobLocationType')), Counter(extract_types(graphs, 'JobPosting/jobLocationType'))" ] }, { "cell_type": "code", "execution_count": 712, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " TELECOMMUTE am Arbeitsplatz (z.B. Büro)\n", "0 48.000000 3.000000 1.000000\n", "pct 0.923077 0.057692 0.019231" ] }, "execution_count": 712, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(json_graphs, 'jobLocationType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T" ] }, { "cell_type": "code", "execution_count": 713, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " TELECOMMUTE\n", "0 9.0\n", "pct 1.0" ] }, "execution_count": 713, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(x for xs in extract_property(graphs, 'JobPosting/jobLocationType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### incentiveCompensation" ] }, { "cell_type": "code", "execution_count": 714, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Counter({str: 47}), Counter({str: 14}))" ] }, "execution_count": 714, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(extract_types(json_graphs, 'incentiveCompensation')), Counter(extract_types(graphs, 'JobPosting/incentiveCompensation'))" ] }, { "cell_type": "code", "execution_count": 715, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[\"Wat bieden wij jou: De opdrachtgever biedt jouw een uidagende en afwisselende functie binnen een oganisatie die continu in beweging is. Je werkt met jonge gemotiveerde collega's met korte lijnen en veel eigen verantwoordelijkheid, waar medewerkers worden gestimuleerd zichzelf te ontwikkelen. 
Voor deze functie zoeken wij een enthousiaste verkoper voor 32 uur op de afdeling witgoed/huishoudelijk.\"],\n", " [''],\n", " ['Provides Equity'],\n", " [''],\n", " [''],\n", " [''],\n", " [''],\n", " ['Up to £9.75 per hour'],\n", " ['Expenses Covered'],\n", " ['1時間\\u30002500円']]" ] }, "execution_count": 715, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(json_graphs, 'incentiveCompensation'))[:10]" ] }, { "cell_type": "code", "execution_count": 717, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Unterkunft wird gestellt: Ja'],\n", " ['あり\\u3000前年度実績\\u3000年2回・計2.90月分'],\n", " ['Concentra is an Equal Opportunity Employer,\\xa0including disability/veterans'],\n", " ['$42,000 - $47,000 Base Salary (DOE) PLUS Bonus - None hourly'],\n", " ['\\n\\t\\t\\t\\t\\t\\t\\t\\t£24,000 plus location allowance where applicable\\t\\t\\t\\t\\t\\t\\t\\t'],\n", " ['\\nPartnership Opportunity:\\nUnknown\\n'],\n", " ['- Fulltime dienstverband;\\n- € 15,67 per uur (incl. reserveringen en o.b.v. 
ervaring);\\n- Goede bonusregeling (gemiddeld €1500 pm!);\\n- Doorgroeimogelijkheden;\\n- Borrels en teamuitjes.'],\n", " ['\\n Remuneration\\n Working for Optoma, you can expect a competitive salary with additional corporate benefits such as medical insurance, dental cover, pension and up to 27 days holiday per year - subject to service requirements.\\n\\n '],\n", " ['\\n -Оформление по ТК РФ.-График 5/2, с 08:00 до 17:00.-Предоставляется спецодежда, спецобувь и инструмент.-Для иногородних предоставляется общежитие.\\n '],\n", " ['Bonus, Uang Makan, Uang Bensin, THR']]" ] }, "execution_count": 717, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(extract_property(graphs, 'JobPosting/incentiveCompensation'))[:10]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }