## Make a new `JournalCrawler` (soup)

<div class="frame" style="border: solid 1.0px #000000; padding: 0.5em 1em; margin: 2em 0;">
    <h3 style="color: #880000; text-decoration: underline">Where you need to update</h3>
    <ul>
        <li><code>gummy.utils.journal_utils.py</code></li>
        <li><code>gummy.journals.py</code></li>
        <li><code>tests.data.py</code></li>
        <li><a href="https://github.com/iwasakishuto/Translation-Gummy/wiki/Supported-journals">Wiki</a></li>
    </ul>
</div>

You can create a new `JournalCrawler` whose `crawl_type` is **"soup"**.

In [1]:
from gummy.utils import get_driver
from gummy.journals import *

[success] local driver can be built.
[failure] remote driver can't be built.
DRIVER_TYPE: local


In [2]:
class GoogleJournal(GummyAbstJournal):
    pass
# Name the instance `self` so the cells below read like code inside a method.
self = GoogleJournal()

In [3]:
def get_soup_driver(url):
    """Fetch the page through a web driver and return (soup, canonical URL)."""
    with get_driver() as driver:
        soup = self.get_soup_source(url=url, driver=driver)
        cano_url = canonicalize(url=url, driver=driver)
    return soup, cano_url

In [4]:
def get_soup(url):
    """Fetch the page without a web driver and return (soup, canonical URL)."""
    cano_url = canonicalize(url=url, driver=None)
    soup = self.get_soup_source(url=url, driver=None)
    return soup, cano_url

In [5]:
url = input()

https://www.google.com/


## Create `get_contents_soup`

### With a driver

In [6]:
soup, cano_url = get_soup_driver(url)
self._store_crawling_logs(cano_url=cano_url)
print(f"canonicalized URL: {toBLUE(cano_url)}")

Use UselessGateWay._pass2others method.
Wait up to 3[s] for all page elements to load.
Scroll down to the bottom of the page.

Decompose unnecessary tags to make it easy to parse.
Decomposed <i> tag (0)
Decomposed <link> tag (1)
Decomposed <meta> tag (4)
Decomposed <noscript> tag (0)
Decomposed <script> tag (13)
Decomposed <style> tag (24)
Decomposed <sup> tag (0)
Decomposed <None> tag (0)
canonicalized URL: https://www.google.com/


#### `get_title_from_soup`

In [7]:
title = find_target_text(soup=soup, name="div", attrs={"id": "SIvCob"}, strip=True, default=self.default_title)
print(f"title: {toGREEN(title)}")

title: Google 検索は次の言語でもご利用いただけます: English
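
The `default` argument only matters when the target element is missing from the page. The sketch below illustrates that lookup-with-fallback pattern using nothing but the standard library. It is not gummy's `find_target_text`: the names `_TargetTextParser` and `find_text_or_default` are hypothetical stand-ins, and the HTML snippets are made up for illustration.

```python
from html.parser import HTMLParser


class _TargetTextParser(HTMLParser):
    """Collect the text inside the first element matching a tag name and attributes."""

    def __init__(self, target_name, target_attrs):
        super().__init__()
        self.target_name = target_name
        self.target_attrs = target_attrs
        self.depth = 0       # > 0 while inside the matched element
        self.found = False   # True once a matching element has been opened
        self.done = False    # True once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Count nested tags of the same name so the right end tag closes the match.
            if tag == self.target_name:
                self.depth += 1
        elif tag == self.target_name and all(
            dict(attrs).get(key) == value for key, value in self.target_attrs.items()
        ):
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.target_name:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def find_text_or_default(html, name, attrs, default):
    """Return the stripped text of the first matching element, or `default` if none matches."""
    parser = _TargetTextParser(name, attrs)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else default


html_with = '<html><body><div id="SIvCob">Hello <a href="#">World</a></div></body></html>'
html_without = "<html><body><p>nothing</p></body></html>"

print(find_text_or_default(html_with, "div", {"id": "SIvCob"}, "fallback"))
print(find_text_or_default(html_without, "div", {"id": "SIvCob"}, "fallback"))
```

The first call prints the element text and the second prints the fallback, which is the behavior to expect whenever a selector does not match the fetched HTML.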


#### `get_sections_from_soup`

In [8]:
sections = soup.find_all(name="center")
print(f"num sections: {toBLUE(len(sections))}")

num sections: 3


#### `get_head_from_section`

In [9]:
def get_head_from_section(section):
    head = section.find(name="input")
    return head

self.get_head_from_section = get_head_from_section
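
The cell above assigns a plain function to an attribute of the instance. In Python, a function stored on an instance is not bound, so it receives only `section` and no `self`; once the same function is moved into a class body it becomes a method and needs `self` as its first parameter. A minimal sketch of that difference, using a hypothetical `Journal` class rather than the real `GummyAbstJournal`:

```python
class Journal:
    """Stand-in for a crawler class; not the real GummyAbstJournal."""

    def get_head_from_section(self, section):
        # Defined in the class body: a bound method that receives `self`.
        return None

    def get_contents(self, sections):
        # Attribute lookup finds an instance-level override before the class method.
        return [self.get_head_from_section(section) for section in sections]


def get_head_from_section(section):
    # Plain function: when stored on an instance it is NOT bound, so no `self`.
    return section.upper()


journal = Journal()
journal.get_head_from_section = get_head_from_section  # instance-level override

print(journal.get_contents(["intro", "methods"]))  # → ['INTRO', 'METHODS']
print(Journal().get_contents(["intro"]))           # → [None]
```

This is convenient for prototyping in a notebook, but remember to add `self` to the signature when copying the function into the class in `gummy/journals.py`.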

In [10]:
contents = self.get_contents_from_soup_sections(sections)


Show contents of the paper.
[1/3] 
[2/3] 
[3/3] 


### Without a driver

In [11]:
soup, cano_url = get_soup(url)
self._store_crawling_logs(cano_url=cano_url)
print(f"canonicalized URL: {toBLUE(cano_url)}")

Get HTML content from https://www.google.com/

Decompose unnecessary tags to make it easy to parse.
Decomposed <i> tag (0)
Decomposed <link> tag (0)
Decomposed <meta> tag (4)
Decomposed <noscript> tag (0)
Decomposed <script> tag (6)
Decomposed <style> tag (2)
Decomposed <sup> tag (0)
Decomposed <None> tag (0)
canonicalized URL: https://www.google.com/


#### `get_title_from_soup`

In [12]:
title = find_target_text(soup=soup, name="div", attrs={"id": "SIvCob"}, strip=True, default=self.default_title)
print(f"title: {toGREEN(title)}")

title: 2020-09-30@17.42.18


#### `get_sections_from_soup`

In [13]:
sections = soup.find_all(name="center")
print(f"num sections: {toBLUE(len(sections))}")

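num sections: 1

Note that the HTML fetched without a driver differs from the driver-rendered page: the `div` with `id="SIvCob"` is apparently not found, so the title falls back to `self.default_title`, and only one `<center>` element is present instead of three.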


#### `get_head_from_section`

In [14]:
def get_head_from_section(section):
    head = section.find(name="input")
    return head

self.get_head_from_section = get_head_from_section

In [15]:
contents = self.get_contents_from_soup_sections(sections)


Show contents of the paper.
[1/1] 


***

## Confirmation

<font color="red"><b>NOTE:</b></font> You also have to modify these variables:

- [`gummy.journals.TranslationGummyJournalCrawlers`](https://github.com/iwasakishuto/Translation-Gummy/blob/master/gummy/journals.py)
- [`gummy.utils.journal_utils.DOMAIN2JOURNAL`](https://github.com/iwasakishuto/Translation-Gummy/blob/master/gummy/utils/journal_utils.py)
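
The reason both variables need an entry can be sketched as a two-step lookup: the URL's domain selects a journal name, and the journal name selects a crawler class. The code below is only an illustration under that assumption. The dictionaries `domain2journal` and `crawlers`, the function `select_crawler`, and the key formats are hypothetical stand-ins; check the linked source files for the actual structure of `DOMAIN2JOURNAL` and `TranslationGummyJournalCrawlers`.

```python
from urllib.parse import urlparse


class GoogleJournal:
    """Placeholder crawler class for illustration only."""


# Hypothetical stand-ins for the two registries; real keys and values may differ.
domain2journal = {"www.google.com": "GoogleJournal"}
crawlers = {"googlejournal": GoogleJournal}


def select_crawler(url):
    """Map a URL to a crawler class via its domain (assumed two-step lookup)."""
    domain = urlparse(url).netloc
    journal_name = domain2journal.get(domain)
    if journal_name is None:
        raise KeyError(f"No journal registered for domain {domain!r}")
    return crawlers[journal_name.lower()]


print(select_crawler("https://www.google.com/").__name__)
```

If either registry is missing the new entry, a lookup of this kind fails before the crawler is ever run, which is why the confirmation step below only works after both are updated.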

In [16]:
from gummy import TranslationGummy

In [17]:
# model = TranslationGummy()
# model.toPDF(url=url)

If the run succeeds, update these as well:

- [Wiki: Supported journals](https://github.com/iwasakishuto/Translation-Gummy/wiki/Supported-journals)
- [tests.data](https://github.com/iwasakishuto/Translation-Gummy/blob/master/tests/data.py)