{
"cells": [
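{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Stop Words**\n",
"**[Python code for listing the stop-word languages available in NLTK](https://stackoverflow.com/questions/48641112/why-the-num-of-languages-in-nltk-corpus-stop-words-is-different-depending-on-t)**\n",
"\n",
"## **1 Removing Stop Words**\n",
"Stop words are very common words (for example \"a\", \"the\", \"is\") that carry little meaning on their own and are usually filtered out before analysis."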
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preprocess the English text by converting it to lowercase\n",
"texts = 'I like such a Wonderful Snow Ice Cream'\n",
"texts = texts.lower()\n",
"texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split the text into tokens (word_tokenize requires the NLTK 'punkt' tokenizer data)\n",
"from nltk import word_tokenize\n",
"tokens = word_tokenize(texts)\n",
"tokens"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run once if the stop-word corpus is not installed yet (uncomment to download)\n",
"# import nltk\n",
"# nltk.download('stopwords')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the languages for which stop words are available\n",
"from nltk.corpus import stopwords\n",
"stopwords.fileids()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview every 18th English stop word\n",
"from nltk.corpus import stopwords\n",
"stopwords.words(\"english\")[::18]"
]
},
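{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the stop-word set once, then drop every token found in it\n",
"english_stops = set(stopwords.words(\"english\"))\n",
"tokens = [word for word in tokens if word not in english_stops]\n",
"print(tokens)"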
]
},
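{
"cell_type": "markdown",
"metadata": {},
"source": [
"The steps above (lowercasing, tokenizing, filtering) can be wrapped in a single function. The cell below is a minimal sketch: `remove_stop_words` is a hypothetical helper name, not part of NLTK, and it assumes the `punkt` and `stopwords` data are already downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk import word_tokenize\n",
"from nltk.corpus import stopwords\n",
"\n",
"def remove_stop_words(text, language=\"english\"):\n",
"    # Hypothetical helper: lowercase, tokenize, then drop stop words\n",
"    stop_set = set(stopwords.words(language))  # build the set once per call\n",
"    return [tok for tok in word_tokenize(text.lower()) if tok not in stop_set]\n",
"\n",
"remove_stop_words('I like such a Wonderful Snow Ice Cream')"
]
},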
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2 Korean Stop Words**\n",
"A publicly available list of 100 Korean stop words (the txt file also includes the idf values)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the Korean stop-word file (tab-separated columns)\n",
"with open('../data/stopword_kr.txt', 'r', encoding='utf-8') as f:\n",
"    s = f.read()\n",
"\n",
"# Keep the first three tab-separated fields of each non-empty line\n",
"stop_words = [txt.split('\\t')[:3] for txt in s.splitlines() if txt.strip()]\n",
"stop_words[:10]"
]
}
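,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rows loaded above can be used to filter tokens in the same way as the English list. The cell below is a minimal sketch, assuming the first tab-separated field of each row is the stop word itself. `remove_korean_stop_words` is a hypothetical helper, and the sample rows, sample sentence, and whitespace tokenization are illustrative only (real Korean text usually needs a morphological analyzer). With the real file, pass `stop_words` instead of `sample_rows`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_korean_stop_words(tokens, stop_rows):\n",
"    # Assumption: column 0 of each row holds the stop word; skip empty rows\n",
"    stop_set = {row[0] for row in stop_rows if row and row[0]}\n",
"    return [tok for tok in tokens if tok not in stop_set]\n",
"\n",
"# Illustrative rows standing in for the loaded file (not its real contents)\n",
"sample_rows = [['이'], ['그'], ['것']]\n",
"sample_tokens = '이 영화 는 그 것 보다 재미있다'.split()\n",
"remove_korean_stop_words(sample_tokens, sample_rows)"
]
}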
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}