(2015) Part 1. Working with text data in scikit-learn (documentation translation) Data Mining

VAL

Date 11.03.2019 16:47

Master, proFAN of love

Group: Administrators
Posts: 38059
User No.: 1
Registered: 6.03.2004

(2015) Working with text data in scikit-learn (documentation translation), part 1
Sources:
- https://habr.com/en/post/264339/ - Part 1.
- https://habr.com/en/post/266025/ - Part 2.
- https://scikit-learn.org/0.15/tutorial/text..._text_data.html - original source

QUOTE

The goal of this chapter is to explore some of the most important tools in scikit-learn on a single practical task: analyzing a collection of text documents (newsgroup posts) on 20 different topics.
In this chapter we will see how to:

- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train a linear model to perform categorization
- use a grid search strategy to find a good configuration for both the feature extraction and the classifier

--------------------

www.valinfo.ru
Always...
Quod licet jovi, non licet bovi!

VAL

Date 11.03.2019 16:47

QUOTE

Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

Bags of words

The most intuitive way to do so is the bags of words representation:

- assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
- for each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.
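
The two steps above can be sketched in plain Python. This is only an illustration, not the scikit-learn implementation: the helper names build_vocabulary and count_matrix, the whitespace tokenization, and the two toy documents are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign a fixed integer id to each distinct word, in order of first appearance."""
    vocabulary = {}
    for doc in documents:
        for word in doc.lower().split():  # naive whitespace tokenization (assumption)
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def count_matrix(documents, vocabulary):
    """Return a dense list of lists X where X[i][j] counts word j in document i."""
    X = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for word, count in Counter(doc.lower().split()).items():
            X[i][vocabulary[word]] = count
    return X


# Toy corpus, invented for illustration.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
X = count_matrix(docs, vocab)
```

Here n_features equals len(vocab), and the word "the" (index 0) gets the value 2 in both rows because it occurs twice in each document.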

The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.

If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4 GB of RAM, which is barely manageable on today's computers.
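
The arithmetic behind that figure can be checked with a few lines of Python. The function name dense_size_bytes is made up for this sketch, and the result is reported in decimal gigabytes (10**9 bytes), which is the unit the estimate above appears to use.

```python
def dense_size_bytes(n_samples, n_features, bytes_per_value=4):
    """Bytes needed to store every cell of a dense n_samples x n_features array."""
    return n_samples * n_features * bytes_per_value


# Figures from the text: 10,000 documents, 100,000 distinct words, float32 (4 bytes).
size_bytes = dense_size_bytes(10_000, 100_000)
size_gb = size_bytes / 10**9  # decimal gigabytes
print(f"{size_bytes:,} bytes = {size_gb:.1f} GB")
```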

Fortunately, most values in X will be zeros, since a given document uses fewer than a couple of thousand distinct words. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by storing only the non-zero parts of the feature vectors.
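
To make the saving concrete, here is a rough plain-Python sketch that keeps one dictionary per document and stores only the non-zero counts. The helper sparse_rows, the whitespace tokenization, and the toy documents are assumptions for illustration; real sparse formats are more compact than Python dictionaries, but the idea is the same.

```python
from collections import Counter


def sparse_rows(documents, vocabulary):
    """One dict per document mapping column index -> count; zeros are never stored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append({vocabulary[word]: count for word, count in counts.items()})
    return rows


# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Word -> integer index, in order of first appearance.
vocabulary = {}
for doc in docs:
    for word in doc.lower().split():
        vocabulary.setdefault(word, len(vocabulary))

rows = sparse_rows(docs, vocabulary)
stored_entries = sum(len(row) for row in rows)  # non-zero cells only
dense_cells = len(docs) * len(vocabulary)       # cells a dense matrix would hold
print(f"stored {stored_entries} of {dense_cells} cells")
```

Even on three short documents fewer than half of the cells are non-zero; with a vocabulary of 100,000 words the fraction becomes far smaller.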

scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
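
As a small sketch of that built-in support, assuming scikit-learn and SciPy are installed: CountVectorizer from sklearn.feature_extraction.text returns its word counts as a scipy.sparse matrix. The toy documents are invented for the example, and the default settings are used (lowercasing, tokens of two or more characters).

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of shape (n_samples, n_features)

the_column = vectorizer.vocabulary_["the"]  # column index assigned to the word "the"
print(sparse.issparse(X), X.shape, X.nnz)
print(X[0, the_column])
```

X.nnz is the number of explicitly stored non-zero entries, which is what determines the memory used, rather than the full n_samples x n_features product.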

VAL

Date 11.03.2019 16:48

Working With Text Data

- Tutorial setup
- Loading the 20 newsgroups dataset
- Extracting features from text files
  - Bags of words
  - Tokenizing text with scikit-learn
  - From occurrences to frequencies
- Training a classifier
- Building a pipeline
- Evaluation of the performance on the test set
- Parameter tuning using grid search
- Exercises
  - Exercise 1: Language identification
  - Exercise 2: Sentiment Analysis on movie reviews
  - Exercise 3: CLI text classification utility
- Where to from here

Working With Text Data

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analysing a collection of text documents (newsgroups posts) on twenty different topics.

In this section we will see how to:

- load the file contents and the categories
- extract feature vectors suitable for machine learning

VAL

Date 9.11.2019 19:32

:doh:
