Package `textcl`

TextCL

Introduction

The TextCL package aims to clean text data for later use in Natural Language Processing tasks. It can be used as an initial step in text analysis as well as in predictive, classification or text generation models.

The quality of the models strongly depends on the quality of the input data. Common problems in the data sets include:

If the data come from an optical character recognition (OCR) platform, text in tables and columns is usually not processed correctly and will add noise to the models.
Some parts of large texts may contain sentences in languages other than the model's target language; these have to be filtered out.
Real-world texts often have duplicated sentences due to the use of templates. In text generation tasks, this can cause model overfitting and duplications in generated texts or summaries.
Data sets may contain text that is different from the main topic, such as a weather forecast in an accounting report.

Features

The TextCL package allows the user to perform the following text pre-processing tasks:

Split texts into sentences.
Language filtering, for removing sentences that are not in the target language.
Perplexity filtering, for removing linguistically unconnected sentences, which can be produced by OCR modules. For example: Sustainability Report 2019 36 3%?!353? 1. 5°C 1} 33%.
Duplicate sentence filtering using Jaccard similarity, for removing duplicate sentences from the text.
Unsupervised outlier detection for revealing texts that are outside of the main data set topic distribution. Three methods are included with the package for this purpose:
TONMF: Block Coordinate Descent Framework (source article, MATLAB implementation)
RPCA: Robust Principal Component Analysis (source article, Python implementation)
SVD: Singular Value Decomposition (based on the NumPy SVD implementation)
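
To make the duplicate-filtering idea from the list above more concrete, here is a minimal standalone sketch of Jaccard-based filtering. It is not TextCL's own implementation: the helper names (tokenize, jaccard_similarity, drop_near_duplicates), the word-level tokenization and the 0.8 threshold are all assumptions chosen for this example.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_similarity(a, b):
    """Return |A & B| / |A | B| for two token sets.

    Two empty sets are treated as identical (similarity 1.0).
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is below the similarity threshold
    against every sentence that has already been kept.

    The threshold value is an assumption for this sketch.
    """
    kept = []
    kept_tokens = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if all(jaccard_similarity(tokens, t) < threshold for t in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(tokens)
    return kept


sentences = [
    "The company reduced emissions in 2019.",
    "The company reduced emissions in 2019!",  # same words, template repeat
    "Revenue grew by ten percent.",
]
filtered = drop_near_duplicates(sentences)
```

The pairwise comparison is quadratic in the number of sentences, which is acceptable for a sketch but may need a smarter strategy on large data sets.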

Documentation

Requirements

Python >= 3.6
pytorch_pretrained_bert >= 0.6.2
langdetect >= 1.0.8
numpy >= 1.16.5, < 1.20.0
pandas >= 1.0.3
lxml >= 4.6.2
protobuf >= 3.14.0
nltk >= 3.4.5

How to install

From PyPI

pip install textcl

From source/GitHub

pip install git+https://github.com/alinapetukhova/textcl.git#egg=textcl

License

MIT License

Developer's guide

Contributing to TextCL is easy. First, clone this repository and cd into the project's folder:

git clone https://github.com/alinapetukhova/textcl.git
cd textcl

Then create a virtual development environment to test and experiment with the package:

python3 -m venv env
source env/bin/activate
pip install -e .

The pytest, pytest-cov and pdoc3 packages are required for testing TextCL and generating its documentation:

pip install pytest pytest-cov pdoc3

Running the unit tests can be done with the following command from the project's root folder:

pytest

To check test coverage, execute the following command:

pytest --cov=textcl --cov-report=html

Project documentation can be generated with pdoc3. For example, running the following command in the project's root folder generates the HTML documentation and places it in the docs folder:

pdoc3 --html --output-dir docs textcl/

"""
.. include:: ../README.md
.. include:: ../doc/devguide.md
"""

from .preprocessing import split_into_sentences
from .preprocessing import language_filtering
from .preprocessing import jaccard_sim_filtering
from .preprocessing import perplexity_filtering
from .preprocessing import join_sentences_by_label
from .outliers_detection import outlier_detection
from .outliers_detection import tonmf
from .outliers_detection import rpca_implementation
from .outliers_detection import svd

Sub-modules

textcl.outliers_detection: This module contains functions for performing unsupervised outlier detection.
textcl.preprocessing: This module contains functions for general text preprocessing and filtering.
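
As a rough intuition for what SVD-based unsupervised outlier detection can look like, the sketch below scores each text by how poorly a low-rank approximation of a bag-of-words matrix reconstructs it. This is an illustrative assumption, not the code in textcl.outliers_detection: the function names (bag_of_words, svd_outlier_scores), the whitespace tokenization and the choice of rank 1 are hypothetical.

```python
import numpy as np


def bag_of_words(docs):
    """Build a documents-by-terms count matrix from whitespace tokens."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X


def svd_outlier_scores(X, rank=1):
    """Score each row by its reconstruction error under a rank-`rank`
    truncated SVD. Rows that the dominant topic directions cannot
    explain receive higher scores.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return np.linalg.norm(X - low_rank, axis=1)


docs = [
    "revenue profit tax revenue",
    "profit tax revenue audit",
    "tax audit revenue profit",
    "rain wind cloud storm",  # off-topic text, e.g. a weather forecast
]
scores = svd_outlier_scores(bag_of_words(docs), rank=1)
most_unusual = int(np.argmax(scores))
```

In this toy data set the weather text shares no vocabulary with the accounting texts, so the dominant singular direction does not cover it and it ends up with the largest score. The rank has to stay below the number of distinct topics, otherwise the off-topic text is reconstructed as well.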