Natural language processing (NLP) is the field of computer science focused on retrieving information from textual input generated by human beings.
Create a term frequency matrix
The simplest approach to the problem (and the most commonly used so far) is to split sentences into tokens. Simplifying, words have abstract and subjective meanings to the people using and receiving them, whereas tokens have an objective interpretation: an ordered sequence of characters (or bytes). Once sentences are split, the order of the tokens is disregarded. This approach to the problem is known as the bag-of-words model.
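To make the bag-of-words idea concrete, here is a minimal sketch in base R (no packages needed). It assumes tokens are separated by whitespace, and the helper name bag_of_words is made up for this illustration. Two sentences that contain the same words in a different order end up with identical bags.

```r
# Hypothetical helper: split a sentence on whitespace and count each token.
bag_of_words <- function(sentence) {
  tokens <- strsplit(sentence, "\\s+")[[1]]
  # table() orders its names alphabetically, so the original token order is lost
  table(tokens)
}

bag_a <- bag_of_words("doctors treat patients")
bag_b <- bag_of_words("patients treat doctors")

# Same tokens, same counts: the two bags cannot be told apart
same_bag <- identical(names(bag_a), names(bag_b)) &&
  all(as.vector(bag_a) == as.vector(bag_b))
print(same_bag)
```

The comparison is done on names and counts separately so that it does not depend on any other attribute of the table objects.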
A term frequency is a dictionary in which a weight is assigned to each token. In the first example, we construct a term frequency matrix from a corpus (a collection of documents) with the R package tm:
require(tm)
doc1 <- "drugs hospitals doctors"
doc2 <- "smog pollution environment"
doc3 <- "doctors hospitals healthcare"
doc4 <- "pollution environment water"
corpus <- c(doc1, doc2, doc3, doc4)
tm_corpus <- Corpus(VectorSource(corpus))
In this example, we created a corpus of class Corpus, defined by the package tm, with two functions: Corpus, which builds the corpus, and VectorSource, which returns a VectorSource object from a character vector. The object tm_corpus is a list of our documents with additional (and optional) metadata to describe each document.
str(tm_corpus)
List of 4
 $ 1:List of 2
  ..$ content: chr "drugs hospitals doctors"
  ..$ meta   :List of 7
  .. ..$ author       : chr(0)
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-03 00:31:34"
  .. ..$ description  : chr(0)
  .. ..$ heading      : chr(0)
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0)
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[truncated]
Once we have a Corpus, we can preprocess the tokens it contains to improve the quality of the final output (the term frequency matrix). To do this we use the function tm_map, which, similarly to the apply family of functions, transforms the documents in the corpus by applying a function to each document.
tm_corpus <- tm_map(tm_corpus, tolower)
tm_corpus <- tm_map(tm_corpus, removeWords, stopwords("english"))
tm_corpus <- tm_map(tm_corpus, removeNumbers)
tm_corpus <- tm_map(tm_corpus, PlainTextDocument)
tm_corpus <- tm_map(tm_corpus, stemDocument, language="english")
tm_corpus <- tm_map(tm_corpus, stripWhitespace)
tm_corpus <- tm_map(tm_corpus, PlainTextDocument)
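The analogy with the apply family can be made explicit with a sketch in base R. This is not how tm implements tm_map; it only mimics three of the steps above (lower-casing, number removal, whitespace stripping) on a plain character vector. The input strings and the helper name apply_step are assumptions made for illustration.

```r
# Toy documents (made up for this sketch) with mixed case, digits and extra spaces
docs <- c("Drugs  Hospitals 2 Doctors", "SMOG pollution   42 environment")

# Hypothetical helper: apply one transformation to every document,
# in the spirit of tm_map applying a function to each document of a corpus
apply_step <- function(documents, f) {
  vapply(documents, f, character(1), USE.NAMES = FALSE)
}

docs <- apply_step(docs, tolower)                             # lower-case
docs <- apply_step(docs, function(d) gsub("[0-9]+", "", d))   # drop digits
docs <- apply_step(docs, function(d) gsub("\\s+", " ", d))    # collapse runs of whitespace
docs <- apply_step(docs, trimws)                              # trim both ends

print(docs)
```

Each step takes the whole collection and returns a transformed collection, which is why the calls can be chained by reassigning the same variable, just as in the tm_map sequence.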
Following these transformations, we finally create the term frequency matrix with
tdm <- TermDocumentMatrix(tm_corpus)
which gives a
<<TermDocumentMatrix (terms: 8, documents: 4)>>
Non-/sparse entries: 12/20
Sparsity           : 62%
Maximal term length: 9
Weighting          : term frequency (tf)
that we can view by transforming it to a matrix
as.matrix(tdm)
           Docs
Terms       character(0) character(0) character(0) character(0)
  doctor               1            0            1            0
  drug                 1            0            0            0
  environ              0            1            0            1
  healthcar            0            0            1            0
  hospit               1            0            1            0
  pollut               0            1            0            1
  smog                 0            1            0            0
  water                0            0            0            1
Each row gives the frequency of one token in each document (4 documents, 4 columns). As you may have noticed, the tokens have been stemmed (e.g. environ).
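As a cross-check, the same table can be rebuilt by hand in base R. The sketch below starts from the stemmed tokens shown in the matrix above (typed in manually, not produced by a stemmer), counts each term per document, and recomputes the non-sparse/sparse figures reported by tm. The document labels doc1 to doc4 and the variable names are choices made for this example.

```r
# Stemmed tokens per document, copied by hand from the matrix above
stemmed <- list(
  doc1 = c("drug", "hospit", "doctor"),
  doc2 = c("smog", "pollut", "environ"),
  doc3 = c("doctor", "hospit", "healthcar"),
  doc4 = c("pollut", "environ", "water")
)

# Vocabulary: every distinct term, sorted alphabetically
terms <- sort(unique(unlist(stemmed)))

# One column per document: how many times each term occurs in it
tdm_manual <- sapply(stemmed, function(tokens) {
  as.vector(table(factor(tokens, levels = terms)))
})
rownames(tdm_manual) <- terms
print(tdm_manual)

# Sparsity is the share of cells equal to zero
n_nonzero <- sum(tdm_manual != 0)
n_zero <- sum(tdm_manual == 0)
sparsity <- n_zero / length(tdm_manual)
print(c(nonzero = n_nonzero, zero = n_zero))
print(sparsity)
```

With 8 terms and 4 documents there are 32 cells, 12 of them non-zero and 20 zero, so the sparsity is 20/32 = 0.625, which the tm summary displays as 62%.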
In the previous lines, we have weighted each token/document pair with the absolute frequency (i.e. the number of instances of the token that appear in the document).
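In the toy corpus every token occurs at most once per document, so all weights are 0 or 1. A small base R sketch with a made-up document in which one token repeats shows what the absolute frequency looks like when counts exceed one.

```r
# A made-up document in which one token occurs more than once
tokens <- strsplit("pollution water pollution smog pollution", " ")[[1]]

# Absolute frequency: number of occurrences of each token in the document
abs_freq <- table(tokens)
print(abs_freq)
```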