
Machine Learning is super powerful if your data is numeric. What do you do, however, if you want to mine text data to discover hidden insights or to predict the sentiment of the text? What, for example, if you wanted to identify a post on a social media site as cyber bullying?

The first concept to be aware of is a Bag of Words. When training a model or classifier to identify documents of different types, a bag of words approach is a commonly used, but basic, method to help determine a document’s class. A bag of words is a representation of text as a set of independent words with no relationship to each other. It is called a “bag” of words because any information about the order or structure of the words in the document is discarded: the model is only concerned with whether known words occur in the document, not where in the document. It is simply a measure of the presence of known words.

Take, for example, the phrase “The cat in the hat sat in the window”. Phrases like this can be broken down into a vector representation with a simple measure of the count of the number of times each word appears in the document (phrase). The resulting count vectors could then be used as input into your data mining model.
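
To make that concrete, here is a minimal sketch of such a count vector for that single phrase, using nothing more than the Python standard library:

```python
from collections import Counter

phrase = "The cat in the hat sat in the window"

# Lower-case the phrase and split it on white-space to get the individual words
tokens = phrase.lower().split()

# The bag of words is simply the count of each distinct word; word order is lost
bag_of_words = Counter(tokens)

print(bag_of_words)
# Counter({'the': 3, 'in': 2, 'cat': 1, 'hat': 1, 'sat': 1, 'window': 1})
```

In practice you would build these counts over a fixed vocabulary so that every document maps to a vector of the same length.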

A more sophisticated way to analyse text is to use a measure called Term Frequency – Inverse Document Frequency (TF-IDF). Term Frequency (TF) is the number of times a word appears in a document. This means that the more times a word appears in a document, the larger its value for TF will get. The TF weighting of a word in a document shows its importance within that single document. Inverse Document Frequency (IDF) then shows the importance of a word within the entire collection of documents, or corpus. The nature of the IDF value is such that terms which appear in a lot of documents will have a lower score or weight, while terms that only appear in a single document, or in a small percentage of the documents, will receive a higher score. This higher score makes such a word a good discriminator between documents.

In its most common form, the TF-IDF weight for a word i in document j is given as w(i, j) = tf(i, j) × log(N / df(i)), where tf(i, j) is the number of times word i appears in document j, df(i) is the number of documents containing word i, and N is the total number of documents in the corpus. A detailed background and explanation of TF-IDF, including some Python examples, is given here: Analyzing Documents with TF-IDF. Suffice it to say that TF-IDF will assign a value to every word in every document you want to analyse and, the higher the TF-IDF value, the more important or predictive the word will typically be.

However, before you can use TF-IDF you need to clean up your text data. But why do we need to clean text, can we not just eat it straight out of the tin? The answer is yes, if you want to you can use the raw data exactly as you’ve received it; however, cleaning your data will increase the accuracy of your model.

This guide is a very basic introduction to some of the approaches used in cleaning text data. Some techniques are simple, some more advanced. For the more advanced concepts, consider their inclusion here as pointers for further personal research.

In the following sections I’m assuming that you have plain text and that it is not embedded in HTML or Markdown or anything like that. If your data is embedded in HTML, for example, you could look at using a package like BeautifulSoup to get access to the raw text before proceeding, and a Markdown parser could be used in the same way if your text is stored in Markdown.
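
As a minimal sketch of that HTML case, assuming a small made-up fragment of markup, the raw text can be pulled out like this before any cleaning starts:

```python
from bs4 import BeautifulSoup

# A made-up fragment of HTML standing in for your real data
html = "<html><body><h1>My post</h1><p>The cat sat in the window.</p></body></html>"

# Parse the markup and keep only the human-readable text
soup = BeautifulSoup(html, "html.parser")
raw_text = soup.get_text(separator=" ", strip=True)

print(raw_text)  # My post The cat sat in the window.
```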

Typically the first thing to do is to tokenise the text. This is just a fancy way of saying split the data into individual words that can be processed separately. Tokenisation is also usually as simple as splitting the text on white-space. It’s important to know how you want to represent your text when it is divided into blocks. By this I mean, are you tokenising and grouping together all the words on a line, in a sentence, in a paragraph or in a document? The simplest assumption is that each line in a file represents a group of tokens, but you need to verify this assumption. BTW, I said you should do this first. I lied.
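
As a minimal sketch, assuming a made-up two-line piece of plain text, the difference between grouping tokens by document and grouping them by line is just a matter of how you split:

```python
# Two lines of made-up plain text
text = """The cat sat in the window.
The dog slept on the mat."""

# Treat the whole document as a single group of tokens
document_tokens = text.split()

# Treat each line in the file as its own group of tokens
line_tokens = [line.split() for line in text.splitlines()]

print(document_tokens)  # ['The', 'cat', 'sat', ..., 'mat.']
print(line_tokens)      # one list of tokens per line
```

Note that simple white-space splitting keeps punctuation attached to the words (for example “window.”).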
