bdewilde.github.io - Intro to Natural Language Processing (2)
Search Preview

Intro to Natural Language Processing (2)

bdewilde.github.io
data scientist / physicist / filmmaker

SEO audit: Content analysis

Language: Error! No language localisation is found.
Title: Intro to Natural Language Processing (2)
Text / HTML ratio: 52 %
Frame: Excellent! The website does not use iFrame solutions.
Flash: Excellent! The website does not have any flash contents.
Keywords cloud: = text words NLP word noun import NLTK POS sentence article ' it’s normalize pipeline post Python return def thing
Keywords consistency
Keyword Content Title Description Headings
= 28
text 27
words 17
NLP 13
word 12
noun 10
Headings
H1 H2 H3 H4 H5 H6
1 0 0 0 0 0
Images We found 0 images on this web page.

SEO Keywords (Single)

Keyword Occurrence Density
= 28 1.40 %
text 27 1.35 %
words 17 0.85 %
NLP 13 0.65 %
word 12 0.60 %
noun 10 0.50 %
import 10 0.50 %
NLTK 10 0.50 %
POS 7 0.35 %
sentence 7 0.35 %
article 6 0.30 %
' 6 0.30 %
it’s 6 0.30 %
normalize 6 0.30 %
pipeline 5 0.25 %
post 5 0.25 %
Python 5 0.25 %
return 5 0.25 %
def 5 0.25 %
thing 5 0.25 %

SEO Keywords (Two Word)

Keyword Occurrence Density
text = 6 0.30 %
in the 5 0.25 %
natural language 4 0.20 %
a terrible 4 0.20 %
of the 4 0.20 %
to waste 4 0.20 %
thing to 4 0.20 %
terrible thing 4 0.20 %
normalize == 4 0.20 %
which is 4 0.20 %
and if 4 0.20 %
on the 4 0.20 %
in a 4 0.20 %
my next 3 0.15 %
process of 3 0.15 %
the process 3 0.15 %
is the 3 0.15 %
for a 3 0.15 %
' ' 3 0.15 %
with NLTK 3 0.15 %

SEO Keywords (Three Word)

Keyword Occurrence Density Possible Spam
terrible thing to 4 0.20 % No
a terrible thing 4 0.20 % No
thing to waste 4 0.20 % No
' ' text 3 0.15 % No
the process of 3 0.15 % No
Natural Language Processing 3 0.15 % No
do the same 2 0.10 % No
wake up and 2 0.10 % No
up and get 2 0.10 % No
and get our 2 0.10 % No
get our act 2 0.10 % No
our act together 2 0.10 % No
act together as 2 0.10 % No
together as a 2 0.10 % No
as a country 2 0.10 % No
and if the 2 0.10 % No
going to regret 2 0.10 % No
all really going 2 0.10 % No
really going to 2 0.10 % No
and if we 2 0.10 % No

SEO Keywords (Four Word)

Keyword Occurrence Density Possible Spam
a terrible thing to 4 0.20 % No
terrible thing to waste 4 0.20 % No
our act together as 2 0.10 % No
to waste and as 2 0.10 % No
going to regret it 2 0.10 % No
about what a relative 2 0.10 % No
what a relative luxury 2 0.10 % No
tokenize each sentence into 2 0.10 % No
at the world today 2 0.10 % No
look at the world 2 0.10 % No
in the form of 2 0.10 % No
also a terrible thing 2 0.10 % No
thing to waste and 2 0.10 % No
all really going to 2 0.10 % No
is also a terrible 2 0.10 % No
the process of extracting 2 0.10 % No
is a terrible thing 2 0.10 % No
crisis is a terrible 2 0.10 % No
a crisis is a 2 0.10 % No
get our act together 2 0.10 % No

Internal links in bdewilde.github.io

About Me
Archive
Intro to Automatic Keyphrase Extraction
On Starting Over with Jekyll
Friedman Corpus (3) — Occurrence and Dispersion
Friedman Corpus (1) — Background and Creation
Friedman Corpus (2) — Data Quality and Corpus Stats
While I Was Away
Intro to Natural Language Processing (2)
Intro to Natural Language Processing (1)
A Data Science Education?
Connecting to the Data Set
Data, Data, Everywhere
← previous
Burton DeWilde

Bdewilde.github.io HTML content


Intro to Natural Language Processing (2)
Burton DeWilde | About Me | Archive | CV

2013-04-16
information extraction, natural language processing, pos-tagging, tokenization, web scraping

A couple months ago, I posted a brief, conceptual overview of Natural Language Processing (NLP) as applied to the common task of information extraction (IE), that is, the process of extracting structured data from unstructured data, the majority of which is text. A significant component of my job at HI involves scraping text from websites, print articles, social media, and other sources, then analyzing the quantity and especially the quality of the discussion as it relates to a film and/or social issue. Although humans are inarguably better than machines at understanding natural language, it's impractical for humans to analyze large numbers of documents for themes, trends, content, sentiment, etc., and to do so consistently over time. This is where NLP comes in. In this post, I'll give practical details and example code for basic NLP tasks; in the next post, I'll delve deeper into the standard tokenization-tagging-chunking pipeline; and in subsequent posts, I'll move on to more interesting NLP tasks, including keyterm/keyphrase extraction, topic modeling, document classification, sentiment analysis, and text generation.

The first thing we need to get started is, of course, some sample text. Let's use this recent op-ed in the New York Times by Thomas Friedman, which is about as close to lorem ipsum as natural language gets. Although copy-pasting the text works fine for a single article, it quickly becomes a hassle for multiple articles; instead, let's do this programmatically and put our web scraping skillz to good use. A bare-bones Python script gets the job done:

import bs4
import requests

# GET html from NYT server, and parse it
response = requests.get('http://www.nytimes.com/2013/04/07/opinion/sunday/friedman-weve-wasted-our-timeout.html')
soup = bs4.BeautifulSoup(response.text)

article = ''
# select all tags containing article text, then extract the text from each
paragraphs = soup.find_all('p', itemprop='articleBody')
for paragraph in paragraphs:
    article += paragraph.get_text()
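The same pattern extends to more than one article: wrap the extraction in a small helper and loop over a list of URLs. This is only a minimal sketch, and the URLs below are placeholders; in practice you'd also want to check the response status and be polite about request rates.

def scrape_article_text(url):
    # fetch a page and join the text of its article-body paragraphs
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text)
    paragraphs = soup.find_all('p', itemprop='articleBody')
    return ''.join(p.get_text() for p in paragraphs)

# hypothetical list of op-ed URLs
urls = ['http://www.nytimes.com/path/to/op-ed-1.html',
        'http://www.nytimes.com/path/to/op-ed-2.html']
articles = [scrape_article_text(url) for url in urls]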
We have indeed retrieved the text of Friedman's vapid commentary:

YES, it's true – a crisis is a terrible thing to waste. But a "timeout" is also a terrible thing to waste, and as I look at the world today I wonder if that's exactly what we've just done. We've wasted a five-year timeout from geopolitics, and if we don't wake up and get our act together as a country – and if the Chinese, Russians and Europeans don't do the same – we're all really going to regret it. Think about what a relative luxury we've enjoyed since the Great Recession hit in 2008...

But it's not yet fit for analysis. The first steps in any NLP analysis are text cleaning and normalization. Although the specific steps we should take to clean and normalize our text depend on the analysis we mean to apply to it, a decent, general-purpose cleaning procedure removes any digits, non-ASCII characters, URLs, and HTML markup; standardizes white space and line breaks; and converts all text to lowercase. Like so:

def clean_text(text):
    from nltk import clean_html
    import re
    # strip html markup with handy NLTK function
    text = clean_html(text)
    # remove digits with regular expression
    text = re.sub(r'\d', ' ', text)
    # remove any patterns matching standard url format
    url_pattern = r'((http|ftp|https):\/\/)?[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
    text = re.sub(url_pattern, ' ', text)
    # remove all non-ascii characters
    text = ''.join(char for char in text if ord(char) < 128)
    # standardize white space
    text = re.sub(r'\s+', ' ', text)
    # drop capitalization
    text = text.lower()
    return text

After passing the article through clean_text, it comes out like this:

yes, its true a crisis is a terrible thing to waste. but a timeout is also a terrible thing to waste, and as i look at the world today i wonder if thats exactly what weve just done. weve wasted a five-year timeout from geopolitics, and if we dont wake up and get our act together as a country and if the chinese, russians and europeans dont do the same were all really going to regret it. think about what a relative luxury weve enjoyed since the great recession hit in ...

It may look worse to your eyes, but machines tend to perform better without the irrelevant features. As an extra step on top of cleaning, normalization comes in two varieties: stemming and lemmatization. Stemming strips off word affixes, leaving just the root stem, while lemmatization replaces a word by its root word or lemma, as might be found in a dictionary. For example, the word "grieves" is stemmed into "grieve" but lemmatized into "grief." The excellent NLTK Python library, with which I do much of my NLP work, provides an easy interface to multiple stemmers (Porter, Lancaster, Snowball) and a standard lemmatizer (WordNet, which is much more than just a lemmatizer).

Since normalization is applied word-by-word, it is inextricably linked with tokenization, the process of splitting text into pieces, i.e. sentences and words. For some analyses, tokenizing a document or a collection of documents (called a corpus) directly into words is fine; for others, it's necessary to first tokenize a text into sentences, then tokenize each sentence into words, resulting in nested lists. Although this seems like a straightforward task (words are separated by spaces, duh!), one notable complication arises from punctuation. Should "don't know" be tokenized as ["don't", "know"], ["don", "'t", "know"], or ["don", "'", "t", "know"]? I don't know. ;)

It's common, but not always appropriate, to filter out high-frequency words with little lexical content like "the," "it," and "so," called stop words. Of course, there's no universally accepted list, so you have to use your own judgement! Lastly, it's usually a good idea to put an upper bound on the length of words you'll keep. In English, average word length is about five letters, and the longest word in Shakespeare's works is 27 letters; errors in text sources or weird HTML cruft, however, can produce much longer strings of letters. It's a pretty safe bet to filter out words longer than 25 characters.
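Before wrapping all of this into a single function, here's a quick, illustrative sketch of the pieces in isolation: how two of NLTK's tokenizers split the same contraction differently, and how a stemmer and a lemmatizer treat the same word. Exact outputs depend on your NLTK version and downloaded data, so treat the comments as typical rather than guaranteed.

# assumes the relevant NLTK data (punkt, wordnet) has been downloaded via nltk.download()
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, wordpunct_tokenize

sample = "I don't know."
print(word_tokenize(sample))       # Treebank-style: typically ['I', 'do', "n't", 'know', '.']
print(wordpunct_tokenize(sample))  # splits on punctuation: typically ['I', 'don', "'", 't', 'know', '.']

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('grieves'))                   # a crude root stem
print(lemmatizer.lemmatize('grieves', pos='v'))  # a dictionary lemma, via WordNet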
As you can see below, NLTK and Python make all of this relatively easy:

def tokenize_and_normalize_doc(doc, filter_stopwords=True, normalize='lemma'):
    import nltk.corpus
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize
    from string import punctuation
    # use NLTK's default set of english stop words
    stops_list = nltk.corpus.stopwords.words('english')
    if normalize == 'lemma':
        # lemmatize with WordNet
        normalizer = WordNetLemmatizer()
    elif normalize == 'stem':
        # stem with Porter
        normalizer = PorterStemmer()
    # tokenize the document into sentences with NLTK default
    sents = sent_tokenize(doc)
    # tokenize each sentence into words with NLTK default
    tokenized_sents = [wordpunct_tokenize(sent) for sent in sents]
    # filter out "bad" words, normalize good ones
    normalized_sents = []
    for tokenized_sent in tokenized_sents:
        good_words = [word for word in tokenized_sent
                      # filter out too-long words
                      if len(word) < 25
                      # filter out lone punctuation
                      if word not in list(punctuation)]
        if filter_stopwords is True:
            good_words = [word for word in good_words
                          # filter out stop words
                          if word not in stops_list]
        if normalize == 'lemma':
            normalized_sents.append([normalizer.lemmatize(word) for word in good_words])
        elif normalize == 'stem':
            normalized_sents.append([normalizer.stem(word) for word in good_words])
        else:
            normalized_sents.append([word for word in good_words])
    return normalized_sents

Running our sample article through the grinder gives us this:

[['yes', 'true', 'crisis', 'terrible', 'thing', 'waste'], ['timeout', 'also', 'terrible', 'thing', 'waste', 'look', 'world', 'today', 'wonder', 'thats', 'exactly', 'weve', 'done'], ['weve', 'wasted', 'five-year', 'timeout', 'geopolitics', 'dont', 'wake', 'get', 'act', 'together', 'country', 'chinese', 'russian', 'european', 'dont', 'really', 'going', 'regret'], ['think', 'relative', 'luxury', 'weve', 'enjoyed', 'since', 'great', 'recession', 'hit'], ...

Slowly but surely, Friedman's insipid words are taking on a standardized, machine-friendly format.

The next key step in a typical NLP pipeline is part-of-speech (POS) tagging: classifying words into their context-appropriate part of speech and labeling them as such. Again, this seems like something that ought to be straightforward (kids are taught how to do this at a fairly young age, right?), but in practice it's not so simple. In general, the inherent ambiguity of natural language has a way of confusing NLP algorithms, and occasionally humans, too. For instance, think about all the ways "well" can be used in a sentence: noun, verb, adverb, adjective, and interjection (any others?). Plus, there's no "official" POS tagset for English, although the conventional sets, e.g. Penn Treebank, have upwards of 50 distinct parts of speech.

The simplest POS tagger out there assigns a default tag to each word; in English, singular nouns ("NN") are probably your best bet, although you'll only be right about 15% of the time! Other simple taggers determine POS from spelling: words ending in "-ment" tend to be nouns, "-ly" adverbs, "-ing" gerunds, and so on. Smarter taggers use the context of surrounding words to assign POS tags to each word. Basically, you calculate the frequency with which a tag has occurred in each context based on pre-tagged training data, then for a new word, assign the tag with the highest frequency for the given context. The models can get rather elaborate (more on this in my next post), but this is the gist.
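For a concrete sense of those simpler baselines, here is a generic sketch in the spirit of the NLTK book (not the tagger used below): a default tagger, a suffix-based regexp tagger, and frequency-based taggers trained on pre-tagged data, chained together with backoff. It assumes the Brown corpus has been downloaded via nltk.download().

import nltk
from nltk.corpus import brown

# baseline 1: tag everything as a singular noun
default_tagger = nltk.DefaultTagger('NN')

# baseline 2: guess the tag from word endings, as described above
regexp_tagger = nltk.RegexpTagger([
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ly$', 'RB'),     # adverbs
    (r'.*ment$', 'NN'),   # nouns like "government"
    (r'.*', 'NN'),        # everything else: noun
])

# baselines 3 and 4: pick the most frequent tag for each word / for each
# (previous tag, word) context, falling back to the simpler taggers
train_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)

print(bigram_tagger.tag(['a', 'crisis', 'is', 'a', 'terrible', 'thing', 'to', 'waste']))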
NLTK comes pre-loaded with a pretty decent POS tagger trained using a Maximum Entropy classifier on the Penn Treebank corpus (I think). See here:

def pos_tag_sents(tokenized_sents):
    from nltk.tag import pos_tag
    tagged_sents = [pos_tag(sent) for sent in tokenized_sents]
    return tagged_sents

Each tokenized word is now paired with its assigned part of speech in the form of (word, tag) tuples:

[[('yes', 'NNS'), ('its', 'PRP$'), ('true', 'JJ'), ('a', 'DT'), ('crisis', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('terrible', 'JJ'), ('thing', 'NN'), ('to', 'TO'), ('waste', 'VB')], [('but', 'CC'), ('a', 'DT'), ('timeout', 'NN'), ('is', 'VBZ'), ('also', 'RB'), ('a', 'DT'), ('terrible', 'JJ'), ('thing', 'NN'), ('to', 'TO'), ('waste', 'VB'), ('and', 'CC'), ('as', 'IN'), ('i', 'PRP'), ('look', 'VBP'), ('at', 'IN'), ('the', 'DT'), ('world', 'NN'), ('today', 'NN'), ('i', 'PRP'), ('wonder', 'VBP'), ('if', 'IN'), ('thats', 'NNS'), ('exactly', 'RB'), ('what', 'WP'), ('weve', 'VBP'), ('just', 'RB'), ('done', 'VBN')], ...

Great! The first word is incorrect: "yes" is not a plural noun ("NNS"). But apart from that, once you exclude weirdness arising from how I dealt with punctuation (by stripping it out, turning "it's" into "its," which was consequently tagged as a possessive pronoun), the tagger did pretty well. Note that I pulled back a bit from our previous text normalization by leaving stop words in and not lemmatizing: as I said, that's not appropriate for every task.

One final, fundamental task in NLP is chunking: the process of extracting standalone phrases, or "chunks," from a POS-tagged sentence without fully parsing the sentence (on a related note, chunking is also known as partial or shallow parsing). Chunking, for instance, can be used to identify the noun phrases present in a sentence, while full parsing could say which is the subject of the sentence and which the object. So why stop at chunking? Well, full parsing is computationally expensive and not very robust; in contrast, chunking is both fast and reliable, as well as sufficient for many practical uses in information extraction, relation recognition, and so on.

A simple chunker can use patterns in part-of-speech tags to determine the types and extents of chunks. For example, a noun phrase (NP) in English often consists of a determiner, followed by an adjective, followed by a noun: the/DT fierce/JJ queen/NN. A more thorough definition might include a possessive pronoun, any number of adjectives, and more than one (singular/plural, proper) noun: his/PRP$ adorable/JJ fluffy/JJ kitties/NNS. I've implemented one such regular expression-based chunker in NLTK, which looks for noun, prepositional, and verb phrases, as well as full clauses:

def chunk_tagged_sents(tagged_sents):
    from nltk.chunk import regexp
    # define a chunk "grammar", i.e. chunking rules
    grammar = r"""
        NP: {<DT|PP\$>?<JJ>*<NN.*>+}    # noun phrase
        PP: {<IN><NP>}                  # prepositional phrase
        VP: {<MD>?<VB.*><NP|PP>}        # verb phrase
        CLAUSE: {<NP><VP>}              # full clause
        """
    chunker = regexp.RegexpParser(grammar, loop=2)
    chunked_sents = [chunker.parse(tagged_sent) for tagged_sent in tagged_sents]
    return chunked_sents

def get_chunks(chunked_sents, chunk_type='NP'):
    all_chunks = []
    # chunked sentences are in the form of nested trees
    for tree in chunked_sents:
        chunks = []
        # iterate through subtrees / leaves to get individual chunks
        raw_chunks = [subtree.leaves() for subtree in tree.subtrees()
                      if subtree.node == chunk_type]
        for raw_chunk in raw_chunks:
            chunk = []
            for word_tag in raw_chunk:
                # drop POS tags, keep words
                chunk.append(word_tag[0])
            chunks.append(' '.join(chunk))
        all_chunks.append(chunks)
    return all_chunks

I also included a function that iterates through the resulting parse trees and grabs only chunks of a certain type, e.g. noun phrases. Here's how Friedman fares:

[['yes', 'a crisis', 'a terrible thing'], ['a timeout', 'a terrible thing', 'the world today', 'thats'], ['weve', 'a five-year timeout', 'geopolitics', 'act', 'a country', 'the chinese russians', 'europeans'], ...

Well, it could be worse for a basic run-through! We've grabbed a handful of simple NPs, and since this is Thomas Friedman's writing, I suppose that's all one can reasonably hope for. (There's probably a "garbage in, garbage out" joke to be made here.) You can see that removing punctuation has continued to cause trouble ("weve" is not a noun phrase), which underscores how important text cleaning is and how decisions earlier in the pipeline affect results further along.

In my next NLP post, I'll discuss how to improve this basic pipeline and thereby improve subsequent, higher-level results. For more information, check out Natural Language Processing with Python (free here), a great introduction to NLP and NLTK. Another practical resource is streamhacker.com and the associated book, Python Text Processing with NLTK 2.0 Cookbook. If you want NLP without NLTK, Stanford's CoreNLP software is a standalone Java implementation of the basic NLP pipeline that requires minimal code on the user's part (note: I tried it and was not particularly impressed). Or you could just wait for my next post. :)

Burton DeWilde
data scientist / physicist / filmmaker
© 2014 Burton DeWilde. All rights reserved.
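Putting it all together, here is a minimal sketch of the whole pipeline chained end to end, assuming the functions defined above are in scope and that `article` still holds the scraped op-ed text. The flags mirror the choices described in the post: stop words left in and no lemmatization, so the POS tagger and chunker see full sentences.

cleaned = clean_text(article)
tokenized_sents = tokenize_and_normalize_doc(cleaned, filter_stopwords=False, normalize=None)
tagged_sents = pos_tag_sents(tokenized_sents)
chunked_sents = chunk_tagged_sents(tagged_sents)
noun_phrases = get_chunks(chunked_sents, chunk_type='NP')
print(noun_phrases[0])  # noun phrases from the first sentence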