Intro to Natural Language Processing (1)

Burton DeWilde
2012-12-16
tags: Harmony Institute, information extraction, natural language processing, zombies

First, the big news: I got a job! I'm now a data scientist at a non-profit organization here in Manhattan called Harmony Institute, where we study the science of influence through entertainment. Basically, simple metrics like box office sales and television viewers don't adequately quantify a film or show's social impact; we use theory-driven methodology to more fully assess this impact in individuals and across networks. In the case of, say, a social justice documentary that was made specifically to have such an impact, we are able to quantify the film's level of success. More on this over time, I'm sure.

Second, more big news: "Decay," the feature-length, physics-themed zombie movie I co-produced, filmed, and edited while also earning my physics PhD at CERN, has been released online! You can watch or download it for free! We've had over 200,000 views and downloads in the first week alone, and a fair amount of international press. I've been tracking our "buzz" for the past couple of months and will do some fancy analytics after we're off the post-release peak, so stay tuned. And check out the movie; people seem to have really enjoyed it. :)

Now, back to data science! I wanted to properly introduce Natural Language Processing (NLP for short) after briefly mentioning it in my previous post on web scraping. As you probably know from personal experience, much of the information available online comes in the form of "natural language" like English or Spanish (as opposed to "structured language" like Python or mathematics). Broadly speaking, NLP is computer manipulation of natural language: from word counts to AutoCorrect, machine translation to sentiment analysis, part-of-speech tagging to speech recognition. NLP is a huge and increasingly vital field, allowing for more intuitive human-computer interaction and, of particular importance to data scientists, more effective extraction of structured information from unstructured text.

Here I'll focus on that last bit, the very useful task of information extraction. Essentially, we want to identify information expressed in a natural language document and convert it into a structured, machine-friendly representation for further analysis. In practice, however, it's much easier to focus on asking specific questions, i.e. looking for specific "entity relations" in the text. Let's say we want to know who created the X-Men comics, given the first paragraph of their Wikipedia page:

"The X-Men are a superhero team in the Marvel Comics Universe. They were created by writer Stan Lee and artist Jack Kirby, and first appeared in The X-Men #1 (September 1963). The basic concept of the X-Men is that under a cloud of increasing anti-mutant sentiment, Professor Xavier created a haven at his Westchester mansion to train young mutants to use their powers for the benefit of humanity, and to prove mutants can be heroes. Xavier recruited Cyclops, Iceman, Angel, Beast, and Marvel Girl, calling them 'X-Men' because they possess special powers due to their possession of the 'X-gene,' a gene which normal humans lack and which gives mutants their abilities."

A person can read this and readily answer the question (Stan Lee and Jack Kirby), but the complexity of natural language makes it difficult for a machine to do the same. It helps to split the task into an ordered pipeline of sub-tasks, starting from the raw text of a document and ending with a list of relations:

Sentence Segmentation: Before manipulating text at the level of individual words, it is often necessary to split or "segment" the text into sentences. This isn't trivial, since periods are used in acronyms (U.S.A., Mr.) as well as at sentence endings, sometimes simultaneously, and other sentence-ending punctuation (?!) may be used in different, non-standard ways. (Note that this is for English; other languages do things differently!) Using the raw text as input, this step outputs a list of its constituent sentences, e.g. ['The X-Men are a superhero team in the Marvel Comics Universe.', 'They were created by writer Stan Lee and artist Jack Kirby, and first appeared in The X-Men #1 (September 1963).', ...]

Word Tokenization: Before trying to understand the meanings of words, we first have to identify the words themselves. In English, words are often delimited by white space, though that fails in the common case of contractions ("it's" = "it" and "is"; "won't" = "will" and "not"). In other languages, such as Chinese or Thai, where words are not delimited, word tokenization is much harder. In this step, our input is a list of sentences and our output is a nested list in which each sentence is represented by a list of its constituent words, e.g. [['The', 'X-Men', 'are', 'a', 'superhero', 'team', 'in', 'the', 'Marvel', 'Comics', 'Universe', '.'], ...]
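For concreteness, here's a minimal sketch of these first two steps using NLTK, the Python package discussed at the end of this post. This is an illustration assuming a standard NLTK 3 setup; the exact model name in the download call varies across NLTK versions.

    import nltk

    # one-time download of the pre-trained Punkt sentence segmenter
    # (assumes a standard NLTK setup; model names vary by version)
    nltk.download('punkt')

    raw_text = (
        "The X-Men are a superhero team in the Marvel Comics Universe. "
        "They were created by writer Stan Lee and artist Jack Kirby, and "
        "first appeared in The X-Men #1 (September 1963)."
    )

    # step 1: sentence segmentation -- raw text in, list of sentences out
    sentences = nltk.sent_tokenize(raw_text)

    # step 2: word tokenization -- list of sentences in, nested list of words out
    tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

    print(tokenized_sentences[1][:6])  # ['They', 'were', 'created', 'by', 'writer', 'Stan']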
Part-of-speech Tagging: The process of classifying words by their parts of speech and labeling them accordingly is known as part-of-speech tagging, or POST. This is another necessary precursor to understanding the relationships between words in a sentence, given that the same word may represent different parts of speech in different contexts. For example: "Gas prices are up [adverb]." vs. "He climbed up [preposition] the ladder." vs. "They've had some ups [noun] and downs." From a nested list of sentences of words, this step outputs a nested list in which each word is stored as a pair, with one value for the word itself and another for its part of speech, e.g. [[('The', 'DT'), ('X-Men', 'JJ'), ('are', 'VBP'), ('a', 'DT'), ('superhero', 'NN'), ('team', 'NN'), ('in', 'IN'), ('the', 'DT'), ('Marvel', 'NNP'), ('Comics', 'NNP'), ('Universe', 'NNP'), ('.', '.')], ...] (Code sketches of this step and the remaining ones follow the walkthrough.)

Entity Recognition: Higher-level conceptual entities are recognized as such through a process called chunking, a common precursor to relation recognition (our end goal) in which linked sets of words are grouped or "chunked" together. You can chunk noun phrases, or verb phrases, or prepositional phrases based on words' ordering and parts of speech in a sentence; often, you want to look for named entities ("NE") that correspond to people, places, organizations, etc. The output of this step is a list of hierarchical trees, e.g. [Tree('S', [('The', 'DT'), ('X-Men', 'JJ'), ('are', 'VBP'), ('a', 'DT'), ('superhero', 'NN'), ('team', 'NN'), ('in', 'IN'), ('the', 'DT'), Tree('ORGANIZATION', [('Marvel', 'NNP'), ('Comics', 'NNP'), ('Universe', 'NNP')]), ('.', '.')]), ...] Note: the chunker correctly recognized "Marvel Comics Universe" but missed "The X-Men," partly because the POST incorrectly classified "X-Men" as an adjective ("JJ"). NLP is a statistical process, and errors happen!

Relation Recognition: Finally, we can try to identify the relations that exist between entities. This may be possible through a regular expression parser that looks for a particular order of subsequent parts of speech, or something fancier using the hierarchical relationships between named entities. In the case of the former, searching for one or more verbs followed by a preposition followed by one or more nouns ("<VB.*>+<IN><NN.*>+") would find information of the form "created by writer Stan Lee." As it turns out, that phrase is the only one in our example paragraph that fits the pattern, and it is indeed a partial answer to our question. Note, however, that it misses the full answer: "created by writer Stan Lee and artist Jack Kirby." This can be accounted for in the parser's pattern, sure, but it's much tougher knowing ahead of time what to search for and whether you've correctly found it.
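First, the promised sketch of the tagging and chunking steps, continuing from the tokenized_sentences list built above. Again, the model names in the download calls are assumptions that vary across NLTK versions.

    # one-time downloads of pre-trained tagger and chunker models
    # (model names are an assumption; they vary across NLTK versions)
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')

    # step 3: part-of-speech tagging -- word lists in, (word, tag) pairs out
    tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]

    # step 4: named-entity chunking -- tagged sentences in, hierarchical Trees out
    chunked_sentences = [nltk.ne_chunk(sent) for sent in tagged_sentences]

    # named entities appear as labeled subtrees, e.g. Tree('ORGANIZATION', ...)
    print(chunked_sentences[0])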
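And a sketch of the regular-expression approach to relation recognition just described. nltk.RegexpParser chunks on tag patterns, so wrapping the pattern from the text in a grammar rule (the rule name REL is my own choice) surfaces matching phrases:

    # step 5: relation recognition -- one or more verbs, then a preposition,
    # then one or more nouns, per the pattern discussed above
    grammar = 'REL: {<VB.*>+<IN><NN.*>+}'
    relation_parser = nltk.RegexpParser(grammar)

    for tagged_sent in tagged_sentences:
        tree = relation_parser.parse(tagged_sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'REL':  # .label() is the NLTK 3.x API
                print(' '.join(word for word, tag in subtree.leaves()))
    # expect a phrase like: "(were) created by writer Stan Lee"

As noted above, the conjunction "and" breaks the match, so "artist Jack Kirby" is missed unless the pattern is extended.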
This has been a little preview of the power of natural language processing. Throughout, I've used a comprehensive Python package called the Natural Language ToolKit (NLTK), available for free download here; happily, a companion textbook is also available for free, and together they provide an accessible introduction to NLP that will take you pretty far. If you want to learn more, you might also want to sign up for an upcoming Coursera class on the topic, though its date is still TBD. I've been using NLP and NLTK quite a bit in my new job (not to mention web scraping, among other things), so I'm sure this topic will keep coming up. Stay tuned.

Burton DeWilde
data scientist / physicist / filmmaker
© 2014 Burton DeWilde. All rights reserved.