Intro to Automatic Keyphrase Extraction
Burton DeWilde
2014-09-23
feature design, frequency statistics, keyphrase extraction, graph-based ranking, NLP, task reformulation

I often apply natural language processing for purposes of automatically extracting structured information from unstructured (text) datasets. One such task is the extraction of important topical words and phrases from documents, commonly known as terminology extraction or automatic keyphrase extraction. Keyphrases provide a concise description of a document's content; they are useful for document categorization, clustering, indexing, search, and summarization; quantifying semantic similarity with other documents; as well as conceptualizing particular knowledge domains.

Despite wide applicability and much research, keyphrase extraction suffers from poor performance relative to many other core NLP tasks, partly because there's no objectively "correct" set of keyphrases for a given document. While human-labeled keyphrases are often considered to be the gold standard, humans disagree about what that standard is! As a general rule of thumb, keyphrases should be relevant to one or more of a document's major topics, and the set of keyphrases describing a document should provide good coverage of all major topics. (They should also be understandable and grammatical, of course.) The fundamental difficulty lies in determining which keyphrases are the most relevant and provide the best coverage. As described in Automatic Keyphrase Extraction: A Survey of the State of the Art, several factors contribute to this difficulty, including document length, structural inconsistency, changes in topic, and (a lack of) correlations between topics.

Methodology

Automatic keyphrase extraction is typically a two-step process: first, a set of words and phrases that could convey the topical content of a document is identified; then these candidates are scored/ranked, and the "best" are selected as a document's keyphrases.

1. Candidate Identification

A brute-force method might consider all words and/or phrases in a document as candidate keyphrases. However, given computational costs and the fact that not all words and phrases in a document are equally likely to convey its content, heuristics are typically used to identify a smaller subset of better candidates. Common heuristics include removing stop words and punctuation; filtering for words with certain parts of speech or, for multi-word phrases, certain POS patterns; and using external knowledge bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.
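To make the brute-force baseline concrete, here's a minimal sketch using NLTK; the helper name extract_candidate_ngrams and its interface are mine, for illustration only:

def extract_candidate_ngrams(text, max_n=5):
    import nltk

    words = [word.lower()
             for sent in nltk.sent_tokenize(text)
             for word in nltk.word_tokenize(sent)]
    # every contiguous n-gram with 1 <= n <= max_n is a candidate
    return [' '.join(ngram)
            for n in range(1, max_n + 1)
            for ngram in zip(*[words[i:] for i in range(n)])]

Since every contiguous n-gram becomes a candidate, the candidate set balloons rapidly with document length, which is exactly why the heuristics above are useful.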
For example, rather than taking all of the n-grams (where 1 ≤ n ≤ 5) in this post's first two paragraphs as candidates, we might limit ourselves to only noun phrases matching the POS pattern {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+} (a regular expression written in a simplified format used by NLTK's RegexpParser()). This matches any number of adjectives followed by at least one noun that may be joined by a preposition to one other adjective(s)+noun(s) sequence, and results in the following candidates:

['art', 'automatic keyphrase extraction', 'changes in topic', 'concise description', 'content', 'coverage', 'difficulty', 'document', 'document categorization', 'document length', 'extraction of important topical words', 'fundamental difficulty', 'general rule of thumb', 'gold standard', 'good coverage', 'human-labeled keyphrases', 'humans', 'indexing', 'keyphrases', 'major topics', 'many other core nlp tasks', 'much research', 'natural language processing for purposes', 'particular knowledge domains', 'phrases from documents', 'search', 'semantic similarity with other documents', 'set of keyphrases', 'several factors', 'state', 'structural inconsistency', 'summarization', 'survey', 'terminology extraction', 'topics', 'wide applicability', 'work']

Compared to the brute-force result, which gives 1100+ candidate n-grams, most of which are almost certainly not keyphrases (e.g. "task", "relative to", "and the set", "survey of the state", …), this seems like a much smaller and more likely set of candidates, right? As document length increases, though, even the number of likely candidates can get quite large. Selecting the best keyphrase candidates is the objective of step 2.

2. Keyphrase Selection

Researchers have devised a plethora of methods for distinguishing between good and bad (or better and worse) keyphrase candidates. The simplest rely solely on frequency statistics, such as TF*IDF or BM25, to score candidates, assuming that a document's keyphrases tend to be relatively frequent within the document as compared to an external reference corpus. Unfortunately, their performance is mediocre; researchers have demonstrated that the best keyphrases aren't necessarily the most frequent within a document. (For a statistical analysis of human-generated keyphrases, check out Descriptive Keyphrases for Text Visualization.) A next attempt might score candidates using multiple statistical features combined in an ad hoc or heuristic manner, but this approach only goes so far. More sophisticated methods apply machine learning to the problem. They fall into two broad categories.

Unsupervised

Unsupervised machine learning methods attempt to discover the underlying structure of a dataset without the assistance of already-labeled examples ("training data"). The canonical unsupervised approach to automatic keyphrase extraction uses a graph-based ranking method, in which the importance of a candidate is determined by its relatedness to other candidates, where "relatedness" may be measured by two terms' frequency of co-occurrence or semantic relatedness. This method assumes that more important candidates are related to a greater number of other candidates, and that more of those related candidates are also considered important; it does not, however, ensure that selected keyphrases cover all major topics, although multiple variations try to compensate for this weakness. Essentially, a document is represented as a network whose nodes are candidate keyphrases (typically only key words) and whose edges (optionally weighted by the degree of relatedness) connect related candidates. Then, a graph-based ranking algorithm, such as Google's famous PageRank, is run over the network, and the highest-scoring terms are taken to be the document's keyphrases.
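In miniature, that pipeline looks something like this; a toy sketch using networkx with a made-up five-node network (a fuller TextRank implementation appears in the Results section below):

import networkx

# toy co-occurrence network: nodes are candidate words, edges connect
# candidates that co-occur within some window in the document
graph = networkx.Graph()
graph.add_edges_from([
    ('keyphrase', 'extraction'), ('keyphrase', 'document'),
    ('extraction', 'document'), ('document', 'topics'),
    ('topics', 'coverage'),
])
# PageRank scores each node by the number and importance of its neighbors;
# the top-scoring nodes are taken to be the document's keywords
for word, score in sorted(networkx.pagerank(graph).items(),
                          key=lambda x: x[1], reverse=True):
    print(word, round(score, 3))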
The most famous instantiation of this approach is TextRank; a variation that attempts to ensure good topic coverage is DivRank. For a more extensive breakdown, see Conundrums in Unsupervised Keyphrase Extraction, which includes an example of a topic-based clustering method, the other main class of unsupervised keyphrase extraction algorithms (which I'm not going to delve into).

Unsupervised approaches have at least one notable strength: No training data required! In an age of massive but unlabeled datasets, this can be a huge advantage over other approaches. As for disadvantages, unsupervised methods make assumptions that don't necessarily hold across different domains, and up until recently, their performance has been inferior to supervised methods. Which brings me to the next section.

Supervised

Supervised machine learning methods use training data to infer a function that maps a set of input variables called features to some desired (and known) output value; ideally, this function can correctly predict the (unknown) output values of new examples based on their features alone. The two primary developments in supervised approaches to automatic keyphrase extraction deal with task reformulation and feature design.

Early implementations recast the problem of extracting keyphrases from a document as a binary classification problem, in which some fraction of candidates are classified as keyphrases and the rest as non-keyphrases. This is a well-understood problem, and there are many methods to solve it: Naive Bayes, decision trees, and support vector machines, among others. However, this reformulation of the task is conceptually problematic; humans don't judge keyphrases independently of one another, instead they judge certain phrases as more key than others in an inherently relative sense. As such, more recently the problem has been reformulated as a ranking problem, in which a function is trained to rank candidates pairwise according to degree of "keyness". The best candidates rise to the top, and the top N are taken to be the document's keyphrases.

The second line of research into supervised approaches has explored a wide variety of features used to discriminate between keyphrases and non-keyphrases. The most common are the same frequency statistics, along with a grab-bag of other statistical features: phrase length (number of constituent words), phrase position (normalized position within a document of first and/or last occurrence therein), and "supervised keyphraseness" (number of times a keyphrase appears as such in the training data). Some models take advantage of a document's structural features (titles, abstracts, intros and conclusions, metadata, and so on), because a candidate is more likely to be a keyphrase if it appears in notable sections. Others are external resource-based features: "Wikipedia-based keyphraseness" assumes that keyphrases are more likely to appear as Wiki article links and/or titles, while phrase commonness compares a candidate's frequency in a document with respect to its frequency in an external corpus. The list of possible features goes on and on.

A well-known implementation of the binary classification method, KEA (as published in Practical Automatic Keyphrase Extraction), used TF*IDF and position of first occurrence (while filtering on phrase length) to identify keyphrases.
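To make the classification reformulation concrete, here's a minimal sketch in the spirit of KEA; it uses scikit-learn's GaussianNB rather than KEA's own implementation, and the feature values are made up:

from sklearn.naive_bayes import GaussianNB

# hypothetical feature vectors per candidate:
# [tf*idf, normalized position of first occurrence]
X_train = [[0.57, 0.01], [0.31, 0.20], [0.02, 0.85], [0.01, 0.95]]
y_train = [1, 1, 0, 0]  # 1 = human-labeled keyphrase, 0 = non-keyphrase

clf = GaussianNB().fit(X_train, y_train)
# each new candidate is classified independently of all the others;
# that independence is exactly the conceptual weakness noted above
print(clf.predict([[0.45, 0.05], [0.03, 0.70]]))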
In A Ranking Approach to Keyphrase Extraction, researchers used a Linear Ranking SVM to rank candidate keyphrases with much success (but failed to give their algorithm a catchy name). Supervised approaches have often achieved better performance than unsupervised approaches; however, good training data is hard to find (although here's a decent place to start), and the danger of training a model that doesn't generalize to unseen examples is something to always guard against (e.g. through cross-validation).

Results

Okay, now that I've scared/bored away all but the truly interested, let's dig into some code and results! As an example document, I'll use all of the text in this post up to this results section; as a reference corpus, I'll use all other posts on this blog. In principle, a reference corpus isn't necessary for single-document keyphrase extraction (case in point: TextRank), but it's often helpful to compare a document's candidates against other documents' in order to characterize its particular content. Consider that tf*idf reduces to just tf (term frequency) in the case of a single document, since idf (inverse document frequency) is the same value for every candidate.

As mentioned, there are many ways to extract candidate keyphrases from a document; here's a simplified and compact implementation of the "noun phrases only" heuristic method:

def extract_candidate_chunks(text, grammar=r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'):
    import itertools, nltk, string

    # exclude candidates that are stop words or entirely punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # tokenize, POS-tag, and chunk using regular expressions
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text))
    all_chunks = list(itertools.chain.from_iterable(nltk.chunk.tree2conlltags(chunker.parse(tagged_sent))
                                                    for tagged_sent in tagged_sents))

    # join constituent chunk words into a single chunked phrase
    candidates = [' '.join(word for word, pos, chunk in group).lower()
                  for key, group in itertools.groupby(all_chunks, lambda wpc: wpc[2] != 'O')
                  if key]

    return [cand for cand in candidates
            if cand not in stop_words and not all(char in punct for char in cand)]

When text is assigned to the first two paragraphs of this post, set(extract_candidate_chunks(text)) returns more or less the same set of candidate keyphrases as listed in 1. Candidate Identification. (Additional cleaning and filtering code improves the list a bit and helps to make up for tokenizing/tagging/chunking errors.)
For comparison, the original TextRank algorithm performs best when extracting all (unigram) nouns and adjectives, like so:

def extract_candidate_words(text, good_tags=set(['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'NNPS'])):
    import itertools, nltk, string

    # exclude candidates that are stop words or entirely punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # tokenize and POS-tag words
    tagged_words = itertools.chain.from_iterable(nltk.pos_tag_sents(nltk.word_tokenize(sent)
                                                                    for sent in nltk.sent_tokenize(text)))
    # filter on certain POS tags and lowercase all words
    candidates = [word.lower() for word, tag in tagged_words
                  if tag in good_tags and word.lower() not in stop_words
                  and not all(char in punct for char in word)]

    return candidates

In this case, set(extract_candidate_words(text)) gives basically the same set of words visualized as a network in the sub-section on unsupervised methods.

Code for keyphrase selection depends entirely on the approach taken, of course. It's relatively straightforward to implement the simplest, frequency statistic-based approach using scikit-learn or gensim:

def score_keyphrases_by_tfidf(texts, candidates='chunks'):
    import gensim, nltk

    # extract candidates from each text in texts, either chunks or words
    if candidates == 'chunks':
        boc_texts = [extract_candidate_chunks(text) for text in texts]
    elif candidates == 'words':
        boc_texts = [extract_candidate_words(text) for text in texts]

    # make gensim dictionary and corpus
    dictionary = gensim.corpora.Dictionary(boc_texts)
    corpus = [dictionary.doc2bow(boc_text) for boc_text in boc_texts]

    # transform corpus with tf*idf model
    tfidf = gensim.models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    return corpus_tfidf, dictionary

First we assign texts to a list of normalized text content (stripped of various YAML, HTML, and Markdown formatting) from all previous blog posts plus the first two sections of this post, then we call score_keyphrases_by_tfidf(texts) to get all posts back in a sparse, tf*idf-weighted representation. It's now trivial to print out the 20 candidate keyphrases with the highest tf*idf values for this blog post:

keyphrase                           tfidf
-----------------------------------------
keyphrases......................... 0.573
document........................... 0.375
candidates......................... 0.306
approaches......................... 0.191
approach........................... 0.115
candidate.......................... 0.115
major topics....................... 0.115
methods............................ 0.115
automatic keyphrase extraction..... 0.076
frequency statistics............... 0.076
keyphrase.......................... 0.076
keyphrase candidates............... 0.076
network............................ 0.076
relatedness........................ 0.076
researchers........................ 0.076
set of keyphrases.................. 0.076
state.............................. 0.076
survey............................. 0.076
function........................... 0.075
performance........................ 0.075

Not too shabby! Although you can clearly see how stemming or lemmatizing candidates would improve results (candidate / candidates, approach / approaches, and keyphrase / keyphrases would normalize together).
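That normalization is cheap to add. For instance, a minimal sketch using NLTK's WordNetLemmatizer; the helper name normalize_candidates is mine:

def normalize_candidates(candidates):
    import nltk

    lemmatizer = nltk.stem.WordNetLemmatizer()
    normalized = []
    for cand in candidates:
        words = cand.split()
        # lemmatize the head (final) word, so 'candidates' -> 'candidate'
        # and 'keyphrase candidates' -> 'keyphrase candidate'
        words[-1] = lemmatizer.lemmatize(words[-1])
        normalized.append(' '.join(words))
    return normalized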
You can also see that this approach seems to favor unigram keyphrases, likely owing to their much higher frequencies of occurrence in natural language texts. Considering that human-selected keyphrases are most often bigrams (according to the analysis in Descriptive Keyphrases for Text Visualization), this seems to be another limitation of such simplistic methods.

Now, let's try a bare-bones implementation of the TextRank algorithm. To keep it simple, only unigram candidates (not chunks or n-grams) are added to the network as nodes, the co-occurrence window size is fixed at 2 (so only adjacent words are said to "co-occur"), and the edges between nodes are unweighted (rather than weighted by the number of co-occurrences). The N top-scoring candidates are taken to be the document's keywords; sequences of adjacent keywords are merged to form keyphrases, and their individual PageRank scores are averaged, so as not to bias for longer keyphrases.

def score_keyphrases_by_textrank(text, n_keywords=0.05):
    from itertools import takewhile, tee
    import networkx, nltk

    # tokenize for all words, and extract *candidate* words
    words = [word.lower()
             for sent in nltk.sent_tokenize(text)
             for word in nltk.word_tokenize(sent)]
    candidates = extract_candidate_words(text)

    # build graph, each node is a unique candidate
    graph = networkx.Graph()
    graph.add_nodes_from(set(candidates))

    # iterate over word-pairs, add unweighted edges into graph
    def pairwise(iterable):
        """s -> (s0,s1), (s1,s2), (s2,s3), ..."""
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)
    for w1, w2 in pairwise(candidates):
        if w2:
            graph.add_edge(*sorted([w1, w2]))

    # score nodes using default pagerank algorithm, sort by score, keep top n_keywords
    ranks = networkx.pagerank(graph)
    if 0 < n_keywords < 1:
        n_keywords = int(round(len(candidates) * n_keywords))
    word_ranks = {word_rank[0]: word_rank[1]
                  for word_rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True)[:n_keywords]}
    keywords = set(word_ranks.keys())

    # merge keywords into keyphrases
    keyphrases = {}
    j = 0
    for i, word in enumerate(words):
        if i < j:
            continue
        if word in keywords:
            kp_words = list(takewhile(lambda x: x in keywords, words[i:i+10]))
            avg_pagerank = sum(word_ranks[w] for w in kp_words) / float(len(kp_words))
            keyphrases[' '.join(kp_words)] = avg_pagerank
            # counter as hackish way to ensure merged keyphrases are non-overlapping
            j = i + len(kp_words)

    return sorted(keyphrases.items(), key=lambda x: x[1], reverse=True)

With text as the first two sections of this post, calling score_keyphrases_by_textrank(text) returns the following top 20 keyphrases:

keyphrase                        textrank
--------------------------------------------
keyphrases......................... 0.028
candidates......................... 0.022
document........................... 0.022
candidate keyphrases............... 0.019
best keyphrases.................... 0.018
keyphrase candidates............... 0.017
likely candidates.................. 0.015
best candidates.................... 0.015
best keyphrase candidates.......... 0.014
features........................... 0.013
keyphrase.......................... 0.012
keyphrase extraction............... 0.012
extraction......................... 0.012
methods............................ 0.011
candidate.......................... 0.01
words.............................. 0.01
automatic keyphrase extraction..... 0.01
approaches......................... 0.009
problem............................ 0.009
set................................ 0.008

Again, not too shabby, but obviously there's room for improvement.
You can see that this algorithm occasionally produces novel and high-quality keyphrases, but there's a fair amount of noise, too. Normalization of candidates (keyphrase / keyphrases, …) could help, as could better cleaning and filtering. Furthermore, experimenting with different aspects of the algorithm (as in variants like DivRank, SingleRank, ExpandRank, and CollabRank), including co-occurrence window size, weighted graphs, and the manner in which keywords are merged into keyphrases, has been shown to produce better results.

Lastly, let's try a supervised algorithm. I prefer a ranking approach over binary classification, for conceptual as well as result quality reasons. Conveniently, someone has already implemented a pairwise Ranking SVM in Python, and blogged about it! Feature design is something of an art; drawing on multiple sources for inspiration, I extracted a diverse grab-bag of features:

- frequency-based: term frequency, $g^2$, corpus and web "commonness" (as specified here)
- statistical: term length, spread, lexical cohesion, max word length
- grammatical: "is acronym", "is named entity"
- positional: normalized positions of first and last occurrence, "is in title", "is in key excerpt" (such as an abstract or introductory paragraph)

Feature extraction can get very complicated and convoluted. In the interest of brevity and simplicity, then, here's a partial example:

def extract_candidate_features(candidates, doc_text, doc_excerpt, doc_title):
    import collections, math, nltk, re

    candidate_scores = collections.OrderedDict()

    # get word counts for document
    doc_word_counts = collections.Counter(word.lower()
                                          for sent in nltk.sent_tokenize(doc_text)
                                          for word in nltk.word_tokenize(sent))

    for candidate in candidates:

        pattern = re.compile(r'\b'+re.escape(candidate)+r'(\b|[,;.!?]|\s)', re.IGNORECASE)

        # frequency-based
        # number of times candidate appears in document
        cand_doc_count = len(pattern.findall(doc_text))
        # count could be 0 for multiple reasons; shit happens in a simplified example
        if not cand_doc_count:
            print('**WARNING:', candidate, 'not found!')
            continue

        # statistical
        candidate_words = candidate.split()
        max_word_length = max(len(w) for w in candidate_words)
        term_length = len(candidate_words)
        # get frequencies for term and constituent words
        sum_doc_word_counts = float(sum(doc_word_counts[w] for w in candidate_words))
        try:
            # lexical cohesion doesn't make sense for 1-word terms
            if term_length == 1:
                lexical_cohesion = 0.0
            else:
                lexical_cohesion = term_length * (1 + math.log(cand_doc_count, 10)) * cand_doc_count / sum_doc_word_counts
        except (ValueError, ZeroDivisionError):
            lexical_cohesion = 0.0

        # positional
        # found in title, key excerpt
        in_title = 1 if pattern.search(doc_title) else 0
        in_excerpt = 1 if pattern.search(doc_excerpt) else 0
        # first/last position, difference between them (spread)
        doc_text_length = float(len(doc_text))
        first_match = pattern.search(doc_text)
        abs_first_occurrence = first_match.start() / doc_text_length
        if cand_doc_count == 1:
            spread = 0.0
            abs_last_occurrence = abs_first_occurrence
        else:
            for last_match in pattern.finditer(doc_text):
                pass
            abs_last_occurrence = last_match.start() / doc_text_length
            spread = abs_last_occurrence - abs_first_occurrence

        candidate_scores[candidate] = {'term_count': cand_doc_count,
                                       'term_length': term_length,
                                       'max_word_length': max_word_length,
                                       'spread': spread,
                                       'lexical_cohesion': lexical_cohesion,
                                       'in_excerpt': in_excerpt,
                                       'in_title': in_title,
                                       'abs_first_occurrence': abs_first_occurrence,
                                       'abs_last_occurrence': abs_last_occurrence}

    return candidate_scores

As an example, candidate_scores["automatic keyphrase extraction"] returns the following features:

{'abs_first_occurrence': 0.029178287921046986,
 'abs_last_occurrence': 0.9301652006007295,
 'in_excerpt': 1,
 'in_title': 1,
 'lexical_cohesion': 0.9699006820274416,
 'max_word_length': 10,
 'spread': 0.9009869126796826,
 'term_count': 6,
 'term_length': 3}

The last thing to do is train a Ranking SVM model on an already-labeled dataset; I used the SemEval 2010 keyphrase extraction dataset, plus a couple of extra bits and pieces, which can be found in this GitHub repo. When applied to the first two sections of this blog post, the 20 top-scoring candidates are as follows:

keyphrase                         ranksvm
-------------------------------------------
keyphrase extraction............... 1.736
document categorization............ 1.151
particular knowledge domains....... 1.031
phrases from documents............. 1.014
keyphrase.......................... 0.97
terminology extraction............. 0.951
keyphrases......................... 0.909
set of keyphrases.................. 0.895
concise description................ 0.873
document........................... 0.691
human-labeled keyphrases........... 0.643
candidate identification........... 0.642
frequency of co-occurrence......... 0.636
candidate keyphrases............... 0.624
wide applicability................. 0.604
rest as non-keyphrases............. 0.578
binary classification problem...... 0.567
canonical unsupervised approach.... 0.566
structural inconsistency........... 0.556
paragraphs as candidates........... 0.548

Now that is a nice set of keyphrases! There's some bias for longer keyphrases (and longer words within keyphrases), perhaps because the training dataset was about 90% scientific articles, but it's not inappropriate for this science-ish blog's content.
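For the curious, the pairwise ranking trick itself fits in a few lines. Here's a minimal sketch of my own (not the linked implementation), using scikit-learn's LinearSVC on difference vectors; the function name and interface are hypothetical:

def train_ranking_svm(X, y, doc_ids):
    """X: candidate feature matrix; y: 1 if gold keyphrase else 0;
    doc_ids: document id per candidate. Returns a candidate-scoring function."""
    import itertools
    import numpy as np
    from sklearn.svm import LinearSVC

    X, y, doc_ids = np.asarray(X), np.asarray(y), np.asarray(doc_ids)
    diffs, labels = [], []
    # build pairwise training examples from keyphrase/non-keyphrase
    # pairs of candidates within the same document
    for doc_id in np.unique(doc_ids):
        idx = np.where(doc_ids == doc_id)[0]
        for i, j in itertools.combinations(idx, 2):
            if y[i] == y[j]:
                continue  # no preference between equally-labeled candidates
            diffs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)
    svm = LinearSVC().fit(np.asarray(diffs), np.asarray(labels))
    # candidates are ranked by their projection onto the learned weight vector
    return lambda X_new: np.asarray(X_new).dot(svm.coef_.ravel())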
All of the code shown here has been pared down and simplified for demonstration purposes. Adding extensive candidate cleaning, filtering, case/syntactic normalization, and de-duplication can dramatically reduce noise and improve results, as can incorporating additional features and external resources into the keyphrase selection algorithm. Furthermore, although all of these methods were presented in the context of single-document keyphrase extraction, there are ways to extract keyphrases from multiple documents and thus categorize/cluster/summarize/index/conceptualize entire corpora. This really is just an introduction to an ongoing challenge in natural language processing research.

On a final note, just for kicks, here are the top 50 keyphrases from my long-neglected Thomas Friedman corpus:

keyphrase                                  score
------------------------------------------------
United States........................... 393.736
Bush Administration..................... 390.941
Administration.......................... 310.831
Israel.................................. 256.609
Palestine Liberation Organization....... 256.148
Middle East............................. 182.171
President Bush.......................... 170.812
Clinton................................. 166.669
Administration officials................ 164.812
Clinton Administration.................. 158.695
Lebanon................................. 150.466
Baker................................... 141.051
President Clinton....................... 138.680
Secretary of State...................... 135.036
Soviet Union............................ 133.976
West Bank............................... 128.490
Palestinian............................. 121.275
State Department........................ 107.860
Washington.............................. 102.507
Prime Minister.......................... 101.282
Saudi Arabia............................ 100.062
White House............................. 83.649
Beirut.................................. 81.046
Reagan Administration................... 80.338
Israeli officials....................... 80.119
Yasir Arafat............................ 79.334
Israeli................................. 74.289
Israeli Army............................ 70.909
China................................... 69.484
Saddam Hussein.......................... 68.645
United Nations.......................... 63.641
President............................... 62.621
America................................. 59.833
foreign policy.......................... 59.444
Bush.................................... 56.545
Lebanese Army........................... 55.878
Arafat.................................. 53.848
American officials...................... 52.991
President Obama......................... 51.924
Iraq.................................... 51.896
peace conference........................ 51.549
Bill Clinton............................ 50.438
west Beirut............................. 49.418
Jerusalem............................... 48.914
Israeli Government...................... 47.910
Gorbachev............................... 44.902
Syria................................... 44.791
Administration official................. 41.743
Palestinian guerrillas.................. 39.849
Lebanese................................ 39.299

Looks like a who's who of U.S. politics and international relations over the past 30 years. Not too shabby, Friedman!

© 2014 Burton DeWilde. All rights reserved.