bdewilde.github.io - Friedman Corpus (3) — Occurrence and Dispersion
Search Preview

Friedman Corpus (3) — Occurrence and Dispersion

bdewilde.github.io
data scientist / physicist / filmmaker

SEO audit: Content analysis

Language: Error! No language localisation found.
Title: Friedman Corpus (3) — Occurrence and Dispersion
Text / HTML ratio: 61 %
Frame: Excellent! The website does not use iFrame solutions.
Flash: Excellent! The website does not have any Flash content.
Keywords cloud: words, corpus, frequencies, dispersion, frequency, Friedman, word, counts, values, adjusted, parts, absolute, DP, distributions, text, linguistics, Friedman's, time, elements, plot
Keywords consistency (occurrences in content)
words: 25
corpus: 23
frequencies: 14
dispersion: 12
frequency: 10
Friedman: 9
Headings
H1: 1, H2: 0, H3: 2, H4: 0, H5: 0, H6: 0
Images: We found 7 images on this web page.

SEO Keywords (Single)

Keyword Occurrence Density
words 25 1.25 %
corpus 23 1.15 %
frequencies 14 0.70 %
dispersion 12 0.60 %
frequency 10 0.50 %
Friedman 9 0.45 %
word 9 0.45 %
counts 8 0.40 %
values 7 0.35 %
adjusted 7 0.35 %
parts 7 0.35 %
absolute 5 0.25 %
DP 5 0.25 %
distributions 5 0.25 %
text 5 0.25 %
linguistics 5 0.25 %
Friedman’s 5 0.25 %
time 4 0.20 %
elements 4 0.20 %
plot 4 0.20 %

SEO Keywords (Two Word)

Keyword Occurrence Density
in the 9 0.45 %
of the 8 0.40 %
corpus parts 7 0.35 %
n corpus 6 0.30 %
the n 6 0.30 %
of words 5 0.25 %
corpus linguistics 5 0.25 %
for the 4 0.20 %
adjusted frequencies 4 0.20 %
number of 4 0.20 %
but I 4 0.20 %
frequency distributions 4 0.20 %
that the 4 0.20 %
Friedman corpus 4 0.20 %
of a 4 0.20 %
want to 4 0.20 %
linguistic elements 3 0.15 %
but not 3 0.15 %
the most 3 0.15 %
You can 3 0.15 %

SEO Keywords (Three Word)

Keyword Occurrence Density Possible Spam
the n corpus 6 0.30 % No
n corpus parts 6 0.30 % No
of the n 3 0.15 % No
conditional frequency distributions 3 0.15 % No
in corpus linguistics 3 0.15 % No
against the overall 2 0.10 % No
normalized against the 2 0.10 % No
are normalized against 2 0.10 % No
which are normalized 2 0.10 % No
words in the 2 0.10 % No
expect given the 2 0.10 % No
to refer to 2 0.10 % No
would expect given 2 0.10 % No
measure of dispersion 2 0.10 % No
most of the 2 0.10 % No
given the sizes 2 0.10 % No
one would expect 2 0.10 % No
my Friedman corpus 2 0.10 % No
values close to 2 0.10 % No
indicate that a 2 0.10 % No

SEO Keywords (Four Word)

Keyword Occurrence Density Possible Spam
the n corpus parts 6 0.30 % No
of the n corpus 3 0.15 % No
indicate that a is 2 0.10 % No
that a is distributed 2 0.10 % No
a is distributed across 2 0.10 % No
is distributed across the 2 0.10 % No
distributed across the n 2 0.10 % No
across the n corpus 2 0.10 % No
would expect given the 2 0.10 % No
one would expect given 2 0.10 % No
expect given the sizes 2 0.10 % No
given the sizes of 2 0.10 % No
the sizes of the 2 0.10 % No
sizes of the n 2 0.10 % No
normalized against the overall 2 0.10 % No
which are normalized against 2 0.10 % No
can see that the 2 0.10 % No
are normalized against the 2 0.10 % No
You can see that 2 0.10 % No
the total number of 2 0.10 % No

Internal links in bdewilde.github.io

About Me
Archive
Intro to Automatic Keyphrase Extraction
On Starting Over with Jekyll
Friedman Corpus (3) — Occurrence and Dispersion
Friedman Corpus (1) — Background and Creation
Friedman Corpus (2) — Data Quality and Corpus Stats
While I Was Away
Intro to Natural Language Processing (2)
Intro to Natural Language Processing (1)
A Data Science Education?
Connecting to the Data Set
Data, Data, Everywhere
Burton DeWilde

Bdewilde.github.io HTML content


Friedman Corpus (3) — Occurrence and Dispersion
2013-11-03 | corpus linguistics, dispersion, natural language processing, occurrence, Thomas Friedman

Thus far, I've pseudo-justified why a collection of NYT articles by Thomas Friedman would be interesting to study, actually compiled/scraped the text and metadata (see the Background and Creation post), improved and verified the quality of the data, and computed a handful of simple, corpus-level statistics (see the Data Quality and Corpus Stats post). Now, onward to actual natural language analysis!

Occurrence

I would argue that the frequency of occurrence of words and other linguistic elements is the fundamental measure on which much of NLP is based. In essence, we want to answer "How many times did something occur?" in both absolute and relative terms. Since words are probably the most familiar "linguistic elements" of a language, I focused on word occurrence; however, other elements may also merit counting, including morphemes ("bits of words") and parts of speech (nouns, verbs, ...).

Note: In the past I've been confused by the terminology used for absolute and relative frequencies; it's used inconsistently in the literature. I use count to refer to absolute frequencies (whole, positive numbers: 1, 2, 3, ...) and frequency to refer to relative frequencies (rational numbers between 0.0 and 1.0). These definitions sweep certain complications under the rug, but I don't want to get into that right now...

Anyway, in order to count individual words, I had to split the corpus text into a list of its component words. I've discussed tokenization before, so I won't go into details. Given that I scraped this text from the web, though, I should note that I cleaned it up a bit before tokenizing: namely, I decoded any HTML entities; removed all HTML markup, URLs, and non-ASCII characters; and normalized white-space. Perhaps controversially, I also unpacked contractions (e.g., "don't" => "do not") in an effort to avoid the weird tokens that crop up around apostrophes (e.g., "don" + "'" + "t" or "don" + "'t"). Since any mistakes in tokenization propagate to results downstream, it's probably best to use a "standard" tokenizer rather than something homemade; I've found NLTK's defaults to be good enough (usually). Here's some sample code:

    from itertools import chain
    from nltk import clean_html, sent_tokenize, word_tokenize

    # combine all articles into a single block of text
    all_text = ' '.join([doc['full_text'] for doc in docs])
    # partial cleaning as example: this uses nltk to strip residual HTML markup
    cleaned_text = clean_html(all_text)
    # tokenize text into sentences, and sentences into words
    tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(cleaned_text)]
    # flatten the list of lists into a single list of words
    all_words = list(chain(*tokenized_text))

Now I had one last set of decisions to make: Which words do I want to count? It depends on what you want to do, of course! For example, this article explains how filtering for and studying certain words helped computational linguists identify J.K. Rowling as the person behind the pseudonym Robert Galbraith. In my case, I just wanted to get a general feeling for the meaningful words Friedman has used the most. So, I filtered out stop words and punctuation tokens, and I lowercased all letters, but I did not stem or lemmatize the words; the total number of words dropped from 2.96M to 1.43M.
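For concreteness, the filtering might look something like the sketch below. This is my own reconstruction under stated assumptions (NLTK's English stop word list and a punctuation-only check), not necessarily the exact filters behind the 2.96M-to-1.43M reduction:

    # Sketch of the filtering step described above: lowercase everything,
    # then drop stop words and punctuation-only tokens. Assumes all_words
    # from the previous snippet; the post's actual filters may differ.
    import string
    from nltk.corpus import stopwords  # may require nltk.download('stopwords')

    stops = set(stopwords.words('english'))
    punct = set(string.punctuation)

    good_words = [
        word.lower() for word in all_words
        if word.lower() not in stops                   # drop stop words
        and not all(char in punct for char in word)    # drop punctuation-only tokens
    ]

    print(len(all_words), len(good_words))  # roughly 2.96M -> 1.43M in the post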
I then used NLTK's handy FreqDist() class to get counts by word. Here are both counts and frequencies for the top 30 "good" words in my Friedman corpus: You can see that the two distributions are identical except for the y-axis values: as discussed above, counts are the absolute number of occurrences of each word, while frequencies are those counts divided by the total number of words in the corpus. It's interesting but not particularly surprising that Friedman's top two meaningful words are mr. and said; he's a journalist, after all, and he's quoted a lot of people. (Perhaps he met them on the way to/from a foreign airport...) Given what we know about Friedman's career (as discussed in (1)), most of the other top words also sound about right: Israel/Israeli, president, American, people, world, Bush, ...

On a lark, I compared word counts for the five presidents who have held office during Friedman's NYT career: Ronald Reagan, George H.W. Bush, Bill Clinton, George W. Bush, and Barack Obama:

    "reagan": 761
    "bush": 3582
    "clinton": 2741
    "obama": 964

Yes, the two Bushes got combined, and Hillary is definitely contaminating Bill's counts (I didn't feel like doing reference disambiguation on this, sorry!).

I find it more interesting to plot conditional frequency distributions, i.e. a set of frequency distributions, one for each value of some condition. So, taking the article's year of publication as the condition, I produced this plot of presidential mentions by year: Nice! You can clearly see frequencies peaking during a given president's term(s), which makes sense. Plus, they show Friedman's change in focus over time: early on, he covered Middle Eastern conflict, not the American presidency; in 1994, a year in which Clinton was mentioned particularly frequently, Friedman was specifically covering the White House. I'm tempted to read further into the data, such as the long tail of W. Bush mentions throughout (and beyond) his second term possibly indicating his slide into irrelevance, but I shouldn't without first inspecting context. Some other time, perhaps.

I made a few other conditional frequency distributions using NLTK's ConditionalFreqDist() class, just for kicks. Here are two, presented without comment (only hints of a raised eyebrow on the author's part).
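For reference, here is roughly how such a conditional frequency distribution can be built with NLTK. It is only a sketch: docs is the same list of article dicts used above, the 'pub_year' key is a hypothetical stand-in for however the publication year is actually stored in the corpus metadata, and the tokenization is simplified relative to the pipeline above:

    # Sketch of a presidents-by-year conditional frequency distribution.
    # 'pub_year' is a hypothetical metadata key, not one defined in the post.
    from nltk import ConditionalFreqDist, word_tokenize

    presidents = {'reagan', 'bush', 'clinton', 'obama'}
    cfd = ConditionalFreqDist(
        (word, doc['pub_year'])   # condition = president, sample = year
        for doc in docs
        for word in word_tokenize(doc['full_text'].lower())
        if word in presidents
    )
    cfd.plot()  # one line per president, counts by year (requires matplotlib)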
These plots-over-time lead naturally into the concept of dispersion.

Dispersion

Although frequencies of (co-)occurrence are fundamental and ubiquitous in corpus linguistics, they are potentially misleading unless one also gives a measure of dispersion, i.e. the spread or variability of a distribution of values. It's Statistics 101: you shouldn't report a central value without an associated dispersion! Counts and frequencies of words or other linguistic elements are often used to indicate importance in a corpus or language, but consider a corpus in which two words have the same counts, only the first word occurs in 99% of corpus documents while the second occurs in just 5%. Which word is "more important"? And how should we interpret subsequent statistics based on these frequencies if the second word's high value is unrepresentative of most of the corpus? In the case of my Friedman corpus, the conditional frequency distributions over time (above) visualize those terms' dispersions to a certain extent. But we can do more.

As it turns out, NLTK includes a small module to plot dispersion, like so:

    from nltk.draw import dispersion_plot

    dispersion_plot(all_words, ['reagan', 'bush', 'clinton', 'obama'], ignore_case=True)

To be honest, I'm not even sure how to interpret this plot. For starters, why does Obama appear at what I think is the beginning of the corpus?! Clearly, it would be nice to quantify dispersion as, like, a single scalar value. Many dispersion measures have been proposed over the years (see [1] for a nice overview), but in the context of linguistic elements, most are poorly known, little studied, and suffer from a variety of statistical shortcomings. Also in [1], the author proposes an alternative, conceptually simple measure of dispersion called DP, for "deviation of proportions", whose derivation he gives as follows:

1. Determine the sizes s of each of the n corpus parts (documents), which are normalized against the overall corpus size and correspond to expected percentages that take differently-sized corpus parts into consideration.
2. Determine the frequencies v with which word a occurs in the n corpus parts, which are normalized against the overall number of occurrences of a and correspond to observed percentages.
3. Compute all n pairwise absolute differences of observed and expected percentages, sum them up, and divide the result by two.

The result is DP, which can theoretically range from approximately 0 to 1: values close to 0 indicate that a is distributed across the n corpus parts as one would expect given the sizes of those parts, while values close to 1 indicate that a is distributed across the n corpus parts in exactly the opposite way one would expect given their sizes. Sounds reasonable to me! (Read the cited paper if you disagree; I found it very convincing.)

Using this definition, I calculated DP values for all words in the Friedman corpus and plotted those values against their respective counts: As expected, the most frequent words tend to have lower DP values (i.e. to be more evenly distributed in the corpus), and vice versa; however, note the wide spread in DP for a fixed count, particularly in the middle range. Many words are definitely distributed unevenly in the Friedman corpus!

A common, but not entirely ideal, way to account for dispersion in corpus linguistics is to compute the adjusted frequency of words, which is often just frequency multiplied by dispersion. (Other definitions exist, but I won't get into it.) Such adjusted frequencies are by definition some fraction of the raw frequency, and words with low dispersion are penalized more than those with high dispersion. Here, I plotted the frequencies and adjusted frequencies of Friedman's top 30 words from before: You can see that the rankings would change if I used adjusted frequency to order the words! This difference can be quantified with, say, a Spearman correlation coefficient, for which a value of 1.0 indicates identical rankings and -1.0 indicates exactly opposite rankings. I calculated a value of 0.89 for frequency ranks vs adjusted-frequency ranks: similar, but not the same! It's clear that the effect of (under-)dispersion should not be ignored in corpus linguistics. My big issue with adjusted frequencies is that they are more difficult to interpret: What, exactly, does frequency*dispersion actually mean? What units go with those values? Maybe smarter people than I will come up with a better measure.
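To make the three-step recipe concrete, here is a small sketch of DP for a single word, plus one plausible reading of "frequency multiplied by dispersion". It assumes a hypothetical corpus_parts variable (one token list per article), and the (1 - DP) scaling is my own assumption rather than necessarily the adjustment used for the plot above:

    # Sketch of Gries's DP (deviation of proportions), following the recipe
    # above. corpus_parts is a hypothetical list of token lists, one per
    # article; it is not defined in the post's own snippets.
    def dp(word, corpus_parts):
        total_tokens = sum(len(part) for part in corpus_parts)
        total_count = sum(part.count(word) for part in corpus_parts)  # assumes word occurs at least once
        diff_sum = 0.0
        for part in corpus_parts:
            expected = len(part) / total_tokens        # s_i: relative size of this part
            observed = part.count(word) / total_count  # v_i: share of the word's occurrences here
            diff_sum += abs(observed - expected)
        return diff_sum / 2  # ~0 = spread as expected, ~1 = maximally concentrated

    # One plausible "adjusted frequency": raw relative frequency scaled by
    # (1 - DP), so unevenly dispersed words are penalized more. This exact
    # formula is an assumption, not a quote from the post.
    def adjusted_frequency(word, corpus_parts):
        total_tokens = sum(len(part) for part in corpus_parts)
        total_count = sum(part.count(word) for part in corpus_parts)
        return (total_count / total_tokens) * (1 - dp(word, corpus_parts))

Ranking words by raw frequency and by an adjusted frequency like this, then comparing the two rankings with a Spearman correlation (e.g. scipy.stats.spearmanr), gives the kind of comparison described above.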
Well, I'd meant to include word co-occurrence in this post, but it's already too long. Congratulations on making it all the way through! :) Next time, then, I'll get into bigrams/trigrams/n-grams and association measures. And after that, I get to the fun stuff!

[1] Gries, Stefan Th. "Dispersions and adjusted frequencies in corpora." International Journal of Corpus Linguistics 13.4 (2008): 403-437.