bdewilde.github.io - Data Quality and Corpus Stats
Burton DeWilde · data scientist / physicist / filmmaker

Friedman Corpus (2) — Data Quality and Corpus Stats
2013-10-20
Tags: corpus linguistics, data quality, domain expertise, metadata, Thomas Friedman

With a full-text Friedman corpus finally in hand (see the Background and Creation post), my first task was to verify data quality. Given "Garbage In, Garbage Out", the fun stuff (analysis! plots! Friedman_ebooks?!) had to wait. Yes, it's a pain in the ass, but this step is really important.

Data Quality

Since v2 of the NYT Article Search API was unfamiliar to me (they changed enough from v1 that my old code no longer ran), I used a bare-bones search query, "Thomas L. Friedman", without filtering. This was a mistake. As I should have expected beforehand, I actually retrieved all articles mentioning Friedman anywhere in the headline, byline, or body text, instead of only those articles written by Friedman. So I got many results like this:

    By MAUREEN DOWD; Thomas L. Friedman is on leave until October, writing a book

    By Fareed Zakaria: In the global economy, says Thomas L. Friedman, intellectual work could be transmitted to intellectual workers anywhere on earth.

    To the Editor: Thomas L. Friedman (column, Jan. 5) says he has ''no problem with a war for oil,'' granting certain provisions. No problem with killing or maiming innocent civilians for oil?

Although I'm quite curious about the many Letters to the Editor taking Friedman to task, such text doesn't belong in a Friedman-only corpus, nor does Maureen Dowd's tart wordplay or Fareed Zakaria's whatever-it-is-that-he-writes. I also noticed that I wasn't able to get the article text for ~1300 results on account of missing/broken URLs in the API response and weird/broken HTML at the given URL (no parser is perfect), rendering them effectively useless in a collection of Friedman text.
As it turned out, almost all of those without full text were neither news nor op-ed articles:

    >>> df['type_of_material'][df['full_text'].isnull()].value_counts()
    Summary                441
    Letter                 348
    Blog                   186
    List                   168
    Op-Ed                   99
    News                    85
    Editors' Note            7
    Schedule                 5
    Obituary; Biography      2
    Editorial                2
    Article                  1
    Interview                1
    Review                   1
    Obituary                 1

Wait a sec, why is an obituary in here? Friedman is (physically, if not intellectually) alive and well! See for yourself: this was definitely cruft, as were many of the other results. And they shouldn't be in there. So, I filtered for articles actually written by Thomas L. Friedman for which I had managed to scrape the full text. After imposing this important requirement, the type_of_material distribution looked much better:

    >>> df['type_of_material'].value_counts()
    News                              1757
    Op-Ed                             1640
    An Analysis; News Analysis          96
    An Analysis                         53
    Series                              11
    Biography                           10
    Special Report                       3
    Interview                            3
    An Analysis; Economic Analysis       2
    Editorial                            2
    Review                               2
    Chronology                           1
    Op-Ed; Series                        1
    Biography; Series                    1
    Special Report; Chronology           1

Roughly half news, half op-eds, with a smattering of analyses and such over the years. Sounds like Friedman! As a final sanity check, though, I wanted to see how the above distribution was spread over time. So, I grouped results by year of publication and type of material, then plotted them together using matplotlib (Python's de facto standard plotting library) and, just for kicks, prettyplotlib (a recently released package that makes plots pretty). Here's what I got:

[Plot: number of articles per year, broken out by type of material.]

It is indeed pretty, but does it make sense? Yes, if you know a bit about Friedman's career at The New York Times. [Insert comment about how domain expertise matters in data science, à la Drew Conway's venn diagram…] Friedman was hired in 1981 and sent to Beirut to cover the Lebanese Civil War; he won a Pulitzer Prize for his wartime coverage in 1983. The following year he was transferred to Jerusalem, where he served as Bureau Chief until 1988.
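That filtering step can be sketched in pandas with toy data; note that the byline and full_text column names here are assumptions for illustration, not necessarily the API's actual field names:

```python
import pandas as pd

# Toy stand-in for the scraped article metadata.
df = pd.DataFrame({
    "byline": ["By THOMAS L FRIEDMAN", "By MAUREEN DOWD", "By THOMAS L FRIEDMAN"],
    "full_text": ["Column text...", "Friedman is on leave...", None],
})

# Keep only rows actually bylined by Friedman AND with scraped full text.
mask = (
    df["byline"].str.contains("THOMAS L FRIEDMAN", case=False)
    & df["full_text"].notnull()
)
friedman_df = df[mask]
```

With these toy rows, only the first survives: the Dowd row fails the byline check, and the third row has no scraped text.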
In that year, he won another Pulitzer for his reporting on international affairs, and wrote a book about it. Friedman moved on to American foreign policy, George Bush's Secretary of State, and then the White House itself. In 1995 he became a foreign affairs columnist writing in the Op-Ed section. In 2002 he won yet another Pulitzer, this one for his commentary on the global threat posed by terrorism. And he's been yammering away ever since.

The big change from News to Op-Ed is evident in the plot, but what's with the lack of articles in 1988? I saw nothing wrong in the data, so it may be that Friedman was simply too busy receiving Pulitzers and writing his first book to report the news that year. *shrug*

I also wondered about the overall number of articles, so I did a back-of-the-envelope calculation: Given that he's a twice-weekly columnist (allowing for holidays/vacations), we'd expect upwards of 100 op-eds per year. Indeed, that is roughly what we see. He was especially productive in 2012, probably owing to a presidential election shitstorm, but seems on track for an average year in 2013.

Reasonably confident that I'd covered most of Friedman's work at the NYT and that all my documents were what I thought they were, I started to dig deeper.

Corpus Stats

Before diving into natural language processing of the text, I wanted to explore the data at a corpus-wide scale. I already checked the number of articles by type and by year to verify data quality, but what else was there? As I mentioned in Pt. 1, the NYT API includes lots of metadata with articles. The keywords field is a list of subjects and entities (locations, people, organizations) included in a given article; aggregating counts from all such lists would probably give a good idea of what Friedman has been writing about all these years, right?
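The group-by-year-and-type counting behind the earlier plot might look roughly like this (a minimal sketch with toy records; the real data frame has a full publication date and type_of_material per article):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Toy records standing in for the real article metadata.
df = pd.DataFrame({
    "pub_date": pd.to_datetime(["1983-05-01", "1983-06-02", "1996-01-10"]),
    "type_of_material": ["News", "News", "Op-Ed"],
})

# Count articles per (year, type), then pivot types into columns.
counts = (
    df.groupby([df["pub_date"].dt.year, "type_of_material"])
    .size()
    .unstack(fill_value=0)
)

# One bar per year, stacked by type of material.
counts.plot(kind="bar", stacked=True)
plt.savefig("articles_by_year_and_type.png")
```

The same pivoted counts drive any styling layer (prettyplotlib simply wraps matplotlib's defaults).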
To accomplish this, I used a handy datatype in Python's collections module:

    from collections import Counter

    glocations = []; persons = []; subjects = []; organizations = []
    for doc in friedman_docs:
        if not doc.get('keywords'):
            continue
        for keyword in doc['keywords']:
            if keyword['name'] == 'glocations':
                glocations.append(keyword.get('value'))
            elif keyword['name'] == 'persons':
                persons.append(keyword.get('value'))
            elif keyword['name'] == 'subject':
                subjects.append(keyword.get('value'))
            elif keyword['name'] == 'organizations':
                organizations.append(keyword.get('value'))

    glocations_counter = Counter(glocations)
    persons_counter = Counter(persons)
    subjects_counter = Counter(subjects)
    organizations_counter = Counter(organizations)

For example, here are Friedman's top ten subjects, given as NAME (count):

    UNITED STATES INTERNATIONAL RELATIONS (1396)
    INTERNATIONAL RELATIONS (605)
    PALESTINIANS (591)
    UNITED STATES ARMAMENT AND DEFENSE (496)
    POLITICS AND GOVERNMENT (425)
    ARMAMENT, DEFENSE AND MILITARY FORCES (420)
    TERRORISM (333)
    ECONOMIC CONDITIONS AND TRENDS (286)
    INTERNATIONAL TRADE AND WORLD MARKET (226)
    CIVIL WAR AND GUERRILLA WARFARE (174)

Considering his bio, this looks totally reasonable, if a bit depressing. If you're curious, his top locations were the Middle East, Israel, and Lebanon (which is not at all surprising), and his top organizations were the U.N., NATO, and the Palestine Liberation Organization, followed distantly by the Republican and Democratic Parties. On a lark, I made a pie chart of the equivalent persons keywords, where the percentages equal the number of times Friedman has mentioned a given person divided by the total number of people-mentions (multiplied by 100). In the top ten you see the usual suspects: current and former presidents, George Bush's Secretary of State (Mr. Baker), Middle Eastern heads of state, and Gorbachev, which together account for almost 50% of all mentions.
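The top-ten lists above fall straight out of the counters: Counter.most_common(n) returns the n highest-count (value, count) pairs in descending order. A toy example:

```python
from collections import Counter

# Toy keyword values standing in for the real 'subjects' list.
subjects = (
    ["INTERNATIONAL RELATIONS"] * 3
    + ["TERRORISM"] * 2
    + ["PALESTINIANS"]
)
subjects_counter = Counter(subjects)

# The two most frequent (value, count) pairs, most frequent first.
top_subjects = subjects_counter.most_common(2)
```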
The other half, "EVERYONE ELSE", is a multitude whose 920 wedges can't be visualized like this. So much for pie charts!

Last but not least, here are some super simple stats for the Friedman corpus text:

    number of articles: 3,584
    number of sentences: 115k
    number of words: 2.96M
    number of unique words: 71.9k
    average sentence length: 24.9 words
    average word length: 4.81 letters
    average Flesch-Kincaid grade level: 11.8

Next time, I (finally!) get to what I consider the fundamental measures of corpus linguistics: word occurrence, word co-occurrence, and word dispersion. And more.

Burton DeWilde
data scientist / physicist / filmmaker
© 2014 Burton DeWilde. All rights reserved.