bdewilde.github.io - Background and Creation
Search Preview

Friedman Corpus (1) — Background and Creation

bdewilde.github.io
data scientist / physicist / filmmaker

SEO audit: Content analysis

Language: Error! No language localisation found.
Title Friedman Corpus (1) — Background and Creation
Text / HTML ratio 63 %
Frame Excellent! The website does not use iFrame solutions.
Flash Excellent! The website does not have any flash contents.
Keywords cloud corpus Friedman web York data API language full Thomas English published Friedman's articles = article genres Times Corpus corpora
Keywords consistency
Keyword Content Title Description Headings
corpus 13
Friedman 8
web 7
York 6
data 6
API 6
Headings
H1 H2 H3 H4 H5 H6
1 0 1 0 0 0
Images We found 0 images on this web page.

SEO Keywords (Single)

Keyword Occurrence Density
corpus 13 0.65 %
Friedman 8 0.40 %
web 7 0.35 %
York 6 0.30 %
data 6 0.30 %
API 6 0.30 %
language 6 0.30 %
full 5 0.25 %
Thomas 5 0.25 %
English 5 0.25 %
published 4 0.20 %
Friedman's 4 0.20 %
articles 4 0.20 %
= 4 0.20 %
article 4 0.20 %
genres 4 0.20 %
Times 4 0.20 %
Corpus 4 0.20 %
corpora 4 0.20 %

SEO Keywords (Two Word)

Keyword Occurrence Density
New York 6 0.30 %
in the 5 0.25 %
of the 5 0.25 %
the URL 4 0.20 %
York Times 3 0.15 %
to scrape 3 0.15 %
all of 3 0.15 %
over the 3 0.15 %
corpus of 3 0.15 %
in a 3 0.15 %
variety of 3 0.15 %
a variety 3 0.15 %
to be 3 0.15 %
makes sense 3 0.15 %
Burton DeWilde 2 0.10 %
Randomize the 2 0.10 %
a wide 2 0.10 %
wide range 2 0.10 %
range of 2 0.10 %
pprint import 2 0.10 %

SEO Keywords (Three Word)

Keyword Occurrence Density Possible Spam
New York Times 3 0.15 % No
a variety of 3 0.15 % No
managed to scrape 2 0.10 % No
a great way 2 0.10 % No
corpus the first 2 0.10 % No
great way to 2 0.10 % No
Cruz Co were 2 0.10 % No
I managed to 2 0.10 % No
makes sense — 2 0.10 % No
u'US Fringe Festival' 2 0.10 % No
were to succeed 2 0.10 % No
Ted Cruz Co 2 0.10 % No
over the years 2 0.10 % No
if Ted Cruz 2 0.10 % No
u'What if Ted 2 0.10 % No
the URL for 2 0.10 % No
all of Thomas 2 0.10 % No
Co were to 2 0.10 % No
the New York 2 0.10 % No
to succeed in 2 0.10 % No

SEO Keywords (Four Word)

Keyword Occurrence Density Possible Spam
in the shutdown showdown?' 2 0.10 % No
if Ted Cruz Co 2 0.10 % No
succeed in the shutdown 2 0.10 % No
to succeed in the 2 0.10 % No
I managed to scrape 2 0.10 % No
were to succeed in 2 0.10 % No
Co were to succeed 2 0.10 % No
Cruz Co were to 2 0.10 % No
Ted Cruz Co were 2 0.10 % No
u'What if Ted Cruz 2 0.10 % No
a wide range of 2 0.10 % No
a great way to 2 0.10 % No
u'lead_paragraph' u'What if Ted 1 0.05 % No
u'keywords' u'lead_paragraph' u'What if 1 0.05 % No
Festival' u'keywords' u'lead_paragraph' u'What 1 0.05 % No
Fringe Festival' u'keywords' u'lead_paragraph' 1 0.05 % No
u'US Fringe Festival' u'keywords' 1 0.05 % No
u'print_headline' u'US Fringe Festival' 1 0.05 % No
Festival' u'print_headline' u'US Fringe 1 0.05 % No
Fringe Festival' u'print_headline' u'US 1 0.05 % No

Internal links in - bdewilde.github.io

About Me
Archive
Intro to Automatic Keyphrase Extraction
On Starting Over with Jekyll
Friedman Corpus (3) — Occurrence and Dispersion
Friedman Corpus (1) — Background and Creation
Friedman Corpus (2) — Data Quality and Corpus Stats
While I Was Away
Intro to Natural Language Processing (2)
Intro to Natural Language Processing (1)
A Data Science Education?
Connecting to the Data Set
Data, Data, Everywhere
Burton DeWilde

Bdewilde.github.io Spun HTML


Friedman Corpus (1) — Background and Creation
2013-10-15
Tags: APIs, corpora, corpus linguistics, natural language processing, Thomas Friedman, web scraping

Much work in Natural Language Processing (NLP) begins with a large collection of text documents, called a corpus, that represents a written sample of language in a particular domain of study. Corpora come in a variety of flavors: mono- or multi-lingual; category-specific or a representative sampling from a variety of categories, e.g. genres, authors, time periods; simply "plain" text or annotated with extra linguistic information, e.g. part-of-speech tags, full parse trees; and so on. They allow for hypothesis testing and statistical analysis of natural language, but one must be very cautious about applying results derived from a given corpus to other domains.

Many notable corpora have been created over the years, including the following:

Brown corpus: the first 1-million-word electronic corpus of English, consisting of 500 texts spread across 15 genres in proportion to the amount published in America, ca. 1961
Gutenberg corpus: the first and largest single collection of electronic books (approximately 50k), spanning a wide range of genres and authors
British National Corpus (BNC): a 100M-word, general-purpose corpus of written and spoken British English representing late 20th-century usage
Corpus of Contemporary American English (COCA): the largest (and freely-searchable!) corpus of American English currently available, containing 450M words published since 1990 in fiction, newspapers, magazines, academic journals, and spoken word
Google Books: a mind-blowing 300+ billion words from millions of books published since 1500 in multiple languages (mostly English), with an n-gram searching interface for longitudinal comparisons of language use

For downloading and standardized interfacing with a variety of small-ish corpora, check out NLTK's corpus module. Brigham Young University has a web portal with fancy search functionality for several corpora, although their interface is clunky. Others can be hunted down via Google or a list like this.

But what if you wanted to study a particular subject, author, language, etc. for which a corpus hasn't already been made available? What if, for example, you're really interested in Thomas L. Friedman of the New York Times? He's been writing continuously for the past 30+ years, in a couple of different genres, about a wide range of contemporary issues, and all of these writings are available online and already annotated with metadata. Sounds totally compelling, right? Well, if you were me, you would build your own corpus. A Friedman corpus.

Corpus Creation

Unlike most newspapers, the NYT has an awesome set of APIs for accessing their data, integrating it into new applications, and otherwise applying it to novel purposes. Specifically, I used the Article Search API v2 to find all of Thomas Friedman's articles over the years. An API Request Tool can be used to test out queries quickly, although a word of warning: since upgrading to v2 of Article Search, documentation and tool stability have been lacking…

from pprint import pprint
import requests

# fill in your api key here...
my_api_key = 'YOUR_API_KEY'

# parameters specifying what data is to be returned
fields = ['web_url', 'snippet', 'lead_paragraph', 'abstract', 'print_page',
          'blog', 'source', 'multimedia', 'headline', 'keywords', 'pub_date',
          'document_type', 'news_desk', 'byline', 'type_of_material', '_id',
          'word_count']
facet_fields = ['source', 'section_name', 'document_type', 'type_of_material',
                'day_of_week']

# GET request to the server
resp = requests.get('http://api.nytimes.com/svc/search/v2/articlesearch.json',
                    params={'q': 'Thomas L. Friedman',
                            'page': 0,
                            'sort': 'newest',
                            'fl': ','.join(fields),
                            'facet_field': ','.join(facet_fields),
                            'facet_filter': 'true',
                            'api-key': my_api_key})

# check out all teh dataz
pprint(resp.json())
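The nested JSON unpacks straightforwardly. Here's a minimal sketch of doing so; it assumes the v2 response envelope I saw (a top-level 'response' object holding 'docs' and 'meta'), which may shift as the API evolves:

# minimal sketch of unpacking the response above; assumes the v2 envelope
# with data['response']['docs'] and data['response']['meta']['hits']
data = resp.json()
if data.get('status') == 'OK':
    docs = data['response']['docs']  # list of per-article metadata dicts
    hits = data['response']['meta']['hits']  # total number of matching articles
    print('{0} total hits; {1} docs on this page'.format(hits, len(docs)))
    for doc in docs:
        print('{0}  {1}'.format(doc['pub_date'], doc['headline']['main']))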
The API provides extra content, depending on the parameters passed to it; a particularly useful one is the facets field, which lets you explore NYT-specific categories and subsets of the data returned by your keyword-based search query. Using Python's built-in str.format() method, I printed out a nice display of the facets for this query:

facet                           count
-------------------------------------
type_of_material.............. 6847
    News....................... 1977
    Op-Ed...................... 1912
    Letter..................... 1089
    Summary.................... 1073
    List....................... 304
source........................ 6847
    The New York Times......... 6841
    ........................... 3
    International Herald Tribu* 2
    CNBC....................... 1
document_type................. 6853
    article.................... 6651
    blogpost................... 187
    multimedia................. 15
section_name.................. 6838
    Opinion.................... 3119
    New York and Region........ 1097
    World...................... 553
    World; Washington.......... 483
    Arts; Books................ 361
day_of_week................... 6853
    Sunday..................... 1755
    Wednesday.................. 1559
    Friday..................... 1049
    Tuesday.................... 833
    Thursday................... 748

Hm. It looks like the dataset is mostly news and op-eds (makes sense), published as New York Times articles (also makes sense), in the Opinion and NY/World sections of the paper (again, makes sense). I don't know enough to assess the distribution over days of the week or whether the other material types are appropriate for Friedman (Spoiler Alert: I should've checked!), but this seems plausible enough.

Downloading from the API is, by design, meant to be straightforward; it's mostly just looping over the page parameter passed through the URL and aggregating the results, as sketched below.
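Here's a minimal sketch of that loop, assuming ten docs per page (the default page size I observed) and reusing my_api_key from above; error handling and the API's rate and paging limits are glossed over:

import math
import time

import requests

base_url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json'
params = {'q': 'Thomas L. Friedman', 'sort': 'newest', 'api-key': my_api_key}

# first request just to learn the total number of hits
first = requests.get(base_url, params=dict(params, page=0)).json()
hits = first['response']['meta']['hits']
num_pages = int(math.ceil(hits / 10.0))  # assumes 10 docs per page

all_docs = []
for page in range(num_pages):
    resp = requests.get(base_url, params=dict(params, page=page))
    all_docs.extend(resp.json()['response']['docs'])
    time.sleep(0.5)  # be gentle with the rate limit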
Here's a snippet of a single document returned by the API as JSON:

{u'_id': u'5254b0a738f0d8198974116f',
 u'abstract': u'Thomas L Friedman Op-Ed post contends that mainstream Republicans have a greater interest than Democrats in Pres Obama prevailing over Tea Party Republicans in the government shutdown showdown; holds that a Tea Party victory would serve to marginalize mainstream Republicans, and would make the party incapable of winning presidential elections.',
 u'blog': [],
 u'byline': {u'contributor': u'',
             u'original': u'By THOMAS L. FRIEDMAN',
             u'person': [{u'firstname': u'Thomas',
                          u'lastname': u'FRIEDMAN',
                          u'middlename': u'L.',
                          u'organization': u'',
                          u'rank': 1,
                          u'role': u'reported'}]},
 u'document_type': u'article',
 u'headline': {u'kicker': u'Op-Ed Columnist',
               u'main': u'U.S. Fringe Festival',
               u'print_headline': u'U.S. Fringe Festival'},
 u'keywords': [{u'is_major': u'N', u'name': u'subject', u'rank': u'5', u'value': u'Shutdowns (Institutional)'},
               {u'is_major': u'Y', u'name': u'subject', u'rank': u'9', u'value': u'Federal Budget (US)'},
               {u'is_major': u'N', u'name': u'persons', u'rank': u'7', u'value': u'Obama, Barack'},
               ... snip ...
               {u'is_major': u'N', u'name': u'organizations', u'rank': u'4', u'value': u'House of Representatives'}],
 u'lead_paragraph': u'What if Ted Cruz & Co. were to succeed in the shutdown showdown?',
 u'multimedia': [{u'height': 75,
                  u'legacy': {u'thumbnail': u'images/2010/09/16/opinion/Friedman_New/Friedman_New-thumbStandard.jpg',
                              u'thumbnailheight': u'75',
                              u'thumbnailwidth': u'75'},
                  u'subtype': u'thumbnail',
                  u'type': u'image',
                  u'url': u'images/2010/09/16/opinion/Friedman_New/Friedman_New-thumbStandard.jpg',
                  u'width': 75}],
 u'news_desk': u'Editorial',
 u'print_page': u'29',
 u'pub_date': u'2013-10-09T00:00:00Z',
 u'snippet': u'What if Ted Cruz & Co. were to succeed in the shutdown showdown?',
 u'source': u'The New York Times',
 u'type_of_material': u'Op-Ed',
 u'web_url': u'http://www.nytimes.com/2013/10/09/opinion/friedman-us-fringe-festival.html',
 u'word_count': u'900'}

As you can see, the Times adds lots of metadata to each post! There are human-annotated entities (e.g. Barack Obama) and subjects (e.g. Shutdown), the publication date, word count, a link to Thomas Friedman's new portrait, the URL for the digital article, as well as an abstract and the lead paragraph. Excellent! Except… Well, shit. Where's the full article text?!

Unfortunately, the New York Times doesn't want you to read its journalistic output without actually visiting the web site or buying the paper (think: advertising revenue), so they exclude that data from their API results. Irritating, yes, but there is hope: the URL for each article is included in the metadata. Getting Friedman's full text is just a matter of web scraping! Except… that's easier said than done. As I said, they want people to actually visit their site; robots don't count. On my first attempt at a straightforward scrape, the site's web admin blocked me within a hundred calls or so. Probably should've seen that coming.

I won't get into the full details of how I managed to scrape about 6500 Friedman articles from the NYT website, but I will share some general guidelines for how to scrape a web site without getting caught (a minimal sketch follows the list):

1. Send a proper User Agent (spoofing!) and full header information along with the URL, including a language and charset. At least pretend to be a real web browser.
2. Randomize the time interval between calls to the server. Sending a request exactly ten times per second is a great way to get classified as non-human. In Python, try something like time.sleep(random.random()); the longer you wait, the less noticeable you'll be.
3. Randomize the order in which you access content. Requesting all of Thomas L. Friedman's articles since 1981 in chronological order, one after the other, is also a great way to identify yourself as a bot.
4. Randomize your identity, i.e. your IP address. There are a number of ways to do this, including proxies and Tor. This may be non-trivial to set up, though, so find a good tutorial and follow along!
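To make the first three guidelines concrete, here's a minimal sketch; it reuses all_docs from the paging loop above, and the header values are illustrative stand-ins for whatever your real browser actually sends:

import random
import time

import requests

# spoofed browser headers, including a language and charset; copy the values
# your own browser sends (these particular strings are only placeholders)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
}

article_urls = [doc['web_url'] for doc in all_docs]
random.shuffle(article_urls)  # don't crawl in neat chronological order

html_by_url = {}
for url in article_urls:
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        html_by_url[url] = resp.text  # raw HTML; article text still to be parsed out
    time.sleep(2 + 3 * random.random())  # randomized, human-ish pause between calls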
Eventually, through somewhat nefarious means, I managed to scrape together a complete Friedman corpus. Huzzah. In my next post, I examine the data and do some basic summary statistics and sanity checks.