# %load "Importing Data in Python.py"file=open('moby_dick.txt','r')# Print itprint(file.read())# Check whether file is closedprint(file.closed)# Close filefile.close()# Check whether file is closedprint(file.closed)
# Read & print the first 3 lineswithopen('moby_dick.txt')asfile:print(file.readline())print(file.readline())print(file.readline())print(file.readline(10))print(file.readline(50))print(file.readline(50))
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago--never mind how long precisely--having
little or
no money in my purse, and nothing particular to in
terest me on
this is a special package that prints PEP 20, the Zen of Python.
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
BDFL: Benevolent Dictator For Life, a.k.a. Guido van Rossum, Python’s creator.
NumPy arrays are the standard for storing numerical data in Python.
Arrays are essential to other packages, such as scikit-learn for machine learning.
Import numpy and matplotlib (or invoke them with the %pylab magic command).
import numpy as np
import matplotlib.pyplot as plt
# or...
%pylab inline
# no need for prefixing functions with np. or plt.
Populating the interactive namespace from numpy and matplotlib
Import a csv file and assign the content to an array.
file = 'digits.csv'
# Load the file as an array called digits
digits = loadtxt(file, delimiter=',')
# Print the datatype of digits
print(type(digits))
print(digits)
# Select a row
im = digits[2, 2:]
print(im)
file = 'seaslug.txt'
# Import file: data
data = loadtxt(file, delimiter=' ', dtype=str)
# Print the first element of data
print(data[0])
['b"b\'Time\'"' 'b"b\'Percent\'"']
Import a text file as floats.
file = 'seaslug2.txt'
# Import data as floats and skip the first row: data_float
data_float = loadtxt(file, delimiter=' ', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
scatter(data_float[:, 0], data_float[:, 1])
xlabel('time (min.)')
ylabel('percentage of larvae')
show()
import pandas as pd

file = 'digits2.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)
# Print the datatype of data
print(type(data))
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
Import another file; flag the string 'Nothing' as missing data (NaN).
# Assign filename: file
file = 'titanic_corrupt.csv'
# Import file: data
data = pd.read_csv(file, sep=';', comment='#', na_values=['Nothing'])
# Print the head of the DataFrame
print(data.head())
PassengerId Survived Pclass Sex Age SibSp Parch \
0 1 0 3 male 22.0 1 0
1 2 1 1 female 38.0 1 0
2 3 1 3 female 26.0 0 0
3 4 1 1 female 35.0 1 0
4 5 0 3 male 35.0 0 0
Ticket Fare Cabin Embarked
0 A/5 21171 7,25 NaN S
1 PC 17599 NaN NaN NaN
2 STON/O2. 3101282 7,925 NaN S
3 113803 53,1 C123 S
4 373450 8,05 NaN S
There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries.
If you want your files to be human readable, you may want to save them as text files in a clever manner (JSONs, which you will see in a later chapter, are appropriate for Python dictionaries).
If, however, you merely want to be able to import them into Python, you can serialize them.
All this means is converting the object into a sequence of bytes, or bytestream.
Import the pickle package.
import pickle

# Save a dictionary into a pickle file.
fav = {'Airline': '8', 'Aug': '85', 'June': '69.4', 'Mar': '84.4'}
pickle.dump(fav, open("save.p", "wb"))  # save.p

# Open pickle file and load data: d
with open('save.p', 'rb') as file:
    d = pickle.load(file)

# Print d
print(d)
# Print datatype of d
print(type(d))
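To make the bytestream point above concrete, here is a minimal sketch (reusing the fav dictionary purely for illustration) contrasting the binary pickle representation with a human-readable JSON dump:

import json
import pickle

fav = {'Airline': '8', 'Aug': '85', 'June': '69.4', 'Mar': '84.4'}

# pickle.dumps produces a bytestream (not human readable)
print(pickle.dumps(fav))
# json.dumps produces plain text (human readable)
print(json.dumps(fav))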
from IPython.display import Image  # for the following pictures...
SAS files
Advanced analytics
Multivariate analysis
Business intelligence
Data management
Predictive analytics
Standard for computational analysis
Code (instead of importing the package):
# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

# Plot histogram of DataFrame features
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
The data are adapted from the website of the undergraduate textbook Principles of Econometrics by Hill, Griffiths and Lim.
The chart would be:
Image('p.png')
Stata files
The data consist of disease extent for several diseases in various countries (more information can be found online).
# Import pandas
import pandas as pd

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()
HDF5 files
Standard for storing large quantities of numerical data.
Datasets can be hundreds of gigabytes or terabytes.
HDF5 can scale to exabytes.
Code (instead of importing the package):
# Import packages
import numpy as np
import h5py

# Assign filename: file
file = 'LIGO_data.hdf5'

# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))

# Print the keys of the file
for key in data.keys():
    print(key)

# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
# (in h5py >= 3.0, .value was removed; use data['strain']['Strain'][()] instead)
strain = data['strain']['Strain'].value

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
You can find the LIGO data online, along with loads of documentation and tutorials on signal processing with it.
Image('strain.png')
MATLAB
“Matrix Laboratory”.
Industry standard in engineering and science.
Code (instead of importing the package):
# Import package
import scipy.io

# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype of mat
print(type(mat))
# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
This file contains gene expression data from the Albeck Lab at UC Davis. You can find the data and some great documentation online.
'sqlite:///Northwind.sqlite' is called the connection string to the SQLite database.
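As a minimal sketch (assuming SQLAlchemy is installed and Northwind.sqlite sits in the working directory), the connection string is what you pass to create_engine:

from sqlalchemy import create_engine

# The connection string names the SQL dialect (sqlite) and the database file
engine = create_engine('sqlite:///Northwind.sqlite')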
The Chinook database contains information about a semi-fictional digital media store, in which the media data are real while the customer, employee and sales data have been manually created.
Code (instead of importing the package):
# Save the table names to a list: table_names
# (assumes an engine has already been created with create_engine, as above)
table_names = engine.table_names()
# Print the table names to the shell
print(table_names)
Query the DB
The final ; in a SQL query is optional.
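To illustrate, a quick sketch (assuming pandas is imported as pd and a Chinook engine has been created with create_engine, as in the code below); the query runs identically with or without the trailing semicolon:

df1 = pd.read_sql_query("SELECT * FROM Album", engine)
df2 = pd.read_sql_query("SELECT * FROM Album;", engine)

# Both queries return the same table
print(df1.equals(df2))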
Code (instead of importing the package):
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine connection: con
con = engine.connect()

# Perform query: rs
rs = con.execute('SELECT * FROM Album')

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close connection
con.close()

# Print head of DataFrame df
print(df.head())
Customize queries
Code (instead of importing the package):
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')  # optional if the engine was already created

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT LastName, Title FROM Employee')
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()  # set the DataFrame's column names to the corresponding table columns

# Print the length of the DataFrame df
print(len(df))
# Print the head of the DataFrame df
print(df.head())
Code (instead of importing the package):
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')  # optional if the engine was already created

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df
print(df.head())
Code (instead of importing the package):
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee ORDER BY BirthDate')
    df = pd.DataFrame(rs.fetchall())
    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame
print(df.head())
Query the DB the Pandas way
Simpler code (instead of importing the package)!!!
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)

# Print head of DataFrame
print(df.head())

# Open engine in context manager
# Perform query and save results to DataFrame: df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()

# Confirm that both methods yield the same result: does df = df1 ?
print(df.equals(df1))
Code (instead of importing the package):
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", engine)

# Print head of DataFrame
print(df.head())
INNER JOIN
Code (instead of importing the package):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())
Alternative code:
df=pd.read_sql_query("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID",engine)# Print head of DataFrame dfprint(df.head())
Code (instead of importing the package):
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)

# Print head of DataFrame
print(df.head())
from urllib.request import urlretrieve
# import pandas as pd

# Assign url of file: url
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
# import matplotlib.pyplot as plt
# import pandas as pd

# Assign url of file: url
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame
print(df.head())

# Plot first column of df (.ix was removed in recent pandas; use .iloc)
pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
# import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheetname=None)  # in newer pandas the argument is sheet_name=None

# Print the sheet names (keys) to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
dict_keys(['1700', '1900'])
country 1700
0 Afghanistan 34.565000
1 Akrotiri and Dhekelia 34.616667
2 Albania 41.312000
3 Algeria 36.720000
4 American Samoa -14.307000
HTTP requests to import files from the web
requests is one of the most downloaded Python packages.
requests provides a higher-level interface than urllib for making HTTP requests.
First, use the urllib package.
from urllib.request import urlopen, Request

# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# This packages the request: request
request = Request(url)

# Send the request and catch the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Be polite and close the response!
response.close()
<class 'http.client.HTTPResponse'>
from urllib.request import urlopen, Request

url = "http://docs.datacamp.com/teach/"
request = Request(url)
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the html
print(html)

# Be polite and close the response!
response.close()
b'<!DOCTYPEhtml>\n<linkrel="shortcut icon"href="images/favicon.ico"/>\n<html>\n\n<head>\n<metacharset="utf-8">\n<metahttp-equiv="X-UA-Compatible"content="IE=edge">\n<metaname="viewport"content="width=device-width, initial-scale=1">\n\n<title>Home</title>\n<metaname="description"content="All Documentation on Course Creation">\n\n<linkrel="stylesheet"href="/teach/css/main.css">\n<linkrel="canonical"href="/teach/">\n<linkrel="alternate"type="application/rss+xml"title="DataCamp Teach Documentation"href="/teach/feed.xml"/>\n</head>\n\n\n<body>\n\n<headerclass="site-header">\n\n<divclass="wrapper">\n\n<aclass="site-title"href="/teach/">DataCampTeachDocumentation</a>\n\n</div>\n\n</header>\n\n\n<divclass="page-content">\n<divclass="wrapper">\n<p>TheTeachDocumentationhasbeenmovedto<ahref="https://www.datacamp.com/teach/documentation">https://www.datacamp.com/teach/documentation</a>!</p>\n\n<!-- Everybody can teach on DataCamp. The resources on this website explain all the steps to build your own course on DataCamp\'s interactive data science platform.\n\nInterested in partnering with DataCamp? Head over to the [Course Material](/teach/course-material.html) page to get an idea of the requirements to build your own interactive course together with DataCamp!\n\n## Table of Contents\n\n- [Course Material](/teach/course-material.html) - Content required to build a DataCamp course.\n- [Video Lectures](/teach/video-lectures.html) - Details on video recording and editing.\n- [DataCamp Teach](https://www.datacamp.com/teach) - Use the DataCamp Teach website to create DataCamp courses (preferred).\n- [datacamp R Package](https://github.com/datacamp/datacamp/wiki) - Use R Package to create DataCamp courses (legacy).\n- [Code DataCamp Exercises](/teach/code-datacamp-exercises.html)\n- [SCT Design (R)](https://github.com/datacamp/testwhat/wiki)\n- [SCT Design (Python)](https://github.com/datacamp/pythonwhat/wiki)\n- [Style Guide](/teach/style-guide.html) -->\n\n\n </div>\n </div>\n\n \n\n </body>\n\n</html>\n'
Using requests
importrequestsurl="http://docs.datacamp.com/teach/"r=requests.get(url)text=r.text# Print part of the html (split the paragraphs) instead of all with print(text)head=text.split('\n\n')print(head[0])print('')print(head[1])print('')print(head[2])print('')print(head[3])
<!DOCTYPE html><linkrel="shortcut icon"href="images/favicon.ico"/><html><head><metacharset="utf-8"><metahttp-equiv="X-UA-Compatible"content="IE=edge"><metaname="viewport"content="width=device-width, initial-scale=1"><title>Home</title><metaname="description"content="All Documentation on Course Creation"><linkrel="stylesheet"href="/teach/css/main.css"><linkrel="canonical"href="/teach/"><linkrel="alternate"type="application/rss+xml"title="DataCamp Teach Documentation"href="/teach/feed.xml"/></head>
Scraping the web
Scraping the web yields mostly unstructured HTML data.
To get structured data, parse the HTML and extract what you need with the BeautifulSoup package.
Import the packages.
import requests
from bs4 import BeautifulSoup

url = 'https://www.python.org/~guido/'
r = requests.get(url)
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'lxml')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(type(pretty_soup))

# Print part of the html (split the text), not all with print(pretty_soup)
head = pretty_soup.split('</h3>')
print(head[0])
<class'str'><html><head><title>
Guido's Personal Home Page
</title></head><bodybgcolor="#FFFFFF"text="#000000"><h1><ahref="pics.html"><imgborder="0"src="images/IMG_2192.jpg"/></a>
Guido van Rossum - Personal Home Page
</h1><p><ahref="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm"><i>
"Gawky and proud of it."
</i></a></p><h3><ahref="http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg">
Who
I Am
</a>
Other operations with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = 'https://www.python.org/~guido/'
r = requests.get(url)
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'lxml')

# Get the title of Guido's webpage: guido_title
guido_title = soup.title  # attribute

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()  # method

# Print Guido's text to the shell
print(guido_text)
<title>Guido's Personal Home Page</title>Guido'sPersonalHomePageGuidovanRossum-PersonalHomePage"Gawky and proud of it."WhoIAmIamtheauthorofthePythonprogramminglanguage.Seealsomyresumeandmypublicationslist,abriefbio,assortedwritings,presentationsandinterviews(allaboutPython),somepicturesofme,mynewblog,andmyoldblogonArtima.com.Iam@gvanrossumonTwitter.IalsohaveaG+profile.InJanuary2013IjoinedDropbox.IworkonvariousDropboxproductsandhave50% for my Python work, no strings attached.Previously,IhaveworkedforGoogle,ElementalSecurity,ZopeCorporation,BeOpen.com,CNRI,CWI,andSARA.(Seemyresume.)IcreatedPythonwhileatCWI.HowtoReachMeYoucansendemailformetoguido(at)python.org.Ireadeverythingsentthere,butifyouaskmeaquestionaboutusingPython,it's likely that I won'thavetimetoanswerit,andwillinsteadreferyoutohelp(at)python.org,comp.lang.pythonorStackOverflow.Ifyouneedtotalktomeonthephoneorsendmesomethingbysnailmail,sendmeanemailandI'll gladly email you instructions on how to reach me.My NameMy name often poses difficulties for Americans.Pronunciation: in Dutch, the "G" in Guido is a hard G,pronounced roughly like the "ch" in Scottish "loch". (Listen to thesound clip.) However, if you'reAmerican,youmayalsopronounceitastheItalian"Guido".I'm nottoo worried about the associations with mob assassins that some peoplehave. :-)Spelling: my last name is two words, and I'dlikekeepitthatway,thespellingonsomeofmycreditcardsnotwithstanding.Dutchspellingrulesdictatethatwhenusedincombinationwithmyfirstname,"van"isnotcapitalized:"Guido van Rossum".Butwhenmylastnameisusedalonetorefertome,itiscapitalized,forexample:"As usual, Van Rossum was right."Alphabetization:inAmerica,Ishowupinthealphabetunder"V".ButinEurope,Ishowupunder"R".Andsomeofmyfriendsputmeunder"G"intheiraddressbook...MoreHyperlinksHere's a collection of essays relating to Pythonthat I'vewritten,includingtheforewordIwroteforMarkLutz' book"Programming Python".I own the official Python license.The Audio File Formats FAQI was the original creator and maintainer of the Audio File FormatsFAQ. It is now maintained by Chris Bagwellat http://www.cnpbagwell.com/audio-faq. And here is a link toSOX, to which I contributedsome early code."On the Internet, nobody knows you'readog."
More: extract all the hyperlinks.
import requests
from bs4 import BeautifulSoup

url = 'https://www.python.org/~guido/'
r = requests.get(url)
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'lxml')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')  # for <a>, hyperlinks

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
An API (Application Programming Interface) is a set of protocols and routines providing access to websites and web apps such as OMDb, Wikipedia, Uber, Uber Developers, BGG, IMDb, Facebook, Instagram, and Twitter.
Most data coming from APIs arrive as JSON.
Import the json package.
import json

# Load JSON: json_data
with open('a_movie.json', 'r') as json_file:
    json_data = json.load(json_file)

print(type(json_data))
print(json_data['Title'])
print(json_data['Year'])
print('')

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
Pull some movie data down from the Open Movie Database (OMDB) using their API.
Pull it as text.
import requests

url = 'http://www.omdbapi.com/?t=social+network'
r = requests.get(url)

print(type(r))
print('')

# Print the text of the response
print(r.text)
<class'requests.models.Response'>{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 161 wins & 162 nominations.","Poster":"http://ia.media-imdb.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg","Metascore":"95","imdbRating":"7.7","imdbVotes":"478,258","imdbID":"tt1285016","Type":"movie","Response":"True"}
Pull it as JSON and decode it into a dictionary.
import requests

url = 'http://www.omdbapi.com/?t=social+network'
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

print(type(json_data))
print('')

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
import requests

url = 'http://chroniclingamerica.loc.gov/search/titles/results/?terms=new%20york&format=json'
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Select the first element in the list json_data['items']: nyc_loc
# (a dict of dicts)
nyc_loc = json_data['items'][0]

# Print each key-value pair in nyc_loc
for k in nyc_loc.keys():
    print(k + ': ', nyc_loc[k])
county: ['New York']
place_of_publication: New York
oclc: 12928956
subject: ['New York (N.Y.)--Newspapers.', 'New York (State)--New York.--fast--(OCoLC)fst01204333']
alt_title: []
title: The New York thrice-a-week world.
type: title
edition: New York and Pennsylvania ed.
id: /lccn/sn85047837/
note: ['Democrat.', 'Description based on: Vol. 36, no. 3,670 (Oct. 4, 1895).', 'The words "New York thrice-a-week" appear in title ornament.']
place: ['New York--New York--New York']
essay: []
start_year: 1890
end_year: 1999
publisher: Press Pub. Co.
lccn: sn85047837
holding_type: ['Unspecified']
state: ['New York']
city: ['New York']
language: ['English']
country: New York
title_normal: new york thrice-a-week world.
url: http://chroniclingamerica.loc.gov/lccn/sn85047837.json
frequency: Three times a week
import requests

url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
<p><b>Pizza</b> is a flatbread generally topped with tomato sauce and cheese and baked in an oven. It is commonly topped with a selection of meats, vegetables and condiments. The term was first recorded in the 10th century, in a Latin manuscript from Gaeta in Central Italy. The modern pizza was invented in Naples, Italy, and the dish and its variants have since become popular in many areas of the world.</p><p>In 2009, upon Italy's request, Neapolitan pizza was safeguarded in the European Union as a Traditional Speciality Guaranteed dish. The Associazione Verace Pizza Napoletana (the True Neapolitan Pizza Association) is a non-profit organization founded in 1984 with headquarters in Naples. It promotes and protects the "true Neapolitan pizza".</p><p>Pizza is sold fresh or frozen, either whole or in portions, and is a common fast food item in Europe and North America. Various types of ovens are used to cook them and many varieties exist. Several similar dishes are prepared from ingredients commonly used in pizza preparation, such as calzone and stromboli.</p><p></p>
The Twitter API and Authentication
Twitter has many APIs: the main REST API, the Streaming APIs (public and private), the Firehose (expensive), etc.
Consult the documentation to set up an authentication key (available online).
tweepy package
The authentication looks like the following:
Code:
# Import packages
import tweepy, json

# Store OAuth authentication credentials in relevant variables
access_token = "---"
access_token_secret = "---"
consumer_key = "---"
consumer_secret = "---"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
Start streaming tweets
Code:
# Initialize Stream listener
l = MyStreamListener()

# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)

# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])
Code of MyStreamListener():
Creates a file called tweets.txt, collects streaming tweets as .jsons and writes them to the file tweets.txt; once 100 tweets have been streamed, the listener closes the file and stops listening.
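The listener class itself is not reproduced in these notes; below is a minimal sketch of what it could look like with tweepy 3.x (the class body is an assumption modeled on the behaviour described above and on tweepy's StreamListener interface):

import json
import tweepy

class MyStreamListener(tweepy.StreamListener):
    """Write incoming tweets to tweets.txt and stop after 100 tweets."""

    def __init__(self, api=None):
        super(MyStreamListener, self).__init__(api)
        self.num_tweets = 0
        self.file = open("tweets.txt", "w")

    def on_status(self, status):
        # Append the raw tweet JSON as one line of tweets.txt
        self.file.write(json.dumps(status._json) + '\n')
        self.num_tweets += 1
        if self.num_tweets < 100:
            return True
        self.file.close()
        return False  # returning False stops the stream

The code below then reads tweets.txt back in.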
# Import package
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())
Send the Twitter data to a DataFrame
The Twitter data are now in a list of dictionaries, tweets_data, where each dictionary corresponds to a single tweet.
The text in a tweet t1 is stored as the value t1['text']; similarly, the language is stored in t1['lang'].
Code:
# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())
Analyze the tweets (NLP, regex)
A little bit of Twitter text analysis and plotting.
Use the statistical data visualization library seaborn.
Code:
# Import the regular expressions library
import re

# The function tells you whether the first argument (a word) occurs within the 2nd argument (a tweet)
def word_in_text(word, tweet):
    word = word.lower()
    text = tweet.lower()
    match = re.search(word, text)  # search the lowercased tweet text
    if match:
        return True
    return False

# Initialize the tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot histogram
ax = sns.barplot(cd, [clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()