Dirty Users Dirty Data Dirty Work

Today's DB Objectives:

Play with a mild front-end in Angular
Establish a high capacity workflow using ECE systems
Work through project specification gathering for a data-driven project
Web Scraping 101 (Regular Expressions, Beautiful Soup)
Programmatic connection to MySQL
Build the beginnings of the next great social media nutritional recipe sharing site

Lesson 0: An angular front-end

I built a quick front-end for your chinook customers. It doesn't have everything but it gives you an idea of how your front-end team will lean on your DB work. Here is a zip of the HTML/JS code. And here is a stub of the code working in plunker.

wget http://db.prof.ninja/class4/front-end.zip
mkdir public
mv front-end.zip public
cd public
unzip front-end.zip
Edit server.js and add the lines var path = require('path'); and router.use(express.static(path.resolve(__dirname, 'public')));
If you used PHP then this needs to be slightly adapted, in the front-end code adjust the paths to /api/customer etc.

Lesson 1: Our ECE workflow

This is so important I'm going to not put it here, but rather on the front page of the website.

By the way if you are struggling with PHP then checkout PHPFiddle, if you are struggling with Node checkout Node-Console.

I will walk you through a complete setup of a JSON-out Node and MySQL workflow:

SSH into mlb1.acad as in the workflow page
bash
mysql -h orioles1 -u USERNAME -p USERNAME
create table bananas (name varchar(20), ripeness int);
insert into bananas values ("Great Banana", 7);
Exit MySQL
git clone https://github.com/cisc437/node-api-boilerplate.git
cd node-api-boilerplate
vi server.js
Change USERNAME and PORT
ZZ (save and quit)
npm install
npm install forever
./node_modules/forever/bin/forever start server.js
Visit http://mlb1.acad.ece.udel.edu:PORT/ or curl localhost:PORT from CLI
./node_modules/forever/bin/forever stop server.js

Get this running now to make sure you have access to MySQL, a bigger more stable host than cloud9, and can serve content.

A useful tutorial with trick for tossing a large load at your server and pooling mysql connections.

Quick SQL quiz

Christopher was kind enough to collate some SQL statements from our book using the University database we created. Look at each of these queries and decide what you think it does:

select name from instructor where salary is null
select avg(salary) as av_salary from instructor where dept_names = 'Comp. Sci.'
select dept_name, avg(salary) as avg_salary from instructor group by dept_name
select avg(salary) from instructor
select course_id, semester, year, sec_id, avg(tot_cred) from takes natural join student where year = 2009 group by course_id, semester, year, sec_id having count(ID)>=2
select distinct name from instructor where name not in ('Mozart', 'Einstein')
select distinct(T.name) from instructor as T, instructor as S where T.salary > S.salary and S.dept.name = 'Biology'
select ID from student natural left outer join takes where course_id is null
select * from student join takes using (ID)

A Sample Project Specification Session

One of the toughest part of any project is delivering the right thing. Often you need to wireframe early to help the end-user imagine themselves using the new product before they realize what they'll need. The Kevin Bacon project is intentionally unspecified, much like a non-technical boss might request of you. I want to practice specification collection and the beginnings of database design with a sample project today.

Social Nutritional Recipe Analyzer

Wouldn't it be cool to let users create recipes and spit out the number of calories and miligrams of sodium? Then they could share the recipes with each other!

Let's start to build it

Gather with a partner and discuss what sort of data you'll need to make this product sing. Together come up with questions to ask me, the stakeholder, and perhaps even generate user-stories.

Imagine which tables will be needed to pull this off, and if you add a feature what tables will be added, if you change a feature how badly might your tables break?

Note that recipes and ingredients have a many-to-many relationship, that is, each ingredient might be in many recipes and each recipe might have many ingredients. How do you want to solve this? Can you do it in such a way that there is little data redundancy in the table?

Feel free to study some websites with nutritional data to get a feel for what is feasible.

Web-Scraping Dirty Data

When we build data-driven apps we often need to gather data from third-party sources. Today we'll do some data-pirating. Some sites explicitly forbid this so beware.

For our project I want to scrape CalorieCount.com. Let's begin by looking at the page layouts for a couple of ingredients and a couple of recipes. Inspect elements, source, network requests, etc. Can we get lucky and find a public facing API?

Now I want to show you my pythonic way of scraping sites. I like to do it python because the syntax is cleanest and I can putz around in the terminal while figuring out my structure before having to write the real program.

I prefer to use iPython for fast poking around (Anaconda has ipython and most of the great packages pre-installed). Also I like to get the concepts working before going into a harsh environment (our SSH terminals). So fire up a python/ipython terminal somewhere, maybe on an ece machine.

I'll code along with you on this but here are the commands and modules we'll want to use:

import re, urllib2, bs4 (re is regular expressions, bs4 is BeautifulSoup4)
resp = urllib2.urlopen("http://www.caloriecount.com/calories-bananas-i9040")
html = resp.read() (print html to see the mess)
soup = bs4.BeautifulSoup(html)

OK at this point we have the entire blob of html from the bananas calories page. Also we have made some soup, whatever that means.

Task 1: Use your R&I (Resourcefulness and Initiative) to extract the number of calories in a serving of bananas.

Task 2: Find the pattern for recipes/ingredients and scrape links to related banana-friendly recipes.

Task 3: Take your process for calories and turn it into a function, then apply it to an ingredient in a different recipe.

MySQL and the Language of your choice

Earlier we showed you Node/MySQL. http://stackoverflow.com/a/5687856/1347629 will deal with python/mysql.

How about C++ to MySQL?

How about PHP to MySQL?

Start a web scraping job to populate your app tables

To put it all together for today's class, let's take our rough data designs from earlier, and marry them to the rough web scraping we're doing. Can we now begin to fill out a recipe/ingredient/nutrition database one http request at a time?