Some articles about building web-based business — October 14, 2017

Some articles about building web-based business

https://levels.io/how-i-build-my-minimum-viable-products/ : introduces a lot of useful tools.

https://levels.io/go-fucking-do-it/

https://levels.io/how-i-built-a-remote-jobs-board/

https://levels.io/product-hunt-hacker-news-number-one/

https://levels.io/run-through-ideas-quickly/ 

 

How to change default path for Mac screenshot —
Data scientist – series 1: Intro to Python — September 30, 2017

Data scientist – series 1: Intro to Python

Udacity course: https://classroom.udacity.com/courses/ud1110/lessons/

Notes:

  1. String formatting: One particularly useful string method is format. The format method constructs strings by inserting values into template strings. Consider this example, which generates log messages for a hypothetical web server.
    log_message = "IP address {} accessed {} at {}".format(user_ip, url, now)
    

    If the variables user_ip, url, and now are defined, they will be substituted for the {} placeholders in order.
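A minimal runnable version of the log message example, written for Python 3. The values assigned to user_ip, url, and now are made-up sample data, chosen only so the snippet runs on its own.

```python
from datetime import datetime

# Hypothetical sample values, not taken from any real server log.
user_ip = "203.0.113.7"
url = "/index.html"
now = datetime(2017, 9, 30, 12, 0, 0)

# Each {} is filled by the next positional argument, left to right.
log_message = "IP address {} accessed {} at {}".format(user_ip, url, now)
print(log_message)

# Named placeholders make longer templates easier to read.
named_message = "IP address {ip} accessed {page}".format(ip=user_ip, page=url)
print(named_message)
```

Positional placeholders are fine for short templates. Named ones are probably easier to maintain once a template has more than two or three values.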

Web Crawler with Beautiful Soup —

Web Crawler with Beautiful Soup

Today I tried using Beautiful Soup to retrieve website data. The code is below:

import urllib2
from bs4 import BeautifulSoup as soup

quote_page = 'https://weworkremotely.com/'
page = urllib2.urlopen(quote_page)
read_soup = soup(page, "html.parser")

name_box_job = read_soup.find('span', attrs={'class': 'title'})
name_box_company = read_soup.find('span', attrs={'class':'company'})

job = name_box_job.text.strip()
company = name_box_company.text.strip()
print ("{} : {}".format(company, job))

The output is:

Citron Pharmaceutical : Clerical Customer Support

Basically, I access weworkremotely.com and retrieve the first span with class "title" and the first span with class "company".

But since there are many jobs posted on the website, I would like to retrieve all the posts and companies that have the same classes, "title" and "company".

So instead of read_soup.find, I should use read_soup.find_all, plus a for loop to collect all the items in a list.

import urllib2
from bs4 import BeautifulSoup as soup

quote_page = 'https://weworkremotely.com/'
page = urllib2.urlopen(quote_page)
read_soup = soup(page, "html.parser")
jobs = []
companys = []
name_box_job = read_soup.find_all('span', attrs={'class': 'title'})
name_box_company = read_soup.find_all('span', attrs={'class':'company'})

# collect the text of every job title
for n in range(len(name_box_job)):
    jobs.append(name_box_job[n].get_text())

# collect the text of every company name
for m in range(len(name_box_company)):
    companys.append(name_box_company[m].get_text())

print jobs, companys

However, the output of this code is very ugly; I need to work on improving the format.
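One possible way to tidy the output, as a sketch only: pair each company with its job and print one posting per line, in the same "company : job" form as the first script. The helper name format_postings is hypothetical. The sketch is written for Python 3 and assumes the nth title on the page belongs to the nth company, which may not hold for every page layout. The second sample posting is made up for illustration.

```python
def format_postings(jobs, companies):
    """Return one 'company : job' line per posting."""
    lines = []
    # zip pairs items by position and stops at the shorter list
    for company, job in zip(companies, jobs):
        lines.append("{} : {}".format(company.strip(), job.strip()))
    return lines


# The first pair is the output shown earlier; the second pair is made up.
jobs = ["Clerical Customer Support", "  Backend Developer\n"]
companys = ["Citron Pharmaceutical", "Example Co"]

for line in format_postings(jobs, companys):
    print(line)
```

Calling strip() on each item removes the stray whitespace and newlines that get_text() tends to return, which is likely a large part of why the raw list output looks messy.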

 

Reference websites:

http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html

http://altitudelabs.com/blog/web-scraping-with-python-and-beautiful-soup/

http://pwp.stevecassidy.net/dataweb/crawling.html

https://beautiful-soup-4.readthedocs.io/en/latest/

 

Word Cloud with Python — September 29, 2017

Word Cloud with Python

Today I played with the Python word cloud library. It is quite easy to use and the output is interesting.

The word cloud library is here: https://github.com/amueller/word_cloud

And the owner's blog: http://peekaboo-vision.blogspot.hk/2012/11/a-wordcloud-in-python.html

The code:

import numpy as np 
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
#%matplotlib inline
from PIL import Image

from subprocess import check_output
from wordcloud import WordCloud, STOPWORDS

#mpl.rcParams['figure.figsize']=(8.0,6.0) 
mpl.rcParams['font.size']=12                
mpl.rcParams['savefig.dpi']=100             
mpl.rcParams['figure.subplot.bottom']=.1

stopwords = set(STOPWORDS)
data = pd.read_csv('ted_main.csv')

wordcloud = WordCloud(
                          background_color='black',
                          stopwords=stopwords,
                          max_words=200,
                          max_font_size=40,
                          random_state=42,
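                          # join every description into one text; str() on a pandas Series
                          # returns its truncated display preview, not the full text
                         ).generate(' '.join(data['description'].dropna().astype(str)))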

print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
fig.savefig("wordcloud.png", dpi=1200)

The code is very lightweight. I used the TED talk subject data as the input, and the output word cloud is below:

wordcloud-ted.png
TED talk subject word cloud

The above is the basic version. I also tried a few more features, such as defining extra words to skip:

#more_stopwords = {'talk','TED','Information','event'}
#STOPWORDS = STOPWORDS.union(more_stopwords)

I also added a mask to the output so that the word cloud takes a different shape:

image = Image.open('mask2.png')
mask = np.array(image)

wordcloud = WordCloud(
                          background_color='black',
                          stopwords=stopwords,
                          max_words=200,
                          max_font_size=40,
                          random_state=42,
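                          mask=mask,
                          # join every description into one text; str() on a pandas Series
                          # returns its truncated display preview, not the full text
                         ).generate(' '.join(data['description'].dropna().astype(str)))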

Output as below:

wordcloud-ted2.png

wordcloud-ted3.png
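As a rough illustration of what the extra stopwords mentioned above do, the sketch below counts words in a sentence while skipping a stopword set. It is not how the wordcloud library works internally. The helper count_words, the tiny base_stopwords set, and the sample sentence are all hypothetical stand-ins, and the case-insensitive comparison is my own choice.

```python
import re
from collections import Counter


def count_words(text, stopwords):
    """Count words in text, skipping stopwords (compared case-insensitively)."""
    skip = {w.lower() for w in stopwords}
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in skip)


# Tiny stand-in for the library's STOPWORDS set (hypothetical, not the real list).
base_stopwords = {"a", "the", "about", "at", "and"}
more_stopwords = {'talk', 'TED', 'Information', 'event'}

# union returns a new set, so the result has to be the one that gets used
stopwords = base_stopwords.union(more_stopwords)

sample = "A TED talk about data and more data at the event"
counts = count_words(sample, stopwords)
print(counts.most_common())
```

The point to carry over is that union does not modify the original set. Whichever variable holds the combined set is the one that should be passed as the stopwords argument.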

This is a very interesting topic and I will continue to expand on the current result, so I am storing these resources here and will come back later to try them out:

http://luisvalesilva.com/datasimple/word_clouds.html

https://happygostacie.wordpress.com/2016/04/22/word-clouds-in-python-what-a-pil/ (this is interesting, can pick the color of word cloud based on the input mask)

http://minimaxir.com/2016/05/wordclouds/ (interesting post)

https://pypi.python.org/pypi/facebook_wordcloud/1.01b (word cloud library for Facebook chat history)

 

 

Python: The _imagingft C module is not installed when running wordcloud code —

Python: The _imagingft C module is not installed when running wordcloud code

When I tried to run the wordcloud Python library today, I received the error: "The _imagingft C module is not installed". The reason was that freetype was not installed on my Mac. I tried a lot of methods, and finally the one below worked:

I already have Homebrew installed.

First,

brew install freetype

After that, the following files are in /usr/local/lib: libfreetype.6.dylib, libfreetype.a, libfreetype.dylib

Then

pip install http://effbot.org/downloads/Imaging-1.1.6.tar.gz

Then run the wordcloud code, and it works!
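A quick way to check whether the fix took effect before running the whole word cloud script, as a sketch only. The helper name freetype_available is hypothetical. It assumes the FreeType extension is importable as PIL._imagingft, which is how Pillow lays it out; an older PIL install may expose it differently.

```python
def freetype_available():
    """Return True if the imaging library's FreeType extension can be imported."""
    try:
        # Assumption: the C extension lives at PIL._imagingft
        from PIL import _imagingft  # noqa: F401
    except ImportError:
        # Raised when PIL is missing or was built without FreeType
        return False
    return True


print("FreeType support:", freetype_available())
```

If this prints False after installing freetype, the imaging library was most likely built before freetype was available and needs to be reinstalled.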

 

 

Github tips — September 28, 2017
Basic building block for a data scientist — September 9, 2017

Basic building block for a data scientist

Tools:

Python, R, SQL

Statistics:

Probability, statistical tests, distributions, maximum likelihood estimators

Techniques:

Machine learning: really understand when it is appropriate to use different techniques.

Math:

Multivariable Calculus and Linear Algebra

Data Cleaning:

Techniques to clean the data to be suitable for analysis.

Data Visualization & Communication:

e.g., ggplot and d3.js

Software Engineering:

Understand how a data-driven product is designed and developed.

Machine learning / deep learning resources with Python — August 26, 2017