Yi's Notes

Data Scientist | Photographer


WSL is a life-saver for enterprise developers who need a bash-like environment.

Setting up cntlm lets WSL communicate with the outside world through a corporate NTLM proxy. The steps are simple and straightforward.

Step 1: install cntlm and configure the proxy

To install cntlm in WSL Ubuntu, run: sudo apt install cntlm

However, if you are inside the enterprise network, apt will not be able to reach the repositories yet. In that case, go to https://packages.ubuntu.com/bionic/net/cntlm and download the relevant package.
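
Once downloaded (for example from a machine that does have direct internet access), the package can be installed manually with dpkg; a minimal sketch, where the .deb file name is a placeholder for whichever version you downloaded:

# install the manually downloaded cntlm package (file name is a placeholder)
sudo dpkg -i cntlm_*.deb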

To configure cntlm on Ubuntu, we need to set a few fields in /etc/cntlm.conf: domain, username, password (hash) and proxy:port.

An example of the cntlm.conf content:

Username        john.smith
Domain corp

Auth NTLMv2
PassNTLMv2 8B767A8C8A34AD69ED6DDD5BF42C0CB8

Proxy 10.11.1.23:8080 # company proxy
NoProxy localhost, 127.0.0.*, 10.*, 192.168.*
Listen 3128 # proxy for cntlm

One trick to obtain the correct Auth and Pass* settings is to run sudo cntlm -M http://google.com and type your password; the right values are printed at the end.

Now we are ready to restart the service: sudo service cntlm restart
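
With cntlm listening on localhost:3128 as configured above, the last step is to point your tools at it; a minimal sketch to adapt to your setup (the apt.conf.d file name below is my own choice, any name works):

# route shell tools through the local cntlm listener (add to ~/.bashrc to persist)
export http_proxy=http://localhost:3128
export https_proxy=http://localhost:3128
# apt may not see these variables when run with sudo, so give it its own proxy config
echo 'Acquire::http::Proxy "http://localhost:3128";' | sudo tee /etc/apt/apt.conf.d/95proxy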


Nowadays, Agile has become very popular in software development. Data science is another trending discipline, one that more and more companies try to build up and benefit from.

Scrum, as a methodology, tries to help build good software in small increments. There have been many attempts to apply agile software methods to data science[1][2], but the results are mostly unsatisfactory.

Based on my experience, my reading, and a recent Scrum training I attended at McGill University, I try to discuss:

  • why the agile Scrum framework is not a good fit for data science projects
  • how agile concepts translate to the data science discipline
  • how the framework can be adapted
  • what a minimum viable product means in the context of a data product

Recently, my new company set everything up on AWS. Amazon Linux is basically Red Hat Linux / CentOS, but it uses non-standard folder names.

When I try to install an R package that uses a C compiler during installation, the compiler often can’t find a file it needs.

As stated in this thread (I was trying to install data.table), one way to solve it is to link the files to their standard locations:

ln -s /usr/lib/gcc/x86_64-amazon-linux/6.4.1/include/omp.h /usr/local/include/
ln -s /usr/lib/gcc/x86_64-amazon-linux/6.4.1/libgomp.spec /usr/lib64/libgomp.spec
ln -s /usr/lib64/libgomp.so.1.0.0 /usr/lib64/libgomp.so

This is not an ultimate solution, though: each time we need to find the non-standard location of the missing file and link it. What’s more, you need sudo to make it happen.

This Stack Overflow thread has a better solution.

First of all, we need to create a .R folder under ~/ with a Makevars file in it, containing the following (it tells R which compiler to use):

CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp

One trick: if you don’t have ssh access to the server, or are not really familiar with a terminal editor (nano, vim), you can edit the file by typing file.edit("~/.R/Makevars") in R. It will open the file in RStudio.
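
If you do have shell access instead, the same file can also be created directly from bash; a minimal sketch reusing exactly the content above:

# create ~/.R/Makevars from the shell (same content as above)
mkdir -p ~/.R
cat > ~/.R/Makevars <<'EOF'
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
EOF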

I gathered my side notes from a recommendation project for SSENSE’s product display pages (PDP), where the model I developed serves a widget on the PDP pages.

Recommendation Motivation

Why do we have recommendations nowadays? The internet has gone from web directories (a list) to search engines (passive), and is now embracing recommendation systems (pro-active). All of these serve the same need: helping internet users discover and find relevant information amid information overload.


Recently, I worked on a recommendation system at SSENSE. I employed a collaborative filtering algorithm based on implicit feedback, namely the number of hits per item.

The methodology part is based on:

Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference on. IEEE, 2008.

Here, the implicit feedback is the number of hits on the product/item page.
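
To recall the core idea of that paper (notation as in the paper): the raw counts r_ui are turned into a binary preference p_ui and a confidence c_ui, and user factors x_u and item factors y_i are fit with a weighted, regularized least-squares objective:

$$p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{if } r_{ui} = 0 \end{cases}, \qquad c_{ui} = 1 + \alpha\, r_{ui}$$

$$\min_{x_*,\, y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2 + \lambda\Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)$$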


When people talk about data science in today’s business, they think of big data, the Hadoop ecosystem, open data, skill sets such as R, Python, Spark, and machine learning. I believe they have got it wrong! Data science is focused on data: it is about how to extract information and patterns from data. All the rest are just tools, like languages, that help you achieve that goal.

So what’s the difference between the old names, such as data analyst and data miner, and the fancy name data scientist? I once got it wrong by thinking that being a data scientist is about mining big datasets via fancy machine-learning black boxes using R, Spark, or Python. For sure these skills, from programming to machine learning, are key to being a successful data scientist, but the core of data science is innovation (the term in the business world) or research (the term in the academic world).

Data science in business and in academia are doing the same thing, i.e. trying to use new methodologies and approaches to tackle old problems. Hence all data scientists need critical thinking, and the willingness to leave classical approaches behind.

A PDF version can be downloaded at the end of the article.

Spark essentials

Advantages of Apache Spark:

  • Compatible with Hadoop
  • Ease of development
  • Fast
  • Multiple language support
  • Unified stack: Batch, Streaming, Interactive Analytics

Transformation vs. Action:

  • A transformation returns an RDD. Since RDDs are immutable, a transformation always produces a new RDD.
  • An action will return a value.

Recently I have been learning hard to use Docker, in order to take advantage of the latest technology advances.

One problem: after doing a lot of experiments, I have many containers left in exited status (visible with docker ps -a), and the usual docker rm [container ID] is not efficient for cleaning them up one by one.

So keep in mind the following command, which removes all exited containers (and their anonymous volumes) gracefully: docker rm -v $(docker ps -a -q -f status=exited).
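
For reference, here is the same cleanup broken into steps, plus the shortcut available in newer Docker releases (1.13 and later):

# list only the IDs of exited containers
docker ps -a -q -f status=exited
# remove them all, together with their anonymous volumes
docker rm -v $(docker ps -a -q -f status=exited)
# on Docker 1.13+ this built-in does the same for all stopped containers
docker container prune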

In order to use SparkR in RStudio, we have to load the SparkR package installed under SPARK_HOME.

Here is a script to run at the beginning of the program:

# launch sparkR in R
Sys.setenv(SPARK_HOME='<spark_dir>')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
.libPaths()
library(SparkR)
sc <- sparkR.init(master='yarn-client')

Here is an example with my Hortonworks HDP configuration:

# launch sparkR in R
Sys.setenv(SPARK_HOME='/usr/hdp/current/spark-client')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
.libPaths()
library(SparkR)
sc <- sparkR.init(master='yarn-client')