Building a Simple Indonesian Article Recommendation System

In this digital era, there is an abundance of articles. They can appear anywhere, in the form of news, fiction, and much more. Problems start to arise when this abundance of articles is not handled smartly. People become confused, not because of a lack of information, but because they don’t know where to start in this “flood” of articles.

This is where article recommendation comes into action. Many companies and organizations employ recommender systems so that their customers can have a better experience when using their services. It is one of the most applicable fields in which machine learning can take part in real-world situations.

Types of Recommender Systems

In general, recommender systems can be separated into at least three categories.

1. Content-based

A content-based recommender system uses the content of an object as the basis for its recommendations. For example, on an article website, the recommender system will use the title or the body of the articles to give recommendations. Because the recommendations follow directly from the content, the results tend to be precise and easy to explain.

2. Collaborative filtering

A collaborative filtering recommender system uses the behavior of the users in the application. For example, on an article website, the recommender system will try to give user A an article that has been read by user B, because it has found that user A and user B have similar reading behavior. This approach can give more serendipity to the users (which is quite good) but suffers from the cold-start problem: a condition where the system hasn’t gathered enough user behavior data, so it can’t give good recommendations.

3. Hybrid system

This approach mixes the two approaches mentioned above so that the recommendations can avoid the cold-start problem and still offer high serendipity.
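
To make the collaborative filtering idea a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration, not part of the system built in this post: the reading history, the function names, and the choice of Jaccard overlap as the measure of "similar behavior" are all assumptions made for the example.

```python
def jaccard(a, b):
    """Overlap between two sets of article ids, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_from_similar_user(reads, user):
    """Suggest articles read by the most similar other user but not by `user`."""
    others = [u for u in reads if u != user]
    if not others:
        return []  # no other users to compare against
    nearest = max(others, key=lambda u: jaccard(reads[user], reads[u]))
    return sorted(reads[nearest] - reads[user])


# Hypothetical reading history: user -> set of article ids.
reads = {
    "A": {"art1", "art2", "art3"},
    "B": {"art1", "art2", "art3", "art4"},
    "C": {"art9"},
}
suggestions = recommend_from_similar_user(reads, "A")
print(suggestions)  # user B is closest to A, so A is offered B's unread article
```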

Article Recommendation System

In this post, I will give an example of an Indonesian article recommendation system that uses the content-based approach. The article data has three main parts that can be used to represent its content.

  1. Title
  2. Tags
  3. Body/content

For this simple example, we will use only the body to represent the content of the article. The main part of a content-based recommendation system is how we represent the given data as a vector that the system can understand. This step plays an important role in determining the quality of the recommendations.

First of all, we will convert all the characters to lower case, since we don’t need to know whether a character is upper-case or lower-case. Then, we remove the stopwords. Those words carry little information about the content of an article, so it is better just to remove them. Last, we will stem each word to its “base” form, so that we can retrieve information more easily (a more detailed reason why we want to do stemming can be read in this very insightful reference).

We use PySastrawi as the library for the preprocessing steps mentioned above, since our data is in Indonesian. For other languages you can use a natural language toolkit of your choice, such as nltk.
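
The three steps above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code behind this post: the function names `preprocess` and `make_sastrawi_tools` are made up for the example, and the extra step that strips non-letter characters is my own assumption. The stopword remover and stemmer are passed in as plain functions, so PySastrawi (imported as `Sastrawi`) is only needed when `make_sastrawi_tools()` is actually called.

```python
import re


def preprocess(text, remove_stopwords, stem):
    """Lower-case, strip non-letters, remove stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # assumption: keep letters only
    text = remove_stopwords(text)
    text = stem(text)
    return " ".join(text.split())  # collapse repeated whitespace


def make_sastrawi_tools():
    """Build the Indonesian stopword remover and stemmer from PySastrawi."""
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
    from Sastrawi.StopWordRemover.StopWordRemoverFactory import (
        StopWordRemoverFactory,
    )

    stemmer = StemmerFactory().create_stemmer()
    remover = StopWordRemoverFactory().create_stop_word_remover()
    return remover.remove, stemmer.stem


# Typical use, with PySastrawi installed:
#   remove_stopwords, stem = make_sastrawi_tools()
#   clean_body = preprocess(article_body, remove_stopwords, stem)
```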

preprocessing step

After finishing the preprocessing step, we are ready to build the vector representation for each article. We will use the Term Frequency-Inverse Document Frequency (TF-IDF) method to build the vector. TF-IDF gives a word a high weight when it appears often in one article but rarely across the whole collection. It is one of the basic and widely used methods in text mining. We will use scikit-learn as the library for building the article vectors.
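
As a rough sketch of this step, scikit-learn's `TfidfVectorizer` can turn a list of preprocessed article bodies into a TF-IDF matrix. The three short strings below are toy stand-ins for real article bodies, and the default vectorizer settings are an assumption, not necessarily the ones used for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for article bodies that were already preprocessed.
articles = [
    "harga beras naik pasar",
    "harga cabai naik pasar",
    "tim sepak bola menang tanding",
]

vectorizer = TfidfVectorizer()
# Sparse matrix: one row per article, one column per word in the vocabulary.
tfidf_matrix = vectorizer.fit_transform(articles)

print(tfidf_matrix.shape)  # (number of articles, vocabulary size)
```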

convert article to vector

Now we have the vector representation of each article. The question is, how do we use these vectors to build a recommendation system?

The answer is by calculating similarity. If a user likes article A, we will recommend articles that are similar to article A. To find the similar articles, we calculate the vector-to-vector distance and recommend the articles that are closest to article A.
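
One common way to do this, sketched below, is cosine similarity, where a higher score means a closer article. Treat the choice of cosine similarity as an assumption of this sketch; other distance measures would also work. The toy articles are rebuilt here so the snippet runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for preprocessed article bodies.
articles = [
    "harga beras naik pasar",
    "harga cabai naik pasar",
    "tim sepak bola menang tanding",
]
tfidf_matrix = TfidfVectorizer().fit_transform(articles)

query_idx = 0  # the article the user liked ("article A")

# Similarity between the liked article and every article in the collection.
similarities = cosine_similarity(
    tfidf_matrix[query_idx:query_idx + 1], tfidf_matrix
).ravel()

# Indices ordered from most similar to least similar.
# The liked article itself comes first, since it is identical to itself.
sorted_idx = np.argsort(-similarities)
```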

getting similarity distance

The sorted_idx will contain the indices of the articles, ordered from the most similar to the least similar. Then we can use sorted_idx to retrieve the recommended articles, based on how many recommendations we want to give to the user.
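
A small sketch of that last step is shown below. The helper name `top_recommendations`, the titles, and the hard-coded `sorted_idx` are made up for the example, and skipping the liked article itself is my own assumption about what we want to show the user.

```python
def top_recommendations(sorted_idx, query_idx, titles, n=2):
    """Return the titles of the n most similar articles, skipping the query."""
    picks = [i for i in sorted_idx if i != query_idx][:n]
    return [titles[i] for i in picks]


# Hypothetical titles, in the same order as the article vectors.
titles = ["Harga beras naik", "Harga cabai naik", "Tim sepak bola menang"]
sorted_idx = [0, 1, 2]  # example ranking for the article at index 0

recommended = top_recommendations(sorted_idx, query_idx=0, titles=titles, n=1)
print(recommended)
```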

Hope this article gives you some insight into how a simple article recommendation system can be built :))

AI and Machine Learning Enthusiast. Interested in Deep Learning and Computer Vision.