PCY Algorithm in Big Data Analytics

In this article, I am going to discuss a very important algorithm in big data analytics i.e PCY algorithm used for the frequent itemset mining. Submitted by Uma Dasgupta, on September 12, 2018

PCY Algorithm in Big Data Analytics

PCY algorithm was developed by three Chinese scientists Park, Chen, and Yu. This is an algorithm used in the field of big data analytics for the frequent itemset mining when the dataset is very large.

Consider we have a huge collection of data, and in this data, we have a number of transactions. For example, if we buy any product online it’s transaction is being noted. Let, a person is buying a shirt from any site now, along with the shirt the site advised the person to buy jeans also, with some discount. So, we can see that how two different things are made into a single set and associated. The main purpose of this algorithm is to make frequent item sets say, along with shirt people frequently buy jeans.

Example

For example:

Transaction         Items bought
Transaction 1       Shirt + jeans
Transaction 2       Shirt + jeans +Trouser
Transaction 3       Shirt +Tie
Transaction 4       Shirt +jeans +shoes

So, from the above example we can see that shirt is most frequently bought along with jeans, so, it is considered as a frequent itemset.

An example problem solved using PCY algorithm

Question: Apply PCY algorithm on the following transaction to find the candidate sets (frequent sets).

Given data

    Threshold value or minimization value = 3
    Hash function= (i*j) mod 10

        T1  =   {1, 2, 3}
        T2  =   {2, 3, 4}
        T3  =   {3, 4, 5}
        T4  =   {4, 5, 6}
        T5  =   {1, 3, 5}
        T6  =   {2, 4, 6}
        T7  =   {1, 3, 4}
        T8  =   {2, 4, 5} 
        T9  =   {3, 4, 6}
        T10 =   {1, 2, 4}
        T11 =   {2, 3, 5}
        T12=    {3, 4, 6}

Use buckets and concepts of Mapreduce to solve the above problem.

Solution

To identify the length or we can say repetition of each candidate in the given dataset.
Reduce the candidate set to all having length 1.
Map pair of candidates and find the length of each pair.
Apply a hash function to find bucket no.
Draw a candidate set table.

Step 1: Mapping all the elements in order to find their length.

    Items →    {1, 2, 3, 4, 5, 6}
    Key         1  2  3  4  5  6
    Value       4  6  8  8  6  4

Step 2: Removing all elements having value less than 1.

But here in this example there is no key having value less than 1. Hence, candidate set = {1, 2, 3, 4, 5, 6}

Step 3: Map all the candidate set in pairs and calculate their lengths.

    T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
    T2: {(2, 4) (3, 4)} = (3 4)
    T3: {(3, 5) (4, 5)} = (5, 3)
    T4: {(4, 5) (5, 6)} = (3, 2)
    T5: {(1, 5)} = 1
    T6: {(2, 6)} = 1
    T7: {(1, 4)} = 2
    T8: {(2, 5)} = 2
    T9: {(3, 6)} = 2
    T10:______
    T11:______
    T12:______

Note: Pairs should not get repeated avoid the pairs that are already written before.

Listing all the sets having length more than threshold value: {(1,3) (2,3) (2,4) (3,4) (3,5) (4,5) (4,6)}

Step 4: Apply Hash Functions. (It gives us bucket number)

    Hash Function = ( i * j) mod 10
    (1, 3) = (1*3) mod 10 = 3
    (2,3) = (2*3) mod 10 = 6
    (2,4) = (2*4) mod 10 = 8
    (3,4) = (3*4) mod 10 = 2
    (3,5) = (3*5) mod 10 = 5
    (4,5) = (4*5) mod 10 = 0
    (4,6) = (4*6) mod 10 = 4

Now, arrange the pairs according to the ascending order of their obtained bucket number.

    Bucket no.             Pair
    0	                  (4,5)
    2	                  (3,4)
    3	                  (1,3)
    4	                  (4,6)
    5	                  (3,5)
    6	                  (2,3)
    8	                  (2,4)

Step 5: In this final step we will prepare the candidate set.

Bit vector	Bucket no.	Highest Support Count	Pairs	Candidate Set
1	0	3	(4,5)	(4,5)
1	2	4	(3,4)	(3,4)
1	3	3	(1,3)	(1,3)
1	4	3	(4,6)	(4,6)
1	5	5	(3,5)	(3,5)
1	6	3	(2,3)	(2,3)
1	8	3	(2,4)	(2,4)

Note: Highest support count is the no. of repetition of that vector.

Check the pairs which have the highest support count less than 3, and write those in the candidate set, if less than 3 then reject.

(NOTE: There are some exceptional cases where highest count support is less than 3, i.e. threshold value and for every candidate pair write bit vector as 1 means if HCS is greater than equal to threshold then bit vector is 1 otherwise 0).

Hence, The frequent itemsets are (4, 5), (3,4)

Conclusion

In this article, we have discussed a very important algorithm i.e PCY algorithm used in Big Data Analytics. We have also solved a simple question to understand its application more clearly. Let me mention one more thing that this is also very important from the examination point of view, so it’s a must do the algorithm for all of you who have BDA as a subject in their academics. If you have any further queries shoot them in the comment section, will try to solve them as soon as possible. See you in my next article till then stay healthy and keep learning!

Big Data - 5 Must-have Skills

Big Data - Dimensions