Quick links
Latest articles
Internship
Members
New...
Algorithms
Discrete Mathematics
Big data
Languages
C
C++
C++ STL
Java
Data Structure
C#.Net
Android
Kotlin
SQL
Web
PHP
Python
JavaScript
CSS
Ajax
Node.js
Web prog.
Programs
C
C++
DS
Java
C#
Python
Aptitude
C
C++
Java
DBMS
Interview
C
Embedded C
Java
SEO
HR
CS Subjects
CS Basics
O.S.
Networks
DBMS
Embedded Systems
Cloud Computing
Machine learning
CS Organizations
Linux
DOS
More...
Articles
Puzzles
News/Updates

Home » Big Data

PCY Algorithm in Big Data Analytics



In this article, I am going to discuss a very important algorithm in big data analytics i.e PCY algorithm used for the frequent itemset mining.
Submitted by Uma Dasgupta, on September 12, 2018

PCY algorithm was developed by three Chinese scientists Park, Chen, and Yu. This is an algorithm used in the field of big data analytics for the frequent itemset mining when the dataset is very large.

Consider we have a huge collection of data, and in this data, we have a number of transactions. For example, if we buy any product online it’s transaction is being noted. Let, a person is buying a shirt from any site now, along with the shirt the site advised the person to buy jeans also, with some discount. So, we can see that how two different things are made into a single set and associated. The main purpose of this algorithm is to make frequent item sets say, along with shirt people frequently buy jeans.

For example:

    Transaction         Items bought
    Transaction 1       Shirt + jeans
    Transaction 2       Shirt + jeans +Trouser
    Transaction 3       Shirt +Tie
    Transaction 4       Shirt +jeans +shoes

So, from the above example we can see that shirt is most frequently bought along with jeans, so, it is considered as a frequent itemset.

An example problem solved using PCY algorithm

Question: Apply PCY algorithm on the following transaction to find the candidate sets (frequent sets).

Given data

    Threshold value or minimization value = 3
    Hash function= (i*j) mod 10

        T1  =   {1, 2, 3}
        T2  =   {2, 3, 4}
        T3  =   {3, 4, 5}
        T4  =   {4, 5, 6}
        T5  =   {1, 3, 5}
        T6  =   {2, 4, 6}
        T7  =   {1, 3, 4}
        T8  =   {2, 4, 5} 
        T9  =   {3, 4, 6}
        T10 =   {1, 2, 4}

Use buckets and concepts of Mapreduce to solve the above problem.

Solution:

  1. To identify the length or we can say repetition of each candidate in the given dataset.
  2. Reduce the candidate set to all having length 1.
  3. Map pair of candidates and find the length of each pair.
  4. Apply a hash function to find bucket no.
  5. Draw a candidate set table.

Step 1: Mapping all the elements in order to find their length.

    Items →    {1, 2, 3, 4, 5, 6}
    Key         1  2  3  4  5  6
    Value       4  6  8  8  6  4

Step 2: Removing all elements having value less than 1.

But here in this example there is no key having value less than 1. Hence, candidate set = {1, 2, 3, 4, 5, 6}

Step 3: Map all the candidate set in pairs and calculate their lengths.

    T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
    T2: {(2, 4) (3, 4)} = (3 4)
    T3: {(3, 5) (4, 5)} = (5, 3)
    T4: {(4, 5) (5, 6)} = (3, 2)
    T5: {(1, 5)} = 1
    T6: {(2, 6)} = 1
    T7: {(1, 4)} = 2
    T8: {(2, 5)} = 2
    T9: {(3, 6)} = 2
    T10:______
    T11:______
    T12:______

Note: Pairs should not get repeated avoid the pairs that are already written before.

Listing all the sets having length more than threshold value: {(1,3) (2,3) (2,4) (3,4) (3,5) (4,5) (4,6)}



Step 4: Apply Hash Functions. (It gives us bucket number)

    Hash Function = ( i * j) mod 10
    (1, 3) = (1*3) mod 10 = 3
    (2,3) = (2*3) mod 10 = 6
    (2,4) = (2*4) mod 10 = 8
    (3,4) = (3*4) mod 10 = 2
    (3,5) = (3*5) mod 10 = 5
    (4,5) = (4*5) mod 10 = 0
    (4,6) = (4*6) mod 10 = 4

Now, arrange the pairs according to the ascending order of their obtained bucket number.

    Bucket no.             Pair
    0	                  (4,5)
    2	                  (3,4)
    3	                  (1,3)
    4	                  (4,6)
    5	                  (3,5)
    6	                  (2,3)
    7	                  (2,4)

Step 5: In this final step we will prepare the candidate set.

Bit vector Bucket no. Highest Support Count Pairs Candidate Set
1 0 3 (4,5) (4,5)
1 2 4 (3,4) (3,4)
1 3 3 (1,3) (1,3)
1 4 3 (4,6) (4,6)
1 5 5 (3,5) (3,5)
1 6 3 (2,3) (2,3)
1 8 3 (2,4) (2,4)

Note: Highest support count is the no. of repetition of that vector.

Check the pairs which have the highest support count less than 3, and write those in the candidate set, if less than 3 then reject.

(NOTE: There are some exceptional cases where highest count support is less than 3, i.e. threshold value and for every candidate pair write bit vector as 1 means if HCS is greater than equal to threshold then bit vector is 1 otherwise 0).

Hence, The frequent itemsets are (4, 5), (3,4)

Conclusion:

In this article, we have discussed a very important algorithm i.e PCY algorithm used in Big Data Analytics. We have also solved a simple question to understand its application more clearly. Let me mention one more thing that this is also very important from the examination point of view, so it’s a must do the algorithm for all of you who have BDA as a subject in their academics. If you have any further queries shoot them in the comment section, will try to solve them as soon as possible. See you in my next article till then stay healthy and keep learning!






Quick links:
C FAQ(s) C Advance programs C/C++ Tips & Tricks Puzzles JavaScript CSS Python Linux Commands PHP Android Articles More...

Featured post:
Introduction to Linux (Its modes, Safety, Most popular Applications)
Linux Best Distribution Software (Distros) of 2018

Was this page helpful? Please share with your friends...

Are you a blogger? Join our Blogging forum.

Comments and Discussions



Languages: » C » C++ » C++ STL » Java » Data Structure » C#.Net » Android » Kotlin » SQL
Web Technologies: » PHP » Python » JavaScript » CSS » Ajax » Node.js » Web programming/HTML
Solved programs: » C » C++ » DS » Java » C#
Aptitude que. & ans.: » C » C++ » Java » DBMS
Interview que. & ans.: » C » Embedded C » Java » SEO » HR
CS Subjects: » CS Basics » O.S. » Networks » DBMS » Embedded Systems » Cloud Computing » Machine learning » CS Organizations » Linux » DOS
More: » Articles » Puzzles » News/Updates



© https://www.includehelp.com (2015-2018), Some rights reserved.