Summary of Research Papers from IIIT Hyd
Unsupervised Learning Based Approach for Plagiarism Detection in Programming Assignments
Jitendra Yasaswi Bharadwaj Katta, Srikailash G, Anil Chilupuri, Suresh Purini, C V Jawahar
Once there lived a group of ants. To protect themselves from the summer heat, the winter cold, and the rains, the ants decided to roll particles of soil out of the earth and build a safer place to live. Their group work took the form of an anthill. Before the ants could enjoy the home they had built, a snake came from somewhere, occupied the place, and started living there. The ants' hard work ended in frustration. This kind of stealing of someone else's work is plagiarism.
Today's paper deals with plagiarism, specifically the automatic detection of plagiarism in programming assignments.
Martins defines plagiarism as
"the usage of work without crediting its authors".
Easy access to enormous amounts of web content has turned plagiarism into a serious problem.
The authors use students' programming assignment submissions as the use case for evaluation, since they observe that code assignments are often copied from friends or other sources. A few smart students modify the copied code, for example by renaming variables or changing a while loop to a for loop. In some cases dummy code is introduced to evade detection.
There are many code comparison tools on the market. They employ a text-based approach or use features based on properties of the program at the syntactic level. Both approaches succumb to code obfuscation, which is a huge obstacle for automatic software plagiarism detection.
Some of the techniques used by well-known tools are based on searching for similar n-grams, or small character sequences, in two source codes. Another paper proposed high-level features:
1. lexical features
2. stylistic features
3. comments features
4. programmers text features
5. structure features
Mainly for the lexical, comments, and programmer's text features, they represent source code as a set of character n-grams. These features capture the natural-language side of the code more than the programming-language side, so they contribute little to detection accuracy. All of the above techniques focus on the content of the source code, so introducing dummy code like adsf in code snippet 3 will mask a copy from recognition.
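To see why content-based comparison is fragile, here is a minimal sketch of a character n-gram comparison. It is not the algorithm of any particular tool: the helper names `char_ngrams` and `jaccard`, the choice of trigrams, and the two toy code strings are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy submission (illustrative only).
original = "total = 0\nfor i in range(n):\n    total += values[i]\n"
# Same logic, but with renamed variables and a while loop instead of a for loop.
disguised = "acc = 0\nk = 0\nwhile k < n:\n    acc += data[k]\n    k += 1\n"

same = jaccard(char_ngrams(original), char_ngrams(original))
changed = jaccard(char_ngrams(original), char_ngrams(disguised))
print(f"identical copies: {same:.2f}")
print(f"disguised copy:   {changed:.2f}")
```

The two snippets compute the same sum, yet the renamed, restructured version shares very few trigrams with the original, which is the weakness the paper sets out to avoid.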
This paper proposes a hybrid approach to address automatic plagiarism detection.
The key contributions of the paper are:
• Use of source code metrics (static code-based features) extracted during code compilation as feature representations of the student solutions to the given programming assignments.
• An unsupervised learning based approach to detect potentially plagiarized cases.
Performance comparison
Method
The proposed method accepts as input a set of correct student solutions {x1, x2, x3, . . . xn}. Source code metrics are extracted for each submission and used as its feature representation, so each solution is mapped to a point in an n-dimensional space (here n = 55; the features are extracted using MILEPOST GCC). Closeness between solutions is defined by the Euclidean distance between their points, and the points are visualized with t-SNE, a variation of Stochastic Neighbor Embedding.
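The closeness check can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors are made up and shortened to four values (the paper uses 55 per solution), the threshold value is hand-picked, and the names `features`, `suspicious_pairs`, and `delta` are hypothetical rather than taken from the paper.

```python
import math
from itertools import combinations

# Hypothetical, shortened feature vectors (not real MILEPOST GCC output).
features = {
    "student_a": [12.0, 3.0, 5.0, 1.0],
    "student_b": [12.0, 3.0, 5.0, 1.0],  # same metrics as student_a
    "student_c": [30.0, 9.0, 2.0, 4.0],
    "student_d": [13.0, 3.0, 5.0, 1.0],  # nearly the same as student_a
}


def suspicious_pairs(vectors, delta):
    """Return (name1, name2, distance) for every pair within delta, nearest first."""
    pairs = []
    for (name1, v1), (name2, v2) in combinations(vectors.items(), 2):
        d = math.dist(v1, v2)  # Euclidean distance between the two points
        if d <= delta:
            pairs.append((name1, name2, d))
    return sorted(pairs, key=lambda p: p[2])


# delta is an assumed, hand-picked threshold for this toy data.
flagged = suspicious_pairs(features, delta=1.5)
for name1, name2, d in flagged:
    print(f"{name1} ~ {name2}: distance {d:.2f}")
```

Pairs that land within the threshold are only candidates for manual review; how to choose the threshold well is left open by the paper.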
t-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.
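The two stages can be written out as follows. These are the standard t-SNE definitions rather than anything specific to the paper, and the symbols are assumptions of this note: N is the number of objects, x_i is an object in the high-dimensional space, y_i is its point in the low-dimensional map, and sigma_i is a per-object bandwidth.

```latex
% Stage 1: similarities between high-dimensional objects
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}

% Stage 2: similarities between points in the low-dimensional map
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized with respect to the map points y_i
\mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed form of q_ij comes from a Student t-distribution with one degree of freedom, which is what the "t" in t-SNE refers to.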
Case1: Absence of plagiarized cases
Case2: Interchanging if-else code
Case3: Type define the frequently called functions
Case4: Presence of dead code
Case5: Interchange the position of functions
Results vary between 50% and 70% across the different cases.
Future work
- Identify and use additional dynamic features that could boost the performance of the method.
- Propose a method to decide a good distance threshold (δ) that enables partially plagiarized solution pairs to be detected confidently.