Skip to main content

Plagiarism detection in programming assignments


Summary of Research Papers from IIIT Hyd

Unsupervised Learning Based Approach for Plagiarism Detection inProgramming Assignments

Jitendra Yasaswi Bharadwaj katta, Srikailash G, Anil Chilupuri, Suresh Purini, C V Jawahar 


Once there lived group of ants. Due to weather conditions like summer winter and rainy seasons Ants decided to roll out particles of soil and take it to out of the earth and make safer place to live in. Group work manifested and took form of ant hill. Before experiencing the fruits of the designed home by ants. A snake came from some where, occupied the place and started living. The hard-work of ants resulted in frustration. These kind of stealing is called obfuscation.

Today's paper deals with Plagiarism. Automatic Detection of Plagiarism in programming Assignments.


Martins define plagiarism as
 " the usage of work without crediting its authors".

Easy access to enormous web content has turned plagiarism in to serious problem.

Authors took student submission code assignments as use case for evaluation, as they observe that code assignments are copied from friends or other sources. Few smart students modify content like variable names, while loop to for loop. In some cases dummy code is introduced to evade from being detected.




There are many code comparison tools out in the market. They employ a text-based approach or use features based on the property of the program at a syntactic level. Both of these approaches succumb to code obfuscation which is a huge obstacle for automatic software plagiarism detection.

Few of the techniques used by well known tools are based on searching similar n-grams or small character sequences between two source codes. other paper proposed high level features

1. lexical features
2. stylistic features
3. comments features
4. programmers text features
5. structure features

mainly for lexical, comments and programmer's text features, they represent source code as a set of characters n-grams. These features results in more natural language than programming language. Hence not much relevance to get accuracy. all above techniques focuses on content of the source code. introducing dummy code like adsf in code snippet 3 will mask from recognition.

This paper proposes a hybrid approach to address automatic plagiarism detection.

 The key contributions of our work are:
• Use of source code metrics (static code-based features)
extracted during code compilation as feature representations of the the student solutions to the given programming assignments.
• Unsupervised learning based approach to detect potential plagiarized cases.

Performance comparison


Method

Proposed method accepts as input a set of correct student solutions. {x1,x2,x3 . . . xn}. extract source code metrics for each submission use them as feature representations and solution is mapped to a point in an n-dimensional (here - n = 55, features are extracted using MILEPOST GCC). The closeness is defined by Euclidean distance using t-SNE , a variation of Stochastic Neighbor Embedding



T-distributed Stochastic Neighbor Embedding (t-SNE) is a ml algorithm for visualizing model. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. 


 


Results are analysed in below cases


Case1: Absence of plagiarized cases
Case2: Interchanging if-else code
Case3: Type define the frequently called functions
Case4: Presence of dead code
Case5: Interchange the position of functions

Results vary between 50% to 70% among different cases




Future work

  • Identify and use additional dynamic features that could boost the performance of our method.
  • Proposing a method to decide a good distance threshold (δ), which enables to detect partial plagiarized solution pairs confidently.













Comments

Hadi shaikh said…
Plagiarism checker reddit I think this is an informative post and it is very useful and knowledgeable. therefore, I would like to thank you for the efforts you have made in writing this article.

Popular posts from this blog

TableSense: Spreadsheet Table Detection with Convolutional Neural Networks

 - By Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang Microsoft Research, Beijing 100080, China. Beihang University, Beijing 100191, China Paper Link Abstract Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as spreadsheet and a pixel matrix as image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for...

ABOD and its PyOD python module

Angle based detection By  Hans-Peter Kriegel, Matthias Schubert, Arthur Zimek  Ludwig-Maximilians-Universität München  Oettingenstr. 67, 80538 München, Germany Ref Link PyOD By  Yue Zhao   Zain Nasrullah   Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada  Zheng Li jk  Northeastern University Toronto, Toronto, ON M5X 1E2, Canada I am combining two papers to summarize Anomaly detection. First one is Angle Based Outlier Detection (ABOD) and other one is python module that  uses ABOD along with over 20 other apis (PyOD) . This is third part in the series of Anomaly detection. First article exhibits survey that covered length and breadth of subject, Second article highlighted on data preparation and pre-processing.  Angle Based Outlier Detection. Angles are more stable than distances in high dimensional spaces for example the popularity of cosine-based sim...

Bike sharing Dynamic Re-positioning

-By Xinghua Zheng1, Ming Tang1, Hankz Hankui Zhuo1*, Kevin X. Wen Paper Link Abstract Bike Sharing Systems (BSSs) have been adopted in many major cities of the world due to traffic congestion and carbon emissions. Although there have been approaches to exploiting either bike trailers via crowdsourcing or carrier vehicles to reposition bikes in the “right” stations in the “right” time, they do not jointly consider the usage of both bike trailers and carrier vehicles. In this paper, we aim to take advantage of both bike trailers and carrier vehicles to reduce the loss of demand with regard to the crowdsourcing of bike trailers and the fuel cost of carrier vehicles. In the experiment, we exhibit that our approach outperforms baselines in several datasets from bike sharing companies. Bike-sharing systems (BSSs) typically have a set of base stations that are strategically placed throughout a city and each station has a fixed number of docks, e.g., Capital Bike-share, ...