
Plagiarism detection in programming assignments


Summary of Research Papers from IIIT Hyd

Unsupervised Learning Based Approach for Plagiarism Detection in Programming Assignments

Jitendra Yasaswi Bharadwaj Katta, Srikailash G, Anil Chilupuri, Suresh Purini, C V Jawahar


Once there lived a group of ants. Because of the changing weather through summer, winter and the rainy season, the ants decided to dig out particles of soil, carry them to the surface and build a safer place to live. Their group work took the form of an anthill. Before the ants could enjoy the fruits of the home they had designed, a snake came from somewhere, occupied the place and started living there. The ants' hard work ended in frustration. This kind of stealing is called plagiarism.

Today's paper deals with plagiarism, specifically the automatic detection of plagiarism in programming assignments.


Martins define plagiarism as "the usage of work without crediting its authors".

Easy access to enormous amounts of web content has turned plagiarism into a serious problem.

The authors took students' code submissions for programming assignments as the use case for evaluation, as they observe that code assignments are copied from friends or other sources. A few smart students modify the content, for example by renaming variables or changing a while loop to a for loop. In some cases dummy code is introduced to evade detection.
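To make these edits concrete, here is a small illustrative sketch. The two functions and their names are made up for this post and are not taken from the paper: the second is the first after renaming variables, swapping the while loop for a for loop and adding a dummy statement.

```python
def sum_to_n_original(n):
    """Reference solution: sum of 1..n using a while loop."""
    total = 0
    i = 1
    while i <= n:
        total += i
        i += 1
    return total


def sum_to_n_disguised(limit):
    """Same logic after typical edits: renamed variables, for loop, dummy code."""
    acc = 0
    unused = 12345  # dummy statement that never affects the result
    for k in range(1, limit + 1):
        acc += k
    return acc


# Both versions agree on every input tried here.
same_behaviour = all(
    sum_to_n_original(n) == sum_to_n_disguised(n) for n in range(50)
)
print(same_behaviour)
```

The text of the two functions differs almost everywhere, yet they compute the same thing, which is what makes such copies hard to spot by reading alone.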




There are many code comparison tools on the market. They either employ a text-based approach or use features based on properties of the program at a syntactic level. Both of these approaches succumb to code obfuscation, which is a huge obstacle for automatic software plagiarism detection.

A few of the techniques used by well-known tools are based on searching for similar n-grams or small character sequences shared between two source codes. Another paper proposed high-level features:

1. lexical features
2. stylistic features
3. comments features
4. programmers text features
5. structure features

Mainly for the lexical, comments and programmer's text features, they represent source code as a set of character n-grams. These features suit natural language more than programming languages, and hence contribute little to accuracy. All of the above techniques focus on the content of the source code, so introducing dummy code, like adsf in code snippet 3, will mask a copy from recognition.
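As a rough illustration of why content-based comparison is fragile, the sketch below compares character 3-grams with Jaccard similarity. This is a generic textbook measure chosen for this post, not the scoring used by any particular tool, and the code strings are invented.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams after collapsing whitespace."""
    compact = " ".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "total = 0\nfor i in range(n):\n    total += i\nprint(total)"
copied = original  # verbatim copy
# Renamed variables plus a dummy assignment ("adsf") added by the copier.
disguised = "acc = 0\nadsf = 99\nfor k in range(n):\n    acc += k\nprint(acc)"

sim_copy = jaccard(char_ngrams(original), char_ngrams(copied))
sim_disguised = jaccard(char_ngrams(original), char_ngrams(disguised))
print(sim_copy, sim_disguised)
```

A verbatim copy scores 1.0, while the lightly disguised copy scores lower even though the logic is unchanged.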

This paper proposes a hybrid approach to address automatic plagiarism detection.

The key contributions of the work are:
• Use of source code metrics (static code-based features) extracted during code compilation as feature representations of the student solutions to the given programming assignments.
• Unsupervised learning based approach to detect potential plagiarized cases.
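The idea behind structural features can be sketched in a few lines. The paper extracts its metrics with MILEPOST GCC during compilation; the snippet below does not do that. It swaps in counts of Python AST node types, computed by a hypothetical helper node_type_counts, purely as a stand-in to show how a structural representation can stay the same under renaming and reordering.

```python
import ast
from collections import Counter


def node_type_counts(source):
    """Count AST node types; a stand-in for compiler-extracted code metrics."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


original = """
def square(x):
    return x * x

def cube(x):
    return x * x * x
"""

# Same program with the functions swapped and identifiers renamed.
disguised = """
def third_power(value):
    return value * value * value

def second_power(value):
    return value * value
"""

features_match = node_type_counts(original) == node_type_counts(disguised)
print(features_match)
```

Because the counts ignore names and ordering, the two programs map to the same feature values even though their text differs.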

Performance comparison


Method

The proposed method accepts as input a set of correct student solutions {x1, x2, x3, ..., xn}. Source code metrics are extracted for each submission and used as its feature representation, so each solution is mapped to a point in a 55-dimensional space (the 55 features are extracted using MILEPOST GCC). Closeness between solutions is defined by Euclidean distance, and t-SNE, a variation of Stochastic Neighbor Embedding, is used to visualize the points.
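A minimal sketch of the distance step, assuming the feature vectors are already available. The function flag_close_pairs, the 4-dimensional vectors and the threshold value are all made up for illustration; the paper works with 55 features and leaves the choice of threshold as future work.

```python
import math
from itertools import combinations


def flag_close_pairs(features, delta):
    """Return (id_a, id_b, distance) for pairs closer than delta, nearest first."""
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(features.items(), 2):
        distance = math.dist(vec_a, vec_b)  # Euclidean distance (Python 3.8+)
        if distance < delta:
            flagged.append((id_a, id_b, distance))
    return sorted(flagged, key=lambda item: item[2])


# Hypothetical 4-dimensional feature vectors, one per submission.
submissions = {
    "s1": [12.0, 3.0, 7.0, 1.0],
    "s2": [12.0, 3.0, 8.0, 1.0],  # differs from s1 in a single feature
    "s3": [30.0, 9.0, 2.0, 5.0],
}

suspects = flag_close_pairs(submissions, delta=2.0)
print(suspects)
```

Pairs that land very close together in feature space are the ones worth a manual look; distant pairs are unlikely to be copies of each other.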



t-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. 
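The two stages can be sketched with plain Python. This is a simplified illustration and not a full t-SNE: it uses a single fixed Gaussian width instead of the per-point widths that t-SNE tunes via perplexity, and it only evaluates the Kullback–Leibler divergence for two hand-made maps rather than minimizing it. All points and helper names are made up.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian(d2):
    """High-dimensional similarity with fixed sigma = 1 (a simplification)."""
    return math.exp(-d2 / 2.0)


def student_t(d2):
    """Heavy-tailed Student-t similarity used for the low-dimensional map."""
    return 1.0 / (1.0 + d2)


def joint_probs(points, kernel):
    """Normalise kernel(squared distance) over all ordered pairs i != j."""
    n = len(points)
    weights = {(i, j): kernel(sq_dist(points[i], points[j]))
               for i in range(n) for j in range(n) if i != j}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def kl_divergence(p, q):
    """KL(P || Q) summed over all pairs with non-zero p."""
    return sum(p_ij * math.log(p_ij / q[pair])
               for pair, p_ij in p.items() if p_ij > 0)


high_dim = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [3.0, 3.0, 3.0]]
good_map = [[0.0, 0.0], [0.1, 0.1], [4.0, 4.0]]  # keeps the close pair close
bad_map = [[0.0, 0.0], [4.0, 4.0], [0.1, 0.1]]   # pulls the close pair apart

p = joint_probs(high_dim, gaussian)
kl_good = kl_divergence(p, joint_probs(good_map, student_t))
kl_bad = kl_divergence(p, joint_probs(bad_map, student_t))
print(kl_good < kl_bad)
```

The map that keeps similar objects together has the lower divergence, which is the quantity the real algorithm drives down by moving the map points.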


 


The results are analysed for the cases below:


Case1: Absence of plagiarized cases
Case2: Interchanging if-else code
Case3: Type define the frequently called functions
Case4: Presence of dead code
Case5: Interchange the position of functions

Results vary between 50% and 70% across the different cases.




Future work

  • Identify and use additional dynamic features that could boost the performance of the method.
  • Propose a method to decide a good distance threshold (δ) that makes it possible to detect partially plagiarized solution pairs confidently.












