
Plagiarism detection in programming assignments


Summary of Research Papers from IIIT Hyd

Unsupervised Learning Based Approach for Plagiarism Detection in Programming Assignments

Jitendra Yasaswi Bharadwaj Katta, Srikailash G, Anil Chilupuri, Suresh Purini, C V Jawahar


Once there lived a group of ants. Because of the weather in the summer, winter, and rainy seasons, the ants decided to roll particles of soil out of the earth and build a safer place to live. Their group work took the form of an ant hill. Before the ants could enjoy the home they had built, a snake came from somewhere, occupied the place, and started living there. The hard work of the ants ended in frustration. This kind of stealing is plagiarism, and hiding it so that it goes unnoticed is called obfuscation.

Today's paper deals with plagiarism, specifically the automatic detection of plagiarism in programming assignments.


Martins define plagiarism as "the usage of work without crediting its authors".

Easy access to enormous amounts of web content has turned plagiarism into a serious problem.

The authors use student code submissions as the use case for evaluation, because they observe that code assignments are often copied from friends or other sources. A few smart students modify the copied code, for example by renaming variables or turning a while loop into a for loop. In some cases dummy code is introduced to evade detection.
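As a small illustration of the disguises described above, here is a hypothetical pair of submissions written for this summary (they are not taken from the paper). The second function renames the variables, swaps the while loop for a for loop, and adds dummy code, yet it computes exactly the same result.

```python
def total_original(values):
    """Original submission: sum a list with a while loop."""
    total = 0
    i = 0
    while i < len(values):
        total += values[i]
        i += 1
    return total


def total_disguised(nums):
    """Copied submission: renamed variables, a for loop, and dummy code."""
    acc = 0
    unused = 12345  # dummy statement that never affects the result
    for k in range(len(nums)):
        acc += nums[k]
    if False:
        acc = -1  # dead branch added only to change the text
    return acc


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    # Both functions return the same total for the same input.
    print(total_original(data), total_disguised(data))
```

The two functions look different as text, which is exactly what makes such copies hard to catch by reading or by plain text comparison.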




There are many code comparison tools on the market. They either employ a text-based approach or use features based on syntactic properties of the program. Both approaches succumb to code obfuscation, which is a huge obstacle for automatic software plagiarism detection.

Some of the techniques used by well-known tools are based on searching for similar n-grams, or small character sequences, between two source codes. Another paper proposed high-level features:

1. lexical features
2. stylistic features
3. comments features
4. programmers text features
5. structure features

For the lexical, comment, and programmer's text features, the source code is represented as a set of character n-grams. Such features suit natural language better than programming languages, so they contribute little to detection accuracy. All of the above techniques focus on the content of the source code, so introducing dummy code, like adsf in code snippet 3, masks a copied submission from recognition.
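To make the weakness of content-based comparison concrete, here is a minimal sketch of character n-gram matching. It assumes trigrams and Jaccard similarity, which are choices made for this illustration and not necessarily what any particular tool uses. The sample snippets are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ngram_similarity(src_a, src_b, n=3):
    """Similarity of two source strings based on shared character n-grams."""
    return jaccard(char_ngrams(src_a, n), char_ngrams(src_b, n))


# Hypothetical snippets: the same loop, then renamed, then padded with dummy code.
ORIGINAL = "int sum = 0; for (i = 0; i < n; i++) sum += a[i];"
RENAMED = "int tot = 0; for (k = 0; k < n; k++) tot += a[k];"
PADDED = RENAMED + " int adsf = 42; adsf = adsf * 2;"

if __name__ == "__main__":
    for label, other in [("renamed", RENAMED), ("renamed + dummy code", PADDED)]:
        print(label, round(ngram_similarity(ORIGINAL, other), 2))
```

Renaming the variables already lowers the score, and appending dummy code lowers it further, even though the program logic is unchanged.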

This paper proposes a hybrid approach to address automatic plagiarism detection.

 The key contributions of the paper are:
• Use of source code metrics (static code-based features) extracted during code compilation as feature representations of the student solutions to the given programming assignments.
• Unsupervised learning based approach to detect potential plagiarized cases.

Performance comparison


Method

The proposed method accepts as input a set of correct student solutions {x1, x2, x3, ..., xn}. Source code metrics are extracted for each submission and used as its feature representation, so that every solution is mapped to a point in a 55-dimensional space (the 55 features are extracted using MILEPOST GCC). Closeness between two solutions is defined by the Euclidean distance between their points, and t-SNE, a variation of Stochastic Neighbor Embedding, is used to view the points in a low-dimensional map.
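The distance step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the feature vectors are made-up 4-dimensional examples (the paper uses 55 MILEPOST GCC features), the function name flag_close_pairs is hypothetical, and flagging pairs below a threshold delta is assumed from the distance threshold (δ) mentioned in the paper's future work.

```python
import math
from itertools import combinations


def flag_close_pairs(features, delta):
    """Return (id_a, id_b, distance) for pairs closer than delta, nearest first."""
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(features.items()), 2):
        distance = math.dist(vec_a, vec_b)  # Euclidean distance between the two points
        if distance < delta:
            flagged.append((id_a, id_b, distance))
    return sorted(flagged, key=lambda item: item[2])


# Hypothetical metric vectors, one per submission (illustrative values only).
SUBMISSIONS = {
    "s1": [12.0, 3.0, 7.0, 1.0],
    "s2": [12.0, 3.0, 8.0, 1.0],  # nearly identical to s1
    "s3": [30.0, 9.0, 2.0, 5.0],
    "s4": [18.0, 5.0, 4.0, 2.0],
}

if __name__ == "__main__":
    for id_a, id_b, distance in flag_close_pairs(SUBMISSIONS, delta=2.0):
        print(f"{id_a} and {id_b}: distance {distance:.2f}")
```

With these made-up vectors only s1 and s2 fall under the threshold. Choosing delta is the hard part in practice, since a value that is too small misses partial copies and one that is too large flags independent solutions.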



T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualizing high-dimensional data. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. 
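The two stages can be sketched in a few lines. This is a simplified illustration, not a full t-SNE: it assumes one fixed Gaussian bandwidth sigma for all points (real t-SNE tunes a bandwidth per point from a perplexity setting), it normalizes over all pairs at once, and it only evaluates the Kullback-Leibler divergence for two candidate maps instead of minimizing it by gradient descent. All data and function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pair_probabilities(points, kernel):
    """Normalise kernel(squared distance) over all ordered pairs i != j."""
    n = len(points)
    weights = {
        (i, j): kernel(sq_dist(points[i], points[j]))
        for i in range(n) for j in range(n) if i != j
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def gaussian(sigma):
    """Stage 1 kernel: Gaussian similarity in the high-dimensional space."""
    return lambda d2: math.exp(-d2 / (2 * sigma ** 2))


def student_t(d2):
    """Stage 2 kernel: Student-t similarity (one degree of freedom) in the map."""
    return 1.0 / (1.0 + d2)


def kl_divergence(p, q):
    """KL(P || Q): the quantity t-SNE minimises over the map positions."""
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0)


def map_cost(high_points, low_points, sigma=1.0):
    p = pair_probabilities(high_points, gaussian(sigma))
    q = pair_probabilities(low_points, student_t)
    return kl_divergence(p, q)


# Points 0 and 1 are neighbours in the original space; point 2 is far away.
HIGH = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [5.0, 5.0, 5.0]]
GOOD_MAP = [[0.0, 0.0], [0.5, 0.0], [4.0, 4.0]]  # keeps 0 and 1 together
BAD_MAP = [[0.0, 0.0], [4.0, 4.0], [0.5, 0.0]]   # separates 0 and 1

if __name__ == "__main__":
    print("good map cost:", map_cost(HIGH, GOOD_MAP))
    print("bad map cost: ", map_cost(HIGH, BAD_MAP))
```

The map that keeps the two neighbouring points close gets the lower divergence, which is the behaviour the optimization in the second stage rewards.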


 


Results are analysed for the following cases:


Case 1: Absence of plagiarized cases
Case 2: Interchanging if-else code
Case 3: Type define the frequently called functions
Case 4: Presence of dead code
Case 5: Interchange the position of functions

Results vary between 50% and 70% across the different cases.




Future work

  • Identify and use additional dynamic features that could boost the performance of the method.
  • Propose a method for choosing a good distance threshold (δ), so that partially plagiarized solution pairs can be detected with confidence.












