Plagiarism detection in programming assignments


Summary of Research Papers from IIIT Hyd

Unsupervised Learning Based Approach for Plagiarism Detection in Programming Assignments

Jitendra Yasaswi Bharadwaj Katta, Srikailash G, Anil Chilupuri, Suresh Purini, C V Jawahar


Once there lived a group of ants. Because of the changing weather across summer, winter and the rainy season, the ants decided to roll particles of soil out of the earth and build a safer place to live. Their group work took the form of an ant hill. Before the ants could enjoy the home they had built, a snake came from somewhere, occupied the place and started living in it. The ants' hard work ended in frustration. This kind of stealing is plagiarism, and disguising it to escape detection is called obfuscation.

Today's paper deals with plagiarism, specifically the automatic detection of plagiarism in programming assignments.


The paper cites Martins for the definition of plagiarism: "the usage of work without crediting its authors".

Easy access to enormous amounts of web content has turned plagiarism into a serious problem.

The authors use student code submissions as the use case for evaluation, since they observe that programming assignments are often copied from friends or other sources. A few smart students modify the copied code, for example by renaming variables or converting a while loop into a for loop. In some cases dummy code is introduced to evade detection.




Many code comparison tools are available. They either employ a text-based approach or use features based on syntactic properties of the program. Both approaches succumb to code obfuscation, which is a huge obstacle for automatic software plagiarism detection.

Some of the techniques used by well-known tools search for similar n-grams, or small character sequences, shared between two source codes. Another paper proposed high-level features:

1. lexical features
2. stylistic features
3. comments features
4. programmers text features
5. structure features

Mainly for the lexical, comments and programmer's text features, they represent source code as a set of character n-grams. These features capture natural language more than programming language, so they contribute little to accuracy. All of the above techniques focus on the content of the source code, so introducing dummy code like adsf in code snippet 3 will mask a copy from recognition.
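To make the weakness of content-based comparison concrete, here is a minimal sketch of a character n-gram comparison. It is not the implementation of any particular tool: the function names, the choice of 4-grams, the Jaccard measure and both C snippets are assumptions made for illustration.

```python
def char_ngrams(source, n=4):
    """Return the set of character n-grams after collapsing whitespace."""
    text = " ".join(source.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 means the sets are identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical student submission (C source held as a string).
original = """
int sum(int *a, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += a[i];
    return total;
}
"""

# Same logic after renaming, a for -> while rewrite, and a dead variable.
obfuscated = """
int calc(int *v, int len) {
    int adsf = 42;
    int acc = 0, k = 0;
    while (k < len) { acc += v[k]; k++; }
    return acc;
}
"""

sim_same = jaccard(char_ngrams(original), char_ngrams(original))
sim_obf = jaccard(char_ngrams(original), char_ngrams(obfuscated))
print(f"original vs original:   {sim_same:.2f}")
print(f"original vs obfuscated: {sim_obf:.2f}")
```

The second score falls below 1.0 although both functions compute the same thing, which is the kind of masking described above.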

This paper proposes a hybrid approach to address automatic plagiarism detection.

 The key contributions of the paper are:
• Use of source code metrics (static code-based features) extracted during code compilation as feature representations of the student solutions to the given programming assignments.
• Unsupervised learning based approach to detect potentially plagiarized cases.

Performance comparison


Method

The proposed method accepts as input a set of correct student solutions {x1, x2, x3, ..., xn}. Source code metrics are extracted for each submission and used as its feature representation, so every solution is mapped to a point in a 55-dimensional space (the 55 features are extracted using MILEPOST GCC). The closeness of two solutions is defined by the Euclidean distance between their points, and the points are visualized using t-SNE, a variation of Stochastic Neighbor Embedding.
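The distance step can be sketched in a few lines. This is only an illustration under stated assumptions: the feature vectors are made up and have 4 values instead of 55, the names `features`, `DELTA` and `suspicious_pairs` are hypothetical, and the threshold value is not taken from the paper.

```python
import math
from itertools import combinations

# Hypothetical feature vectors: the method uses 55 compiler-derived
# metrics per submission; 4 made-up values are used here for brevity.
features = {
    "student_a": [12.0, 3.0, 5.0, 1.0],
    "student_b": [12.0, 3.0, 5.0, 2.0],  # nearly identical to student_a
    "student_c": [30.0, 9.0, 1.0, 7.0],
}

DELTA = 2.0  # assumed distance threshold, not a value from the paper


def suspicious_pairs(vectors, delta):
    """Return (name1, name2, distance) for pairs closer than delta, nearest first."""
    flagged = []
    for (name1, v1), (name2, v2) in combinations(sorted(vectors.items()), 2):
        distance = math.dist(v1, v2)  # Euclidean distance (Python 3.8+)
        if distance < delta:
            flagged.append((name1, name2, distance))
    return sorted(flagged, key=lambda item: item[2])


pairs = suspicious_pairs(features, DELTA)
for name1, name2, distance in pairs:
    print(f"{name1} ~ {name2}: distance {distance:.2f}")
```

With these toy vectors only the pair student_a and student_b falls under the threshold. In practice the features would probably need scaling before distances are compared, which this sketch leaves out.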



T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualizing high-dimensional data. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. 
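The two stages can be written down as a small sketch that only evaluates the t-SNE objective for a given map. It is a simplification, not a full implementation: it uses one fixed Gaussian bandwidth `sigma` for every point (real t-SNE tunes a bandwidth per point from a perplexity setting) and it does not perform the gradient-based minimization. All function names and sample points are assumptions.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def high_dim_probs(points, sigma=1.0):
    """Stage 1: symmetric Gaussian pair probabilities P (fixed sigma assumed)."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])
    # Symmetrize so that all p_ij together sum to 1.
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def low_dim_probs(embedding):
    """Stage 2: Student-t (one degree of freedom) pair probabilities Q."""
    n = len(embedding)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) over all pairs i != j; eps guards against log(0)."""
    n = len(p)
    return sum(
        p[i][j] * math.log((p[i][j] + eps) / (q[i][j] + eps))
        for i in range(n) for j in range(n) if i != j
    )


# Two tight clusters in 3-D (made-up data).
points = [(0, 0, 0), (0.1, 0, 0), (5, 5, 5), (5.1, 5, 5)]
good_map = [(0, 0), (0.1, 0), (5, 5), (5.1, 5)]  # keeps neighbours together
bad_map = [(0, 0), (5, 5), (0.1, 0), (5.1, 5)]   # separates the neighbours

P = high_dim_probs(points)
kl_good = kl_divergence(P, low_dim_probs(good_map))
kl_bad = kl_divergence(P, low_dim_probs(bad_map))
print(f"KL for neighbour-preserving map: {kl_good:.3f}")
print(f"KL for scrambled map:            {kl_bad:.3f}")
```

The map that keeps neighbours together gets the smaller divergence, which is the quantity t-SNE drives down when it moves the low-dimensional points.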


 


Results are analysed for the following cases:


Case 1: Absence of plagiarized cases
Case 2: Interchanging if-else code
Case 3: Type define the frequently called functions
Case 4: Presence of dead code
Case 5: Interchange the position of functions

Results vary between 50% and 70% across the different cases.




Future work

  • Identify and use additional dynamic features that could boost the performance of the method.
  • Propose a method for deciding a good distance threshold (δ), which would make it possible to detect partially plagiarized solution pairs confidently.













