"Karmanye Vadhikaraste Maa Phaleshu Kada chana"
Meaning: "Focus on the task at hand, don't let your actions hinge on the outcome."

I completed my PhD thesis, titled "Cross-view Embeddings for Information Retrieval", at the Technical University of Valencia (UPV). My advisors were Paolo Rosso (Technical University of Valencia, Spain) and Rafael E. Banchs (Institute for Infocomm Research (I2R), Singapore). I was a member of the Pattern Recognition and Human Language Technologies (PRHLT) Research Center and the Natural Language Engineering Lab.

I am a researcher with an appetite for engineering, and I like to keep my engineering and mathematical skills up to date. I recently joined Amazon as a Machine Learning Scientist on the Core ML team in Bangalore.

Research Interests

Information retrieval, machine learning, text mining, statistical natural language processing, deep learning, data science


  • March, 2017: Joined the Core ML team at Amazon as an ML Scientist to take on cutting-edge ML research
  • January, 2017: Successfully defended PhD thesis
  • December, 2016: Paper accepted at ECIR 2017 titled "Learning to classify inappropriate query-completions"
  • December, 2016: Paper accepted at Information Processing & Management titled "Continuous space models for CLIR"
  • December 07, 2016: Tutorial on "Deep Learning for Information Retrieval" at FIRE 2016, Kolkata, India
  • November 03, 2016: Visited Search team of Wikimedia Foundation, San Francisco, USA
  • November 02, 2016: Talk at Bay Area NLP Meetup @ Galvanize, San Francisco, USA
  • October 28-30, 2016: Participated in the GSoC Mentor Summit at Google (Mountain View), USA
  • September 22, 2016: PhD thesis submitted
  • June, 2016: Patent based on work at Microsoft Bing (London) filed with the USPTO


  • Machine Learning Scientist at Amazon.com, Bangalore (March 2017-)
    • Core ML Team
  • Applied Scientist Intern at Bing, Microsoft London (May-August 2015)
    • Bing query formulation (Mentor: Nicola Cancedda)
  • Research Intern at Microsoft Research, India (June - Sept 2014)
    • Machine Translation for Code-mixed Text (Mentor: Monojit Choudhury)
  • Research Intern at Institute for Infocomm Research (I2R), Singapore (Jan - Aug 2013)
  • Research Intern at FBK Trento, Italy (Sept-Oct 2012)
    • Graph-based Semi-Supervised Learning for Entity Linking (Mentor: Claudio Giuliano)
  • Mentor at Google Summer of Code (April-Aug 2012, 2014, 2016) [info]
  • Student Developer at Google Summer of Code (April-Aug 2011) [info]


  • (September 2015) Dynamic Block Threshold for Query Autosuggest
    London Brown Bag, Microsoft Bing (London, UK)

  • (July 2014) Deep Learning for Mixed-Script IR
    Bing, Microsoft India Development Center (Hyderabad, India)

  • (Dec 2013) Encoding Transliteration Variation to Aid IR in Transliterated Space
    Microsoft Research (Bangalore, India)

  • (July 2013) Cross-language High Similarity Search
    Temasek Laboratories at Nanyang Technological University (TL@NTU) (Singapore)

  • (April 2013) Experimenting IR/NLP with Terrier [slides-code]
    HLT, Institute for Infocomm Research (Singapore)

  • (Sept 2012) Tutorial on Learning-to-Rank [slides]
    28th Conference of the Spanish Society for Natural Language Processing SEPLN 2012 (Castellón, Spain)




  • Continuous space models for CLIR [pdf]
    Parth Gupta, Rafael E. Banchs and Paolo Rosso
    Information Processing & Management, 2016 (Impact Factor: 1.397)

  • Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language
    Parth Gupta, Marc Franco-Salvador, Paolo Rosso and Rafael E. Banchs
    Knowledge-Based Systems, 2016 (Impact Factor: 3.325)

  • A deep source-context feature for lexical selection in statistical machine translation
    Parth Gupta, Marta R. Costa-Jussà, Rafael E. Banchs and Paolo Rosso
    Pattern Recognition Letters, 2016 (Impact Factor: 1.586)

  • Squeezing bottlenecks: exploring the limits of autoencoder semantic representation capabilities
    Parth Gupta, Rafael E. Banchs and Paolo Rosso
    Neurocomputing, 2016 (Impact Factor: 2.005)

  • Methods for Cross-Language Plagiarism Detection [code]
    Alberto Barrón-Cedeño∗, Parth Gupta∗ and Paolo Rosso (∗ Equal contribution)
    Knowledge-Based Systems Vol. 50, 2013 (Impact Factor: 4.104)


  • Learning to classify inappropriate query-completions
    Parth Gupta and Jose Santos
    Proceedings of ECIR 2017 (Scotland, UK)

  • Query Expansion for Mixed-script Information Retrieval [pdf] [code] [demo]
    Parth Gupta, Kalika Bali, Rafael Banchs, Monojit Choudhury and Paolo Rosso
    Proceedings of SIGIR 2014 (Gold Coast, Australia)

  • Enrichment of Bilingual Dictionary through News Stream Data [data-code]
    Ajay Dubey, Parth Gupta, Vasudev Varma and Paolo Rosso
    Proceedings of LREC 2014 (Reykjavík, Iceland)

  • Cross-Language Plagiarism Detection using a Multilingual Semantic Network [pdf]
    Marc Franco-Salvador, Parth Gupta and Paolo Rosso
    Proceedings of ECIR 2013 (Moscow, Russia)

  • Expected Divergence based Feature Selection for Learning to Rank [pdf] [poster][code]
    Parth Gupta and Paolo Rosso
    Proceedings of COLING 2012 (Mumbai, India)

  • Cross-language High Similarity Search using a Conceptual Thesaurus [pdf] [slides]
    Parth Gupta, Alberto Barrón-Cedeño and Paolo Rosso
    Proceedings of CLEF 2012 (Rome, Italy)

  • Detection of Paraphrastic Cases of Mono-lingual and Cross-lingual Plagiarism [pdf]
    Parth Gupta, Khushboo Singhal, Prasenjit Majumder and Paolo Rosso
    Proceedings of ICON 2011 (Chennai, India)


  • Modeling of terms across scripts through autoencoders [pdf]
    Parth Gupta
    SIGIR 2014 Doctoral Consortium (Gold Coast, Australia)

  • English-to-Hindi system description for WMT 2014: Deep Source-Context Features for Moses
    Parth Gupta, Marta Costa-jussà Ruiz, Rafael E. Banchs, Paolo Rosso
    The 9th ACL WMT workshop on Statistical Machine Translation, ACL 2014 (Baltimore, USA)

  • On Dimensionality Reduction Techniques for Cross-Language Information Retrieval [pdf]
    Parth Gupta
    Future Directions in Information Access Symposium (FDIA) in Conjunction with European Summer School in Information Retrieval (ESSIR), FDIA@ESSIR, 2013 (Granada, Spain)

  • Text Reuse with ACL: (Upward) Trends [pdf] [slides]
    Parth Gupta and Paolo Rosso
    Workshop on Rediscovering 50 Years of Discoveries, ACL 2012 (Jeju, South Korea)

  • Multiword Named Entities Extraction from Cross-Language Text Re-use [pdf] [slides]
    Parth Gupta, Khushboo Singhal and Paolo Rosso
    CREDISLAS Workshop, LREC 2012 (Istanbul, Turkey)

Working notes

  • Mapping Hindi-English Text Re-use Document Pairs [pdf] [slides]
    Parth Gupta and Khushboo Singhal
    Notebook Papers of Forum for Information Retrieval Evaluation, FIRE 2011 (Mumbai, India)

  • External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach [pdf]
    Sameer Rao, Parth Gupta, Khushboo Singhal, and Prasenjit Majumder
    Notebook Papers of CLEF 2011 LABs and Workshops, CLEF 2011 (Amsterdam, The Netherlands)

  • External Plagiarism Detection: N-Gram Approach using Named Entity Recognizer [pdf]
    Parth Gupta, Sameer Rao, and Prasenjit Majumder
    Notebook Papers of CLEF 2010 LABs and Workshops, CLEF 2010 (Padua, Italy)


  • Cross-view Embeddings for Information Retrieval [pdf]
    Parth Gupta
    Doctoral Thesis, UPV (Valencia, Spain)

  • Learning to Rank - Using Bayesian Networks [pdf]
    Parth Gupta
    Master's Thesis, DA-IICT (Gandhinagar, India)


  • jDNN: Deep learning toolkit with CUDA support (under development and sparsely documented). [java]

  • Mixed-script Equivalents: The code used in "Query Expansion for Mixed-script Information Retrieval", along with trained models. [java]

  • Xapian: Large-scale search engine library, an open-source project with commercial support. I implemented the learning-to-rank module xapian-letor, which was further extended as a GSoC project in 2012. [c++]

  • FS-ED for Letor: Expected divergence based feature selection module. The feature selection works well (statistically significant improvements) when used with ranking algorithms based on large-margin classifiers, e.g. RankSVM. [java]

  • Terrier Wrapper: A wrapper on top of Terrier 3.5 for various operations, such as collecting term/document statistics, using an off-the-shelf stemmer, and creating a term-document matrix. More details are available on the code page. [java]

  • Replicated SoftMax (RSM): Implementation of the RSM-type Restricted Boltzmann Machine (RBM); email me for a copy. [octave]

  • Cross-language PD: Detailed fragment identification algorithm presented in the Knowledge-Based Systems paper. [java]

  • IR Evaluation Framework: Evaluation framework with measures such as MAP, NDCG@k, MRR and Recall. [perl]


Email: x@gmail.com (where x = pargup8)