How Nesta uses NLP to process 7m job ads and shed light on the UK’s labor marketA case study on Nesta’s workflow for extracting 7 million job ads to better understand UK skill demand, using a custom mapping step to match skills to any government taxonomy.
Neuradicon: operational representation learning of neuroimaging reportsWatkins, Gray, Julius, Mah, Pinaya, Wright, Jha, Engleitner, Cardoso, Ourselin, Rees, Jaeger, Nachev (2023)Labelled data for each task was produced using the Prodigy labelling tool. Each report was labelled in a paired-annotation manner. [...] We used the grammatical dependency parse produced by the spaCy parser as input and implemented the patterns using the spaCy dependency matcher.
Introducing Prodigy-HFHugging Face BlogLast week, Explosion introduced Prodigy-HF, a new Prodigy plugin offering code recipes that directly integrate with the Hugging Face stack.
Toward a Critical Toponymy Framework for Named Entity Recognition: A Case Study of Airbnb in New York CityBrunila, LaViolette, CH-Wang, Verma, Féré, McKenzie (2023), EMNLP 2023All annotation was performed using Prodigy following an initial training session where annotators collaboratively annotated a randomly chosen set of samples.
Models as annotators in ProdigyHow to use models and LLMs as annotators to find disagreements and prioritize examples to annotate first.
You are what you read: Building a personal internet front-page with spaCy and ProdigyPyCon DE & PyData Berlin
Rulers, NER, and data iterationAbout the power of Rules + ML and the importance of iteration on your pipeline and your data.
Explosion in 2022: Our Year in ReviewIt's been another exciting year at Explosion! We've developed a new end-to-end neural coref component for spaCy, improved the speed of our CNN pipelines up to 60%, and published new pre-trained pipelines for Finnish, Korean, Swedish and Croatian. We've also released several updates to Prodigy and introduced new recipes to kickstart annotation with zero- or few-shot learning.
Setting your ML project up for success“What can you do to maximize probability of success for your Machine Learning solution? Throughout my 15 years as data scientist in academia, big pharma and through consulting, one common theme has emerged: the most reliable predictor of success for any NLP or ML-based solution is whether or not you involve the data science team early on.”
How the Guardian approaches quote extraction with NLPA case study of the Guardian's spaCy-Prodigy workflow to modularize quote extraction for content creation. This study includes iterative annotation guidelines and custom interface functionality.
Introducing Span Categorization in Prodigy and spaCyIn this video, we’ll show you how to use Prodigy for spaCy’s Span Categorizer. We’ll be annotating food recipes and looking into ways to help with consistent annotations and speed up the process with patterns and temporary models.
Automated Identification of Clinical Procedures in Free-Text Electronic Clinical Records with a Low-Code Named Entity Recognition WorkflowMacri, Teoh, Bacchi, Sun, Selva, Casson, Chan (2022), Methods of Information in MedicineThe use of a low-code annotation software tool [Prodigy] allows the rapid creation of a custom annotation dataset to train a NER model to identify clinical procedures stored in free-text electronic clinical notes.
Talking sense: using machine learning to understand quotesThe Guardian BlogHow the Guardian uses spaCy and Prodigy to train a machine learning model that helps extract quotes from news articles and match them to the correct source.
✨ prodigy v1.11.0Aug 12, 2020spaCy v3 support, annotation for overlapping and nested spans, better installation & more
Prodigy v1.10: Dependencies, relations, audio, video & moreVersion 1.10 of Prodigy includes tons of new features, including manual dependency and relation annotation, audio and video annotation, a new and improved image UI, new recipe callbacks, more settings for manual NER, plus various new config options and settings.
sense2vec reloaded: contextually-keyed word vectorsIn 2016 we trained a sense2vec model on the 2015 portion of the Reddit comments corpus, leading to a useful library and one of our most popular demos. That work is now due for an update. In this post, we present a new version and a demo NER project that we trained to usable accuracy in just a few hours.
Building new NLP solutions with spaCy and ProdigyPyData Berlin“Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to embrace failure, I say failure sucks — so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new NLP projects.”
Training an insults classifier with Prodigy in ~1 hourIn this video, we’ll show you how to use Prodigy to train a classifier to detect disparaging or insulting comments. Prodigy makes text classification particularly powerful, because you can try out new ideas very quickly.
Prodigy-Segment for Pixel SegmentationUse Meta’s “Segment Anything” model in Prodigy to help you select the right pixels in images.
Who said what: using machine learning to correctly attribute quotesThe Guardian Engineering BlogHow the Guardian uses spaCy and Prodigy to train a custom coreference resolution model.
Half hour of labeling power: Can we beat GPT?PyData NYCLarge Language Models (LLMs) offer a lot of value for modern NLP and can typically achieve surprisingly good accuracy on predictive NLP tasks. But can we do even better than that? In this workshop we show how to use LLMs at development time to create high-quality datasets and train specific, smaller, private and more accurate models for your business problems.
✨ prodigy v1.14.5Oct 24, 2023Toggle for character vs. token highlighting, CSS and JS from local and remote paths
Identifying Signs and Symptoms of Urinary Tract Infection from Emergency Department Clinical Notes Using Large Language ModelsIscoe, Socrates, Gilson, Chi, Li, Huang, Kearns, Perkins, Khandjian, Taylor (2023)For annotation we employed Prodigy, a scriptable annotation tool designed to maximize efficiency, enabling data scientists to perform the annotation tasks themselves and facilitating rapid iterative development in natural language processing (NLP) projects.
Large Language Models: From Prototype to ProductionEuroPython KeynoteLarge Language Models (LLMs) have shown some impressive capabilities and their impact is the topic of the moment. In this talk, Ines presents visions for NLP in the age of LLMs and a pragmatic, practical approach for how to use Large Language Models to ship more successful NLP projects from prototype to production today.
Concepts and measures of bureaucratic constraints in European Union laws from hand-coding to machine-learningFranchino, Migliorati, Pagano, Vignoli (2023)The models “learn” the relations between the text tokens and the entity categories from two randomly selected samples of sentences that are extracted from a pre-processed corpus and have been manually annotated using the Python-implemented platform “Prodigy”.
SpanCat with spaCy and Prodigy on real dataYouTube series by WJB Mattingly showing an end-to-end project, from cultivating and annotating data to training, testing and visualizing a model.
Slovak Dataset for Multilingual Question AnsweringHládek, Staš, Juhár, Koctúr (2023)We used the Prodigy annotation tool to annotate the questions and answers. One annotation task corresponds to one web application deployment and different configurations.
Custom Interfaces with blocksYou can create custom annotation layouts in Prodigy using the annotation widgets that Prodigy provides by using the blocks feature. This video explains how to use this feature by building a custom interface that can manually annotate and transcribe audio.
Finding Video Games with Sense2VecIn this video, we’ll show how you can improve the annotation experience by leveraging sense2vec to pre-fill named entities.
Finding Bad Labels for Text Classification with Jupyter and Prodigy In this video, we’ll show you how to use set up Prodigy to find bad labels in text classification tasks. While many of the techniques are applied to text classification, they can also be used for classification tasks in general.
Finding Duplicates in Tabular Data with Jupyter and ProdigyIn this video, we’ll show you how to use Prodigy to train a named entity recognition model from scratch, by taking advantage of semi-automatic annotation and modern transfer learning techniques.
Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat ViolenceHalterman, Keith, Sarwar, O’Connor (2021), ACL 2021Figure A2 shows a stylized version of the custom interface we built using the Prodigy annotation tool. Annotators are presented with an entire document, with sentences sequentially highlighted.
✨ prodigy v1.10.0Jun 16, 2020Dependency and relation annotation, audio, video, character-based NER & more
Image Captioning with Prodigy & PyTorchIn this video, we’ll show you how you can use Prodigy to script fully custom annotation workflows in Python, how to plug in your own machine learning models and how to mix and match different interfaces for your specific use case.
Rapid NLP annotationData Science SummitThis talk presents a fast, flexible and even somewhat fun approach to named entity annotation. Using our approach, a model can be trained for a new entity type in only a few hours, starting from only a feed of unannotated text and a handful of seed terms.
Building Prodigy: Our new tool for efficient machine teachingines.ioThe philosophy behind Prodigy’s features and its cloud-free design.
On the Creation of Classifiers to Support Assessment of E-PortfoliosGantikow, Isking, Libbrecht, Müller, Rebholz (2023)In this workflow, Prodigy selects and presents text examples that were classified with a very low degree of certainty. The annotator reviews the proposed classifications and corrects them, if necessary.
Impoliteness and morality as instruments of destructive informal social control in online harassment targeting Swedish journalistsBjörkenfeldt, Gustafsson (2023)In the annotation tool Prodigy used for this process, the tweets directed towards journalists were displayed alongside the initial tweet that initiated the conversation thread and the subsequent reply from the journalist.
Prodigy-ANN for Image Retrieval via CLIPDealing with a huge bucket of images that you want to annotate? The new image retrieval features in Prodigy-ANN (approximate nearest neighbors) might help!
Newsletter September 2023The latest edition of our newsletter, featuring our plans for premium models, LLMs, chain-of-thought prompting, upcoming events and talks, and exciting new Prodigy features. Plus exclusive discounts!
Task Routers in ProdigyHow to use the new task routers to customize how examples are assigned in multi-annotator workflows.
Inter-rater agreement for the annotation of neurologic signs and symptoms in electronic health recordsOommen, Howlett-Prieto, Carrithers, Hier (2023)Prodigy was used to annotate neurologic concepts in the EHR physician notes.
Large Disagreement Modelling“In this blogpost I’d like to talk about large language models. There’s a bunch of hype, sure, but there’s also an opportunity to revisit one of my favourite machine learning techniques: disagreement.”
Predicting relations between SOAP note sections: The value of incorporating a clinical information modelSocrates, Gilson, Lopez, Chi, Taylor, Chartash (2023), Journal of Biomedical InformaticsTo support human annotation, we first annotate 100 Assessment and Plan subsections manually using Prodigy, and then use spacy-transformers to fine-tune a general domain RoBERTa-base model pretrained on OntoNotes 5 for both the Assessment and Plan section NER tagging.
Fiscal data in text: Information extraction from audit reports using Natural Language ProcessingBeltran (2023), Data & Policy, Cambridge University PressI relied on the text annotation software Prodigy in Python that offers a friendly user interface where the reviewer can read the text and assign a label to each paragraph.
Training spaCy NER Models with ProdigyThis handy flowchart contains our most common tips, tricks, and best practices for training and updating spaCy named entity recognition models with Prodigy.
The triangulation of ethical leader signals using qualitative, experimental, and data science methodsBanks, Ross, Toth, Tonidandel, Goloujeh, Dou, Wesslen (2022)This additional text was labeled by the same coding team using Prodigy, [...] a flexible user interface tool built on top of spaCy, a leading open source library in python for natural language processing. We created a spaCy end‐to‐end project workflow including package versioning, data pre‐processing, data ingestion into a database, annotation sessions using Prodigy’s user interface, model training, model evaluation, python packaging, and visual app for testing the model.
Speech acts in the Dutch COVID-19 Press ConferencesSchueler, Marx (2022), Language Resources and EvaluationWe used the annotation tool Prodigy. Prodigy provides a simple interface in which the annotator sees a sentence and selects the applicable speech acts. The use of Prodigy considerably sped up the annotation process, allowing the annotators to annotate around 200 sentences per hour.
Diary of a spaCy project: Predicting GitHub TagsMany people assume that working on an NLP project involves a lot of machine learning. Our experience is that it's much less about flowing tensors, and more about making a tailored solution. This blogposts demonstrates how a typical spaCy project could be initiated, implemented and executed towards a custom solution.
Explosion in 2021: Our Year in ReviewThe year 2021 is coming to an end, and like the previous year, it was shaped by unique challenges that impacted our work together. For Explosion, it was a very productive year. We found an investor that fits our strategy, the work on Prodigy Teams is in full swing, and the team has grown a lot. So here's our look back at our highlights of the year 2021.
Explosion in 2020: Our Year in ReviewWhile 2020 hasn’t been easy for anyone, at Explosion we’ve considered ourselves relatively fortunate in this most interesting year. We’ve always worked remotely, so we’ve been able to take both pride and comfort in continuing to ship good software. Here’s a look back at what we’ve been up to.
Training a Named Entity Recognition Model with Prodigy and Transfer LearningIn this video, we’ll show you how to use Prodigy to train a named entity recognition model from scratch, by taking advantage of semi-automatic annotation and modern transfer learning techniques.
FAQ #1: Tips & tricks for NLP, annotation & training with Prodigy and spaCyIn this video, Ines talks about a few frequently asked questions and shares some general tips and tricks for how to structure your NLP annotation projects, how to design your label schemes and how to solve common problems.
Explosion in 2017: Our Year in ReviewWe founded Explosion in October 2016, so this was our first full calendar year in operation. We set ourselves ambitious goals this year, and we're very happy with how we achieved them. Here's what we got done.
Prodigy: A new tool for radically efficient machine teachingMachine learning systems are built from both code and data. It's easy to reuse the code but hard to reuse the data, so building AI mostly means doing annotation. This is good, because the examples are how you program the behaviour – the learner itself is really just a compiler. What's not good is the current technology for creating the examples. That's why we're pleased to introduce Prodigy, a downloadable tool for radically efficient machine teaching.
Prodigy in 2023: LLMs, task routers, QA and pluginsWe have made a ton of new updates in Prodigy this year with v1.12, v1.13, and v1.14 releases. So we decided to write a post about them.
Developing a Named Entity Recognition Dataset for TagalogMiranda (2023), IJCNLP-AACL 2023We used Prodigy as our annotation tool. We set up a web server on the Google Cloud Platform and routed the examples through Prodigy’s built-in task router.
Prodigy-PDF for PDF annotation and OCRWant to annotate PDF files? Our new Prodigy plugin can help with that! To explain how to use PDF segmentation and OCR, Vincent made a small demo video.
✨ prodigy v1.14.3Oct 6, 2023Inter-annotator agreement for document-level and token-level annotations, new plugins
Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical TextsDannenfelser, Zhong, Zhang, Yao (2023), NeurIPSTissue, cell type, tool, and method were annotated using the Prodigy software tool developed by Explosion AI for easy tracking of token-level tags.
✨ prodigy v1.12.0Jul 5, 2023LLM-assisted workflows for annotation and prompt engineering, task routing for multi-annotator setups
How Good is the Model in Model-in-the-loop Event Coreference Resolution Annotation?Ahmed, Nath, Regan, Pollins, Krishnaswamy, Martin (2023)Figure 6 illustrates the interface design of the annotation methodology on the popular model-in-the-loop annotation tool - Prodigy. We use this tool for the simplicity it offers in plugging in the various ranking methods we explained.
Newsletter May 2023We got so much amazing feedback from the spaCy user survey, thank you all for your contributions! The most requested feature was spaCy integration with LLMs, which is why we’re so excited to announce spacy-llm!
Creating Custom Event Data Without Dictionaries: A Bag-of-TricksHalterman, Schrodt, Beger, Bagozzi, Scarborough (2023)While in the past the process of generating training case has been quite time consuming and tedious, newer approaches such as those incorporated into the web-based Prodigy annotation system allow this to be done much more quickly.
Deploying a Prodigy cloud service for Posh’s financial chatbotsA Prodigy case study of Posh AI's production-ready annotation platform and custom chatbot annotation tasks for banking customers.
Extracting Structured Information from Greek Legislation DataAlexios (2023)Worth noting is the existence of an application, called Prodigy, which takes advantage of an active learning framework and provides users with an interactive interface for data annotation.
Finetuning and Bulk Labelling Images with Prodigy In this video, we’ll show how you might be able to improve the annotation experience by using bulk labelling for image classification.
Bulk Labelling and ProdigyIn this video, we’ll show a bulk labelling technique that can help you prepare data for Prodigy.
Finding Bad Image Data using UMAP and ProdigyIn this video, we’ll show you how to use Prodigy to find bad examples in the Google QuickDraw dataset. We will be leveraging a technique that involves UMAP to find strange images semi-automatically.
Healthsea: an end-to-end spaCy pipeline for exploring health supplement effectsCreate better access to health with machine learning and natural language processing. Read about our journey of developing Healthsea, an end-to-end spaCy pipeline for analyzing user reviews to supplement products and extracting potential effects on health.
How We Analyzed Google’s Search ResultsThe MarkupUsing the Prodigy annotation tool, we created a user interface and a coder manual for two annotators to spot-check 741 stained images randomly sampled from our dataset.
Explosion in 2019: Our Year in ReviewAs 2019 draws to a close and we step into the 2020s, we thought we’d take a look back at the year and all we’ve accomplished. And we realized we had so much that we could give you a month-by-month rundown of everything that happened.
✨ prodigy v1.8.0May 20, 2019Support for spaCy v2.1, basic auth, multi-user sessions, review workflow & more
Training a new entity type with Prodigy – annotation powered by active learningIn this video, we’ll show you how to use Prodigy to train a phrase recognition system for a new concept. Specifically, we’ll train a model to detect references to drugs, using text from Reddit.