Summer projects
Undergraduate
This page lists possible projects in Language Technology for the Summer 2007 Vacation Scholarships. If you find that a project listed here is close to something you're interested in, but isn't quite what you were looking for, you should speak to the project supervisor to see if an appropriate project can be constructed. More generally you'll find that members of staff are usually open to suggestions for projects.
- Question Answering and Information Extraction
  - Building a Web Service [Diego Mollá Aliod]
  - A framework to test syntactic patterns [Diego Mollá Aliod]
  - Question classification [Diego Mollá Aliod]
  - Processing of complex questions [Diego Mollá Aliod]
- Intelligent Text Processing
  - Parser Evaluation [Robert Dale]
  - A Voting System for PDF Text Extraction [Robert Dale]
  - Corpus-based Correction of OCR-introduced Spelling Errors [Robert Dale]
- Embodied Conversational Agents
  - Fidelity in Talking Heads [Robert Dale]
  - Can You Wreck a Nice Beach? [Robert Dale]
- Other Projects
  - Analysing Podcasts [Steve Cassidy]
  - Corpus Query Tool [Steve Cassidy]
  - Language Analysis Software (Concordancer) [Debbie Richards]
  - Working with AJAX in a Semantic Web Context [Rolf Schwitter]
  - Towards an Automated Student Advisor [Rolf Schwitter]
  - Project X [Rolf Schwitter]
Question Answering and Information Extraction
Most of the projects in this section are related to AnswerFinder. AnswerFinder is a question answering system that searches text documents to find the answer to questions asked in English.
The AnswerFinder team is formed by people enthusiastic about applying innovative methods to practical solutions. By participating in one of the projects below you will develop a module that will be used as part of the AnswerFinder system.
Building a Web Service
Supervisor: Diego Mollá Aliod
The goal of this project is to convert AnswerFinder into a Web Service. The details of how this could be done are available in:
Required background: Good programming skills in C++; experience with web technology (e.g. Pass grade in COMP249).
Desired background: Experience with XML programming; experience with Web Services.
A framework to test syntactic patterns
Supervisor: Diego Mollá Aliod
A Masters student is currently developing a set of question-answering patterns as part of her project. The task of this project is to build a system that tests these patterns on a collection of questions and answer candidates.
Required background: Good programming skills, preferably in C++.
Desired background: Pass grade in SLP148, COMP248, or COMP348.
Question classification
Supervisor: Diego Mollá Aliod
AnswerFinder currently uses a very simple method to classify questions. The task of this project is to expand the current method by introducing patterns based on syntactic information and/or machine learning techniques.
Required background: Good programming skills, preferably in C++; Pass grade in SLP148, COMP248, or COMP348.
Desired background: Experience in programming in a group.
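To give a feel for what a pattern-based classifier looks like, here is a minimal sketch in Python. The answer-type labels and the regular expressions are hypothetical and chosen only for illustration; they are not AnswerFinder's actual categories or rules, and the project itself would work with syntactic information rather than surface strings.

```python
import re

# Hypothetical answer-type labels and surface patterns, for illustration only.
# They are not taken from AnswerFinder.
PATTERNS = [
    ("PERSON", re.compile(r"^who\b", re.IGNORECASE)),
    ("LOCATION", re.compile(r"^where\b", re.IGNORECASE)),
    ("DATE", re.compile(r"^(when|what year)\b", re.IGNORECASE)),
    ("NUMBER", re.compile(r"^how (many|much)\b", re.IGNORECASE)),
]


def classify_question(question):
    """Return the label of the first pattern that matches, else 'OTHER'."""
    text = question.strip()
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "OTHER"


if __name__ == "__main__":
    for q in ["Who wrote Hamlet?", "How many moons does Mars have?"]:
        print(q, "->", classify_question(q))
```

A machine learning approach would replace the hand-written list with a model trained on labelled questions, but the interface (question in, answer type out) could stay the same.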
Processing of complex questions
Supervisor: Diego Mollá Aliod
AnswerFinder currently processes simple questions where only one answer type is expected. The task of this project is to analyse more complex questions and decompose them into simpler questions that can be fed to AnswerFinder. An example of a complex question is:
"Discuss the prevalence of steroid use among female athletes over the years. Include information regarding trends, side effects and consequences of such use."
The goal of the project is to automatically decompose the above question into simpler questions like:
- What is the prevalence of steroid use among female athletes over the years?
- What are the trends of such use?
- What are the side effects of such use?
- What are the consequences of such use?
Required background: Good programming skills, preferably in C++.
Desired background: Pass grade in SLP148, COMP248, or COMP348; experience in programming in a group.
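As a rough illustration of the decomposition described above, the following Python sketch handles only questions shaped like the example ("Discuss X. Include information regarding A, B and C of Y."). The function name and the two regular expressions are assumptions made for this sketch; a real solution would need syntactic analysis to cope with other phrasings.

```python
import re


def decompose(question):
    """Split a 'Discuss X. Include information regarding A, B and C of Y.'
    question into simpler questions. Only this one shape is handled."""
    simple = []

    # Main topic: everything between "Discuss" and the first full stop.
    main = re.search(r"Discuss (.+?)\.", question)
    if main:
        simple.append(f"What is {main.group(1)}?")

    # Listed aspects: "A, B and C" followed by "of <topic>".
    extra = re.search(r"Include information regarding (.+?) of (.+?)\.", question)
    if extra:
        aspects = re.split(r",\s*|\s+and\s+", extra.group(1))
        for aspect in aspects:
            if aspect:
                simple.append(f"What are the {aspect} of {extra.group(2)}?")
    return simple


if __name__ == "__main__":
    complex_question = (
        "Discuss the prevalence of steroid use among female athletes over "
        "the years. Include information regarding trends, side effects and "
        "consequences of such use."
    )
    for q in decompose(complex_question):
        print(q)
```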
Intelligent Text Processing
The projects in this section are related to information extraction and summarisation work being carried out in a number of projects in the CLT.
Parser Evaluation
Supervisor: Robert Dale
The purpose of this project is to establish which of various publicly-available syntactic parsers provide the most useful and reliable results when applied to real texts, with the particular aim of supporting a sophisticated text summariser. The project involves (a) briefly reviewing the structure and content of existing parser evaluation exercises; (b) defining, in consultation with the supervisor, an evaluation exercise appropriate to our needs; (c) locating, downloading and installing a number of publicly-available parsers; (d) running tests and computing statistics; and (e) writing up the results.
The project would be particularly suitable for someone interested in a future honours project that focusses on either parsing or summarisation of real natural language texts.
Required background: Pass grade in SLP148, COMP248, or COMP348; comfort with installing and using software under Windows and Unix; good ability in a scripting language; attention to detail.
Desired background: A high degree of comfort with syntactic notions and terminology.
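The statistics to compute in step (d) are to be defined with the supervisor. One common option, sketched here in Python, is labelled bracket precision, recall and F1 in the style of PARSEVAL. The representation of a constituent as a (label, start, end) tuple and the function name are assumptions made for this example.

```python
def bracket_scores(gold, predicted):
    """Compare two sets of labelled constituents, each a (label, start, end)
    tuple, and return (precision, recall, f1)."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical constituents over a five-word sentence.
    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
    predicted = [("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5), ("PP", 3, 5)]
    print(bracket_scores(gold, predicted))
```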
A Voting System for PDF Text Extraction
Supervisor: Robert Dale
There are many public-domain and commercial programs available for extracting text from PDF files. None is perfect, especially when the input document is typographically complex. However, the programs tend to make different errors, so where one fails another may succeed. The purpose of this project is to explore the idea of using several different text extractors, then integrating the results by means of a voting mechanism, so that the returned results are consistent with what most of the programs believe to be the case. The project involves: (a) downloading and installing a number of PDF to text programs; (b) running these on a controlled sample of PDF files to determine the kinds of errors and inconsistencies produced; (c) working with the supervisor to devise an architecture that allows each system to vote on specific aspects of the extracted text; and (d) implementing a voting-based system that delivers the best results.
The project would be particularly suitable for someone interested in a future honours project that focusses on extracting information from real documents.
Required background: Pass grade in SLP148, COMP248, or COMP348; comfort with installing and using software under Windows and Unix; good programming skills in C++ or Java.
Desired background: A high degree of comfort with syntactic notions and terminology.
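The core of the voting idea can be sketched in a few lines of Python. This sketch assumes the hard part is already done: each extractor's output has been split into tokens and aligned so that position i refers to the same word in every list. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def vote(aligned_outputs):
    """Majority vote, position by position, over pre-aligned token lists.

    Assumes every list has the same length and that tokens at the same
    index correspond to each other."""
    result = []
    for candidates in zip(*aligned_outputs):
        winner, _count = Counter(candidates).most_common(1)[0]
        result.append(winner)
    return result


if __name__ == "__main__":
    # Hypothetical outputs from three extractors for the same passage.
    outputs = [
        ["The", "quick", "fox"],
        ["The", "qu1ck", "fox"],
        ["Tbe", "quick", "fox"],
    ]
    print(" ".join(vote(outputs)))
```

In practice the extractors will disagree about word boundaries, reading order and line breaks, so devising the alignment and deciding which aspects of the text are voted on is where most of the design work lies.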
Corpus-based Correction of OCR-introduced Spelling Errors
Supervisor: Robert Dale
A common way to archive legacy documents is to run them through a scanner to produce a PDF file, to which a searchable text layer is added using optical character recognition (OCR). Unfortunately, OCR is not perfect, so spelling errors are introduced that damage the effectiveness of search techniques.
Using an existing corpus of several thousand scanned academic papers (in the ACL Anthology), this project aims to develop automatic spelling correction techniques that use the corpus itself as a source of evidence for spelling corrections. For example, if the misrecognised string spe11in8 appears in a document, a simple distance metric may find other similar strings, such as spelling, to be much more frequent in the corpus, and on the basis of frequency then choose this as a correction. There are, of course, a number of other factors involved, which makes the project more challenging than it at first seems.
Required background: Pass grade in SLP148, COMP248, or COMP348; excellent programming skills in C++ or Java.
Desired background: A decent spelling ability so you know when something is wrong.
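The spe11in8 example can be worked through with a small Python sketch. It uses Levenshtein edit distance as the "simple distance metric" and raw corpus counts as the frequency evidence; both choices, the threshold of three edits, and the tiny frequency table are assumptions for illustration rather than the project's prescribed method.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, frequencies, max_distance=3):
    """Return the most frequent corpus word within max_distance edits of
    `word`, provided it is more frequent than `word` itself."""
    candidates = [w for w in frequencies
                  if w != word and edit_distance(word, w) <= max_distance]
    if not candidates:
        return word
    best = max(candidates, key=lambda w: frequencies[w])
    return best if frequencies[best] > frequencies.get(word, 0) else word


if __name__ == "__main__":
    # Hypothetical corpus counts.
    frequencies = Counter({"spelling": 120, "spe11in8": 1, "parsing": 80})
    print(suggest("spe11in8", frequencies))
```

Comparing every unknown string against the whole vocabulary is slow on a corpus of several thousand papers, and a rare but correct word can be "corrected" to a frequent neighbour; these are among the factors that make the real problem harder.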
Embodied Conversational Agents
The projects in this section are related to a new project starting in 2007 in the CLT, focussed on building a sophisticated Embodied Conversational Agent.
Fidelity in Talking Heads
Supervisor: Robert Dale
There are now a number of publicly available talking heads, or Embodied Conversational Agents (ECAs) as they are known. These are programmable virtual heads which produce both synthesized speech and facial expressions on provision of suitably marked up input. The aim of this project is to experiment with the range of heads that are available, to determine (a) their ease of installation; (b) the utility of their APIs for both voice and face control; and (c) the quality and realism of the results, which we might think of as the head's fidelity.
This project is particularly suitable for students who are interested in the possibility of linking their future research to the Thinking Head project.
Required background: Comfort with installing and using software under Windows and Unix; some programming skills; pass grade in SLP148, COMP248, or COMP349; competence in producing documentation for others to use.
Desired background: Familiarity with VoiceXML, SALT and/or SSML.
Can You Wreck a Nice Beach?
Supervisor: Robert Dale
Each year, the major vendors of desktop speech recognition packages release new versions, claiming ever better accuracy in their recognition capabilities. The aim of this project is to construct a test environment for comparing the different packages, and to use this to perform the first of what we hope will be a series of annual evaluations of the packages. The project involves (a) reviewing existing speech recognition evaluation exercises; (b) downloading and installing a number of systems; (c) devising an evaluation framework in consultation with the supervisor; and (d) testing the systems and writing up the results.
This project is particularly suitable for students who are interested in pursuing an honours project related to speech recognition.
Required background: Comfort with installing and using software under Windows and Unix; some programming skills; pass grade in SLP148, COMP248, or COMP349.
Desired background: Familiarity with VoiceXML, SALT and/or SSML.
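The evaluation framework is to be devised with the supervisor, but a standard accuracy measure for speech recognisers is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the recogniser's output into the reference transcript, divided by the length of the reference. A minimal Python sketch, with a made-up function name and a test sentence borrowed from the project title:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the
    number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("can you recognize speech",
                          "can you wreck a nice beach"))
```

Because insertions count as errors, WER can exceed 1.0 when the recogniser outputs many more words than were spoken.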
Other Projects
Analysing Podcasts
Supervisor: Steve Cassidy
This project builds on some work we've done on analysing recordings of multi-party meetings to see how many speakers are present and when they're talking. I'd like to apply this technology to podcasts to segment them into speaker turns. The goal is to see how well we can do this and then to look at whether this makes listening to podcasts any better, e.g. by letting the listener skip forward by speaker turns. The project will involve running and modifying tools written in C and Tcl on audio data, signal processing and numerical/statistical programming.
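Once a podcast has been segmented, skipping forward by speaker turn is a simple lookup. The Python sketch below assumes the segmenter produces a sorted list of turn start times in seconds; that output format and the function name are assumptions for illustration, and the existing tools are written in C and Tcl rather than Python.

```python
from bisect import bisect_right


def next_turn_start(turn_starts, position):
    """Return the start time of the first speaker turn that begins after
    `position`, or None if the current turn is the last one.

    `turn_starts` must be sorted in ascending order."""
    index = bisect_right(turn_starts, position)
    return turn_starts[index] if index < len(turn_starts) else None


if __name__ == "__main__":
    # Hypothetical turn boundaries, in seconds from the start of the audio.
    turns = [0.0, 12.5, 40.0, 95.2]
    print(next_turn_start(turns, 13.0))
```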
Corpus Query Tool
Supervisor: Steve Cassidy
Linguists use large collections of text data as raw material for their work. A long time ago I wrote a web based tool to allow various text collections to be searched using regular expressions. The tool is long overdue for an overhaul and of course we've learned a lot since it was written. This project would write a new version of this tool that would allow regular expression style searches on many kinds of text and include a facility to deal with markup on the text in a flexible way. The code would probably be written in Python and be web based so this project would suit a graduate of COMP249.
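A minimal Python sketch of the core operation, assuming the simplest possible treatment of markup (strip anything that looks like a tag before searching). The function name, the tag pattern and the sample text are made up for this example; the flexible markup handling the project calls for would go well beyond this.

```python
import re

# Crude pattern for SGML/XML/HTML-style tags; an assumption for this sketch.
TAG = re.compile(r"<[^>]+>")


def concordance(text, pattern, width=20):
    """Search the text (with markup removed) for a regular expression and
    return a (left context, match, right context) triple for each hit."""
    plain = TAG.sub("", text)
    hits = []
    for match in re.finditer(pattern, plain):
        left = plain[max(0, match.start() - width):match.start()]
        right = plain[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


if __name__ == "__main__":
    sample = "<p>The <b>parser</b> parses text; parsers vary.</p>"
    for left, hit, right in concordance(sample, r"\bpars\w*", width=10):
        print(f"{left:>10} [{hit}] {right}")
```

Stripping tags throws away information a linguist may want to search on (speaker, part of speech, document section), which is why the new tool should let the user decide how markup is treated.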
Language Analysis Software (Concordancer)
Supervisor: Debbie Richards
This project involves working with the European Languages Department to develop a Language Analysis System (Concordancer). Applicants with a special interest in language analysis and a solid knowledge of regular expressions are preferred.
As part of her PhD project, Ms Yvonne Breyer (Online Development Coordinator, Dept of European Languages) is currently designing a concordancer for use in language teaching and learning. She is looking to collaborate with a Computing student experienced in C++ in order to further develop and finalise the software.
The project will be supervised by A/Prof. Debbie Richards in the Computing Department.
Working with AJAX in a Semantic Web Context
Supervisor: Rolf Schwitter
In this research project you will implement an AJAX-based text editor that gives the user predictive feedback while he or she is writing a text in controlled natural language into an HTML form. In our case the controlled natural language is a well-defined subset of English that can be translated into a formal language and used to write machine-processable annotations for the Semantic Web. The controlled natural language processor as well as an implementation of the text editor already exist. The text editor is implemented as a Java applet which runs in a web browser and communicates with a server via a socket interface. The server implements the controlled natural language processor and a reasoning service. Taking the current implementation as a starting point, you will explore in this research project whether a lightweight AJAX-based solution can provide the same kind of interactivity as the current implementation. The application that you will build has some similarity to Google Suggest but it will display tailored linguistic information while a text is written.
Required background: Good programming skills, in particular JavaScript.
Desired background: Pass grade in COMP249.
Towards an Automated Student Advisor
Supervisor: Rolf Schwitter
In this research project you will integrate a VoiceXML interpreter with a reasoning engine using SRI's Open Agent Architecture. The idea is to build an infrastructure for a VoiceXML application which gives students advice on studying Computer Science at Macquarie University. The VoiceXML application should be able to make "similar" kinds of inferences about study patterns as a human student advisor. Although this is a very complex task, you will explore in this research project what the architecture of such an application might look like and build up the necessary infrastructure.
Required background: Knowledge of VoiceXML and Prolog.
Desired background: Pass grade in COMP248 or COMP249 or COMP349.
Project X
Supervisor: Rolf Schwitter
Here you will tell me what you would like to do in a summer vacation project and present a project proposal. In particular, I am interested in natural language processing, automated theorem proving, and the Semantic Web. But I am happy to discuss any cool idea with you.