Semester 1
Date | Presenter | Title | Abstract
28 Feb | Dr Glen Livingstone, The University of Newcastle | Fully Bayesian analysis of regime switching volatility models
The main aim of this thesis is to develop an automatic MCMC estimation procedure for STAR-GARCH models that can be applied to any data set that could normally be modelled by any of the sub-classes of models that the STAR-GARCH model generalises. This will remove the need for linearity testing and model specification. The project will achieve three specific objectives.
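For orientation, a minimal two-regime logistic STAR-GARCH(1,1) specification (an illustrative parameterisation, not necessarily the one adopted in the thesis) can be written as

    y_t = \phi_1' x_t \,[1 - G(s_t;\gamma,c)] + \phi_2' x_t \, G(s_t;\gamma,c) + \varepsilon_t, \qquad \varepsilon_t = \sqrt{h_t}\, z_t,
    h_t = \omega + \alpha \varepsilon_{t-1}^2 + \beta h_{t-1}, \qquad G(s_t;\gamma,c) = \bigl[1 + \exp\{-\gamma(s_t - c)\}\bigr]^{-1},

where s_t is the transition variable. Letting \gamma \to \infty gives an abrupt threshold switch, while \phi_1 = \phi_2 collapses the model to a linear AR-GARCH, which is the sense in which the STAR-GARCH class nests the sub-classes referred to above.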
28 Feb | Sidra Safar, The University of Newcastle | Non-iterative estimation methods for ordinal log-linear models
Log-linear models have been a significant area of research in the field of categorical data analysis since the 1950s. However, until the mid-1970s, log-linear models considered only the modelling of nominal variables and made no assumption about the ordering of the categories of an ordinal variable. Log-linear models have therefore been modified to incorporate the structure of any ordinal variable, an issue that is especially relevant in most fields of social science. Ordinal log-linear models are among the most widely used and powerful techniques for modelling association among ordinal variables in categorical data analysis. Traditionally, the parameters of such models are estimated using iterative algorithms (such as the Newton-Raphson method and iterative proportional fitting), but issues such as poor initial values and contingency tables of larger dimensions can slow convergence and greatly increase the number of iterations required. More recent advances have suggested a non-iterative method of estimation that gives numerically similar estimates to those of the iterative methods for the linear-by-linear association parameter in an ordinal log-linear model for a two-way table. This presentation will highlight the iterative and non-iterative techniques commonly used to estimate the linear-by-linear association parameter from two-dimensional ordinal log-linear models, and will provide an overview of how the growing number of non-iterative estimation techniques fits into the problem. Several possibilities for extending the research on non-iterative estimates in order to validate their further use are discussed. The presentation will also highlight the research undertaken so far to achieve this objective. This includes considering the two fundamental estimates for the analysis of the association between two categorical variables forming a contingency table and determining their asymptotic characteristics. A computational study is carried out for contingency tables of varying sizes to show that these two estimates are asymptotically unbiased. It is also shown that both estimates are asymptotically normally distributed. On the basis of the standard errors, their relative efficiency has been established for 13 commonly analysed contingency tables that appear throughout the literature.
Keywords: ordinal log-linear models, non-iterative estimation, linear-by-linear association parameter, orthogonal polynomials, Newton-Raphson, iterative proportional fitting.
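For readers less familiar with the model in question, the linear-by-linear association model for an I x J table of expected cell counts m_{ij}, with fixed ordered scores u_i and v_j for the row and column categories, is conventionally written as

    \log m_{ij} = \mu + \alpha_i + \beta_j + \varphi\, u_i v_j, \qquad i = 1, \dots, I,\; j = 1, \dots, J,

so that a single parameter \varphi captures the ordinal association (with \varphi = 0 corresponding to independence); this is the parameter targeted by both the iterative and the non-iterative estimation methods discussed above.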
1 Mar | Salman Cheema, The University of Newcastle | Aggregate association for dichotomous variables
When analysing data made available by government and corporate organisations, it is often de-identified so that information on specific individuals is unknown. This ensures that strict privacy policies are maintained. In such cases, aggregate data is all that is available to the analyst, and so the analysis of aggregate data has received increasing attention in the statistical, and allied, disciplines over the last decade. Such data is often summarised in the form of stratified 2x2 tables. Despite its relative youth in statistical research, there is a wealth of literature in ecological inference (EI) that considers the association structure between categorical variables given only the aggregate information. However, current EI techniques suffer from major shortfalls in the assumptions that are required to be met. Recently, the development of the aggregate association index (AAI) has allowed the analyst to quantify the overall extent of association between two dichotomous variables without violating the assumptions underlying EI techniques. This presentation considers a number of issues concerning the analysis of association for aggregate data, with the AAI at its heart. One issue is the impact of the sample size on the magnitude of the AAI; two different strategies are suggested to cope with this issue. The first is to consider a new index, referred to as the aggregate informative index (AII), which quantifies the extent of information that lies at the marginal level of a 2x2 contingency table; the F-statistic is also considered to evaluate the statistical significance of that information, and Selikoff's (1981) asbestosis data is used to demonstrate the applicability of the AII and the F-statistic. The second is an adjustment that minimises the effect of sample size on the AAI directly, leading to the adjusted AAI, which is considered in the context of Fisher's (1935) criminal twin data.
Keywords: aggregate association index, aggregate data, ecological inference, F-statistic, 2x2 contingency tables
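To fix notation for the aggregate-data setting, consider a single 2x2 table in which only the row totals n_{1\bullet}, n_{2\bullet} and the first column total n_{\bullet 1} are observed. The unobserved conditional proportion P_1 = n_{11}/n_{1\bullet} is then only known to lie within the classical bounds

    \max\!\left(0, \frac{n_{\bullet 1} - n_{2\bullet}}{n_{1\bullet}}\right) \le P_1 \le \min\!\left(1, \frac{n_{\bullet 1}}{n_{1\bullet}}\right),

and indices such as the AAI summarise the evidence for association across this admissible range; the exact forms of the AAI and AII are not reproduced here.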
1 Mar | Lucy Leigh, The University of Newcastle | Developing new analytic approaches to deal with complex missing data in longitudinal analysis: with application in the Australian Longitudinal Study of Women’s Health (ALSWH)
Missing data are a common problem in longitudinal research and, if not dealt with correctly, can lead to biased parameter estimates and variances. Incorrect methods for dealing with missing data are still commonplace. This research is based on analysis of the old cohort (70-75 yrs at baseline) of the Australian Longitudinal Study of Women’s Health (ALSWH; www.ALSWH.org.au). Sleep data from five surveys will be used to extract latent sleep patterns, and the relationship between sleep and other health measures such as morbidities and mortality will be investigated. There is a highly complex pattern of missing data within the ALSWH data set: dropout due to death/ill-health, loss to follow-up, other dropout (random/non-random), survey non-response (random/non-random), missing covariate data as well as missing responses (random/non-random), item non-response (random/non-random) and mixed variable metrics. No current method for dealing with missing data fully addresses the complexity of missing data present in the ALSWH data set. The novel contribution of this research will therefore be the development of an analytical approach that deals with this complexity. This research will combine and build on the current literature involving pattern mixture models (PMM). It will combine features of the mixture-PMM, Bayesian estimation, and the multiple imputation PMM (MI-PMM), and extend them by allowing for both Missing at Random (MAR) and Not Missing at Random (NMAR) dropout, as well as MAR and NMAR intermittent missingness. Finally, it may be of use to combine the developed PMM with a joint model, using multi-state modelling or survival analysis to model the dropout due to death/ill-health. No previous work has attempted to tackle such a complex array of missing data comprehensively.
Keywords: longitudinal, missing, PMM, latent class, missing covariates, monotone, non-monotone
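As background to the modelling strategy described, a pattern-mixture model factorises the joint distribution of the responses Y and the missingness indicators R by conditioning on the missingness pattern, in contrast to a selection model, which factorises the other way:

    f(y, r \mid \theta, \psi) = f(y \mid r, \theta)\, f(r \mid \psi) \quad \text{(pattern mixture)},
    f(y, r \mid \theta, \psi) = f(y \mid \theta)\, f(r \mid y, \psi) \quad \text{(selection)}.

The pattern-specific parameters in the first factorisation are typically only partially identified, which is why identifying restrictions, informative priors or multiple imputation (as in the MI-PMM mentioned above) are required.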
22 Mar | Prof Eric Beh, The University of Newcastle | On some recent advances in correspondence analysis: A Behsian approach :-)
Despite not being a hot topic of discussion amongst Australian (and many other English-speaking) statisticians, correspondence analysis has gained an international reputation as a versatile and intuitive approach for graphically summarising the association between categorical variables. Its popularity is especially evident in Europe, and also amongst researchers in the ecological and marketing research disciplines. Over the past decade or so there has been increasing attention paid to the methodological development of correspondence analysis and its application in Australia. This talk will describe one very biased perspective on this development: mine. In particular, I will pay special attention to developments that have been published, and that have been in development, since 2012. I will also provide some insight into future developments.
26 Apr | Dr Frank Tuyl, The University of Newcastle | Can we please agree on this interval for the binomial parameter?
For Bayesian estimation of the binomial parameter, when the aim is to "let the data speak for themselves", the uniform or Bayes-Laplace prior appears preferable to the reference/Jeffreys prior recommended by objective Bayesians like Berger and Bernardo. Frequentist confidence intervals here tend to be "exact" or "approximate", aiming for either minimum or mean coverage to be close to nominal; the latter criterion tends to be preferred, subject to "reasonable" minimum coverage. I will first reiterate examples of how the highest posterior density (HPD) credible interval based on the uniform prior appears to outperform both common approximate intervals and intervals based on the Jeffreys prior, which usually represent credible intervals in review articles. Second, an important aspect of the recommended interval is that it may be seen to be invariant under transformation when taking into account the likelihood function. I will also show, however, that this use of the likelihood does not always lead to excellent, or even adequate, frequentist coverage. Third, this approach may be extended to nuisance parameter cases by considering an "appropriate" likelihood of the parameter of interest. For example, quantities arising from the 2x2 contingency table (e.g. the odds ratio and relative risk) are important practical applications, apparently leading to intervals with better frequentist performance than that found for HPD or central credible intervals. Preliminary results suggest the same for "difficult" problems such as the ratio of two Normal means ("Fieller-Creasy") and the binomial N problem.
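As a concrete illustration of the two priors being compared (a sketch of my own, not the speaker's code), the following R snippet computes equal-tailed 95% credible intervals for the binomial parameter under the uniform Beta(1,1) and Jeffreys Beta(1/2,1/2) priors, and checks their frequentist coverage at a chosen true value of p; an HPD interval would require a small additional optimisation step.

    # Equal-tailed 95% credible interval for p after x successes in n trials,
    # under a Beta(a, a) prior (a = 1: uniform/Bayes-Laplace; a = 0.5: Jeffreys).
    cred_int <- function(x, n, a = 1, level = 0.95) {
      alpha <- (1 - level) / 2
      qbeta(c(alpha, 1 - alpha), x + a, n - x + a)
    }

    cred_int(0, 10, a = 1)    # uniform/Bayes-Laplace prior, 0 successes in 10 trials
    cred_int(0, 10, a = 0.5)  # Jeffreys prior

    # Simple frequentist coverage check at a fixed true p.
    coverage <- function(p, n, a, level = 0.95, reps = 10000) {
      x <- rbinom(reps, n, p)
      ints <- sapply(x, cred_int, n = n, a = a, level = level)
      mean(ints[1, ] <= p & p <= ints[2, ])
    }
    coverage(0.05, 20, a = 1)
    coverage(0.05, 20, a = 0.5)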
24 May | Mr Jim Irish, The University of Newcastle | Some studies of surname frequency data (PhD confirmation seminar)
Jim’s research deals with four aspects of the analysis of some very large samples, or indeed population data, on surname frequencies.
There has been a recent revival of interest in the mathematics of surname frequencies, much of which has been published in the physics literature (for a good reason!). The earliest papers appeared in the 1840s, followed by two important papers in 1875 and two in 1931, after which the topic was largely forgotten until about 2001. After outlining his proposed sequence of thesis chapters and the other items covered in the written report prepared for the Confirmation Committee, Jim will outline what he sees as the problems with current methods of identifying the form of distribution for rank-frequency distributions fitted to discrete data (such as word counts), and with the associated parameter estimation methods. The erroneous identification and estimation methods date from George Zipf’s 1932 book on word frequencies and a misunderstanding of a paper by Benoit Mandelbrot.
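For context, the two rank-frequency forms most often at issue in this literature are Zipf's law and Mandelbrot's generalisation of it, under which the expected frequency of the item of rank r is taken to be proportional to

    f(r) \propto r^{-s} \quad \text{(Zipf)}, \qquad f(r) \propto (r + q)^{-s} \quad \text{(Zipf-Mandelbrot)},

with s > 0 and shift q \ge 0; the identification and estimation issues referred to above concern how such forms, and their parameters, are fitted to discrete frequency data such as surname or word counts.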
31 May | Dr Paul Rippon, The University of Newcastle | Application of Smooth Tests of Goodness of Fit to Generalized Linear Models
Statistical models are an essential part of data analysis across many diverse fields. However, it is important to critically assess any fitted model, confirming that the model really is compatible with the data, before meaningful conclusions can be drawn. Generalized linear models (GLMs) provide a flexible modelling framework encompassing many commonly used models, including the normal linear model, logistic regression model and Poisson regression model. This talk will describe how the smooth testing concept (originally proposed by Neyman (1937) and further developed by Rayner et al. (2009), among others) can be used to test the distributional assumption in a GLM. It will summarize some of the key points from the recently submitted PhD thesis, including results of power and size studies, the practical interpretation of smooth tests in the model development context, and the development of an R package which implements the smooth test in a form that can be easily applied to models fitted using the standard glm() function within the R statistical computing environment.
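The R package referred to is not named in the abstract, but the underlying idea can be sketched: fit a GLM, apply the probability integral transform using the fitted distribution, and test the transformed values for uniformity with Neyman's smooth test based on orthonormal (shifted Legendre) polynomial components. The sketch below is my own illustration for a Gaussian GLM only, ignoring the effect of parameter estimation on the null distribution (which the full methodology accounts for); it is not the thesis's implementation, and the variable names in the usage comment are placeholders.

    # Neyman smooth test of uniformity of order 3 applied to PIT values
    # from a fitted Gaussian GLM.  Under H0 (and ignoring estimation effects)
    # the statistic is approximately chi-squared with 3 degrees of freedom.
    smooth_test_gaussian_glm <- function(fit) {
      y     <- fit$y
      mu    <- fitted(fit)
      sigma <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
      u     <- pnorm(y, mean = mu, sd = sigma)   # probability integral transform
      n     <- length(u)
      # Orthonormal shifted Legendre polynomials on (0, 1)
      h1 <- sqrt(3) * (2 * u - 1)
      h2 <- sqrt(5) * (6 * u^2 - 6 * u + 1)
      h3 <- sqrt(7) * (20 * u^3 - 30 * u^2 + 12 * u - 1)
      V  <- colSums(cbind(h1, h2, h3)) / sqrt(n) # smooth components
      stat <- sum(V^2)
      c(statistic = stat, p.value = pchisq(stat, df = 3, lower.tail = FALSE))
    }

    # Usage with the standard glm() function (placeholder data and formula):
    # fit <- glm(y ~ x1 + x2, family = gaussian, data = mydata)
    # smooth_test_gaussian_glm(fit)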
Semester 2 |
Date | Presenter | Title | Abstract
9 Aug | Dr Katherine Uylangco, The University of Newcastle | Exogenous factors and the incidence of criminality
(From a joint article with Dr Paul Docherty.) An individual’s propensity to commit crime can be considered a trade-off between the expected benefits and costs of the criminal activity (Becker, 1968). We examine this trade-off by undertaking an empirical analysis of whether variables representing the business cycle, law enforcement and demographics explain variance in the incidence of property crimes and crimes against the person. The impact of the weather on crime rates is also examined, as prior studies have suggested a positive relationship between temperature and violent crime (Field, 1992). A comprehensive model that includes all of these exogenous variables is used to explain the variation in changes in property crimes and crimes against the person in an Australian context. We also extend the existing literature by developing a forecast model that performs well out-of-sample. The success of this model has important policy implications, as it allows law enforcement agencies to predict and prepare for future crime rates.
11 Oct | Dr Patrick McElduff, Hunter Medical Research Institute | Causal diagrams mediate variable selection in regression models
Researchers are often confronted with a situation in which they have a well-defined outcome and a large number of potential predictor variables. It is tempting to throw all the predictor variables into a regression model and/or to use stepwise procedures to select the most parsimonious model. As well as having low coverage probabilities, this approach can produce spurious results. In this talk I discuss the use of causal diagrams (or Directed Acyclic Graphs) as a tool to assist researchers in fitting more appropriate regression models. The talk will cover issues such as: adjusting for confounding; mediation analysis; proper adjustment in matched case-control studies; the potential for hidden confounding in surveys and cohort studies; and instrumental variables analysis.
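To make the confounding-versus-mediation distinction concrete (my illustration, not the speaker's), the R simulation below shows that adjusting for a common cause Z removes bias in the estimated effect of X on Y, whereas adjusting for a mediator M removes part of the very effect being estimated.

    set.seed(1)
    n <- 10000

    # Confounding: Z causes both X and Y; the true direct effect of X on Y is 1.
    z <- rnorm(n)
    x <- 0.8 * z + rnorm(n)
    y <- 1.0 * x + 1.5 * z + rnorm(n)
    coef(lm(y ~ x))["x"]        # biased: the back-door path X <- Z -> Y is open
    coef(lm(y ~ x + z))["x"]    # close to 1: adjusting for the confounder

    # Mediation: X affects Y partly through M; the total effect of X on Y is
    # 0.5 (direct) + 0.7 * 1.0 (through M) = 1.2.
    x2 <- rnorm(n)
    m  <- 0.7 * x2 + rnorm(n)
    y2 <- 0.5 * x2 + 1.0 * m + rnorm(n)
    coef(lm(y2 ~ x2))["x2"]     # close to 1.2: total effect
    coef(lm(y2 ~ x2 + m))["x2"] # close to 0.5: indirect path blocked by adjustment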
25 Oct | Prof John Rayner, The University of Newcastle | Extended analysis of partially ordered multi-factor designs
(John Rayner, John Best & Olivier Thas) For multifactor experimental designs in which the levels of at least one of the factors are ordered, we demonstrate the use of components that provide a deep nonparametric scrutiny of the data. The components assess generalized correlations, and the resulting tests include and extend the Page and umbrella tests.
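As a reminder of the classical special case mentioned, Page's test for k ordered treatments in a randomized block design, with treatment rank sums R_1, ..., R_k, uses the statistic

    L = \sum_{j=1}^{k} j\, R_j,

large values of which support a monotone treatment effect; the components referred to in the abstract generalise this kind of ordered comparison, via generalized correlations, to multi-factor designs.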
25 Oct | Prof John Rayner, The University of Newcastle | Smooth Tests of Fit for Gaussian Mixtures
(Thomas Seusse, John Rayner & Olivier Thas) Smooth tests were developed to test for a finite mixture distribution using two smooth models. Each has its own strengths and weaknesses. These tests are demonstrated by testing for a mixture of two normal distributions. Some sizes and powers are given, as is an example.
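For concreteness, the null model in the two-component normal case is the mixture density

    f(x) = \pi\, \phi(x; \mu_1, \sigma_1^2) + (1 - \pi)\, \phi(x; \mu_2, \sigma_2^2), \qquad 0 \le \pi \le 1,

where \phi(\cdot; \mu, \sigma^2) is the normal density; a smooth test embeds this density in a larger order-k family and tests whether the additional terms are needed.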
29 Oct | Shanjeeda Shafi, The University of Newcastle | Mathematical Methods for Molecular Drug discovery and biomarker identification: towards new paradigms of drug efficacy and safety
(PhD confirmation seminar) High-throughput docking of small-molecule ligands (candidate drugs) into high-resolution protein structures is now standard in computational approaches to drug discovery. A candidate drug is usually a small molecule (~50 atoms) which acts by modifying the metabolic activity of a protein. By current estimates, it costs more than $1.3 billion and takes 12-15 years to bring a new drug to market. Predicting druggability and prioritising certain disease-modifying targets for the drug development process is of high practical relevance in pharmaceutical research. Druggability of a molecule is characterised in part by its satisfying Lipinski's Rule-of-Five, which identifies several key physicochemical properties, such as mass and hydrophobicity, that should be considered for compounds with oral delivery (Lipinski and Hopkins, 2004), and by other characteristics which determine whether a chemical compound can be orally active in humans. There is still much debate in the industry as to what constitutes a 'good' hit, that is, one that will remain rule-of-five compliant after optimization. This is very difficult to assess with hits derived from high-throughput screening, as a significant fraction of the molecules may be interacting sub-optimally with the receptor. There is still no consensus as to what constitutes an acceptable lead in terms of potency, molecular mass and hydrophobicity. Another challenge is to identify regions of chemical space that contain biologically active compounds for given biological targets (e.g. proteases, kinases, calpains). Lipinski and Hopkins (2004) suggested that within the continuum of chemical space there should be discrete regions occupied by compounds with specific affinities towards particular biological targets. The question of which variables (or coordinate systems) would facilitate such segregation, however, was not delineated in their seminal paper and is as yet undetermined. The main aim of this research is to create superior indices for the estimation of the true binding affinity of ligands and to distinguish between high-affinity binders and non-essential binders of proteins. New cutpoints will be considered to create a violation scoring function for each predictor, using different approaches such as Bayesian mixtures (BayesMix), a non-Bayesian hybrid method based on model-based clustering with discriminant analysis, and cutpoint methods. Recently, an alternative score for violations based on Lipinski's 4 variables, but using different cutpoints (Hudson et al., 2012), categorized molecules as druggable if they exhibited no more than 4 violations; our results to date suggest an improved cutpoint of 5. We shall develop new mixture-based modelling (nonlinear and linear, Bayesian and non-Bayesian) to identify poor and good drug candidates and to form the basis of constructs for visualizing these in chemospace. We are currently creating constructs for visualizing binders in chemospace; visualization methods such as PCA and correspondence analysis show promise, while hybrids of SOMs with HMMs and mixtures with Dirichlet priors are yet to be investigated. This research has the potential to significantly reduce false classification of drugs and therefore to improve drug design, where an appropriate predictor set needs to be identified for new drug innovations.
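As a small illustration of the kind of violation score discussed above (using the standard Rule-of-Five thresholds rather than the revised cutpoints under investigation, and hypothetical column names), violations can be counted per molecule in R as follows.

    # Count Rule-of-Five violations for each molecule in a data frame with
    # (hypothetical) columns: mw (molecular weight), logp (octanol-water
    # partition coefficient), hbd (H-bond donors), hba (H-bond acceptors).
    ro5_violations <- function(mol) {
      with(mol, (mw > 500) + (logp > 5) + (hbd > 5) + (hba > 10))
    }

    # Example with made-up values for two molecules:
    mols <- data.frame(mw = c(320, 610), logp = c(2.1, 6.3),
                       hbd = c(2, 6), hba = c(5, 12))
    ro5_violations(mols)   # 0 violations vs 4 violations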
22 Nov | James Totterdell, The University of Newcastle | Bayesian Hidden Markov Model for Homogeneous Segmentation of Heterogeneous DNA Sequences
The advent of rapid DNA sequencing techniques has led to an exponential increase in the quantity of nucleotide sequences in online databases such as GenBank. In the face of such large data sets, statistical methods provide efficient means to screen for and identify structure within DNA sequences. Perhaps the most fundamental property of a DNA sequence is its base composition, and it is known that the base composition of DNA sequences is not uniform. Models with a homogeneous probabilistic structure do not adequately describe variation in base composition; fluctuations in base composition are better explained by alternating homogeneous domains, called segments. Segmentation models aim to partition compositionally heterogeneous domains into homogeneous segments which may be reflective of biological function. A number of different segmentation approaches have been proposed, such as moving windows, maximum likelihood estimation and recursive segmentation. Due to the latent nature of the segments, a natural approach to segmentation uses hidden Markov models (HMMs). An HMM consists of an observed stochastic process whose distribution is influenced by an underlying unobserved Markov chain. This thesis investigated the use of HMMs in DNA segmentation where parameter estimation was performed using Bayesian methods. The model to be considered was derived and algorithms that could be used for estimation were presented. Functions following these algorithms were implemented in R and assessed through a simulation study. The functions were then used in model estimation for real DNA sequences and the results were compared with past efforts in this area.
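A minimal sketch of the kind of model described (my illustration, not the thesis code): a two-state hidden Markov chain in which each state emits nucleotides with its own base composition, so that runs of the hidden state correspond to compositionally homogeneous segments.

    set.seed(42)
    n_sites <- 2000
    states  <- c("AT-rich", "GC-rich")
    bases   <- c("A", "C", "G", "T")

    # Transition matrix of the hidden chain (rows sum to 1); large diagonal
    # entries produce long homogeneous segments.
    A <- matrix(c(0.995, 0.005,
                  0.010, 0.990), nrow = 2, byrow = TRUE)

    # State-specific emission probabilities over A, C, G, T.
    E <- rbind("AT-rich" = c(0.35, 0.15, 0.15, 0.35),
               "GC-rich" = c(0.15, 0.35, 0.35, 0.15))

    s <- integer(n_sites)
    s[1] <- 1
    for (t in 2:n_sites) s[t] <- sample(1:2, 1, prob = A[s[t - 1], ])
    dna <- sapply(s, function(k) sample(bases, 1, prob = E[k, ]))

    table(state = states[s], base = dna)  # composition differs between segment types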