LDA: documents with different sizes

Hey guys,
In the end, I decided to do a tutorial on Autoencoding Variational Inference For Topic Models as suggested by @eb8680 … I think this one is gonna be better than the last one about CVAEs :slight_smile:

I need help with the following problem. The base code in examples/lda.py has nice vectorized nested plates looping through documents (L50) and words within each document (L56), as it assumes all documents have the same length. Unfortunately, this is not a valid assumption for real-world data, where documents have different lengths.

I’m doing the tutorial using NLTK’s Reuters dataset (10,000+ news articles). To feed the model, I pre-processed and vectorized the documents, creating a data structure docs, which is a Python list containing all the documents. Every element in the list is a 1D PyTorch tensor with the indices corresponding to each word in the document. And as documents have different lengths, the tensors have different lengths too.

So, I replaced the 2 nested plates (the blocks that start with "with pyro.plate") by 2 for loops… And it works… The problem is that it is extremely slow. The code below shows it (it’s basically the original code with minor tweaks to deal with the variable-sized docs):

def model(self, docs=None, doc_sum=None):
    # Globals.
    with pyro.plate("topics", self.num_topics):
        topic_weights = pyro.sample("topic_weights", dist.Gamma(1. / self.num_topics, 1.))
        topic_words = pyro.sample("topic_words",
                                  dist.Dirichlet(torch.ones(self.vocab_size) / self.vocab_size))

    # Locals.
    for doc_idx in pyro.plate("documents", len(docs)):
        doc_topics = pyro.sample("doc_topics_%d" % doc_idx,
                                 dist.Dirichlet(topic_weights))
        for word_idx in pyro.plate("words_%d" % doc_idx, len(docs[doc_idx])):
            word_topics = pyro.sample(
                "word_topics_%d_%d" % (doc_idx, word_idx),
                dist.Categorical(doc_topics),
                infer={"enumerate": "parallel"}
            )
            pyro.sample("doc_word_%d_%d" % (doc_idx, word_idx),
                        dist.Categorical(topic_words[word_topics]),
                        obs=docs[doc_idx][word_idx])

Any ideas on how to accelerate this? I thought about padding the tensors with a dummy index, but I’m not sure whether that will mess up the algorithm/results…
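
A minimal sketch of the padding idea, in plain PyTorch and on toy data: pad every document to the longest length with a dummy index and keep a boolean mask of the real positions. The toy docs, vocab_size, pad_idx and word_log_probs below are made up for illustration. The comparison at the end only shows that the masked, padded log-probabilities sum to the same value as the per-document loop, so the padding itself should not change the objective as long as the mask is applied.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy stand-in for `docs`: a list of 1D tensors of word indices (made-up data).
docs = [torch.tensor([0, 2, 1]), torch.tensor([3]), torch.tensor([1, 1, 0, 2])]
vocab_size = 4
pad_idx = 0  # assumption: any valid index works, since padded slots are masked out

lengths = torch.tensor([len(d) for d in docs])
padded = pad_sequence(docs, batch_first=True, padding_value=pad_idx)  # (num_docs, max_len)
# mask[i, j] is True when position j is a real word of document i.
mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]

# Hypothetical per-document word distributions, only used to compare the two sums.
torch.manual_seed(0)
word_log_probs = torch.log_softmax(torch.randn(len(docs), vocab_size), dim=-1)

# Vectorized: look up the log-probability of every (padded) word, then zero out the padding.
padded_log_probs = word_log_probs.gather(1, padded)
masked_total = (padded_log_probs * mask).sum()

# Reference: the slow per-document loop over the ragged list.
loop_total = sum(word_log_probs[i][d].sum() for i, d in enumerate(docs))
```

In Pyro, the same mask would typically be attached to the per-word sample sites, for example with the distribution's .mask(mask) method or with poutine.mask, so that padded positions contribute zero log-probability.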

Thanks!

i don’t know about your specific problem but i tried to implement this model 2 years ago and i found it incredibly finicky, often yielding poor undiverse topics, etc. so i’m not sure if this is a great model for you to target, though of course i could be wrong

Yeah… Now that I solved the cuda piece (thanks for the help!), I’m seeing 15 min/epoch… it’s extremely slow even with GPUs, and that’s because of the 2 nested for loops… so, I must find another way to implement it in Pyro (I’ll try padding the sentences with a dummy index, inspired by how LSTM networks are trained on variable-sized sentences…)

Curiously, I’ve seen pure PyTorch implementations that work perfectly and are very fast. The problem with these implementations is that they mix sampling/inference code with the model code, and they use a handcrafted loss/objective function, which works but kind of defeats the purpose of a universal PPL. So, IMHO it’s very important to prove it is possible to implement these models in Pyro: it tests the “universal” in “Deep Universal Probabilistic Programming”…

I’ll keep trying, thanks for the perspective! Cheers!

I’m about to give up… I don’t think the problem is in the specific model… Maybe the problem is in the method… After several experiments, I’m starting to agree with this guy:

@carlossouza is it still performance issues you’re having trouble with, convergence/numerical issues, or poor results? I’m inclined to agree with your point above that we ought to have at least one nice example of LDA working reliably in Pyro, even if the paper I pointed you to originally ends up not being the right starting point.

I also tried to implement the model with different sizes in Pyro following the tutorial, and besides being slow it produced really poor results [I even hit some bugs in further steps: Latent Dirichlet Allocation Model and Possible bugs: Predictive vs Subsampling, Enumerate Error]. I assume it is a problem with Pyro and not with the model, because I also implemented LDA in Python from scratch (using Collapsed Gibbs Sampling) and it works reasonably well on the same dataset (a bunch of abstracts from articles). Is your problem only with performance or also with results?

Hi @eb8680_2 ! Poor results… Actually, after doing more experiments, I’m inclined to agree with @cprogrammer: there’s something odd with my Pyro implementation. I managed to get a pure PyTorch implementation to work (i.e. it produces coherent topics), whilst my port of this implementation to Pyro didn’t (i.e. it produced topics with no coherence).

Here are the implementations. It takes a couple of minutes to train the models using GPUs, so I’m sharing them using Kaggle’s free GPUs:

ProdLDA pure PyTorch implementation working

ProdLDA port to Pyro not working

Both implementations use the exact same dataset… If you guys can spot the error & help fix it, that would be great: I already wrote a nice tutorial around the code (which I’m not including to save space)… it would be sad to throw it away :slight_smile:

Cheers!

Hey guys,
I think I found 2 bugs:

  • In the output of the decoder, I replaced log_softmax by softmax
  • In the guide I use dist.LogNormal(theta_loc, (0.5 * theta_scale).exp()) (using theta_scale directly generates NaN losses), but I should not apply the same transformation in the model. In the model I now sample theta with simply dist.LogNormal(theta_loc, theta_scale).
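
To make the two points concrete, here is a small sketch in plain PyTorch. It rests on two assumptions that the thread does not confirm: that the decoder output parameterizes a multinomial over word counts, and that theta_scale in the guide is a log-variance (one common reason for the (0.5 * x).exp() pattern). The tensors are made-up values, not taken from the notebooks.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Multinomial

logits = torch.tensor([2.0, -1.0, 0.5, 0.0])  # hypothetical decoder output for one document
counts = torch.tensor([3.0, 0.0, 2.0, 5.0])   # hypothetical word counts for that document
total = int(counts.sum())

probs = F.softmax(logits, dim=-1)          # non-negative, sums to 1: valid for `probs=`
log_probs = F.log_softmax(logits, dim=-1)  # all <= 0: only valid for `logits=`

# The same likelihood, parameterized two ways; mixing them up changes the model.
lp_from_probs = Multinomial(total, probs=probs).log_prob(counts)
lp_from_logits = Multinomial(total, logits=log_probs).log_prob(counts)

# Assumption: the encoder emits a log-variance, so the standard deviation is exp(0.5 * logvar).
logvar = torch.tensor([-2.0, 0.0, 1.0])
std = (0.5 * logvar).exp()
```

If the decoder output is passed to a distribution through a probs argument, it has to be the softmax output; the log_softmax output would only be appropriate for a logits argument.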

With these 2 modifications (code here), it looks like the topics are more coherent (I’m eyeballing). Here are some examples. Please take a look and let me know if you think it’s good enough :slight_smile:

Oh, btw, there’s another problem… Now that finally the topics look like they are coherent, every once in a while the training “explodes” with NaN losses… don’t know how to fix that yet…

Top 20 words of some of the 100 topics:

[‘kashmir’ ‘hindu’ ‘sikh’ ‘srinagar’ ‘militant’ ‘inkatha’ ‘punjab’
‘separatist’ ‘amritsar’ ‘birendra’ ‘policeman’ ‘secession’ ‘india’
‘johannesburg’ ‘ira’ ‘gillespie’ ‘flared’ ‘pakistani’ ‘moslem’
‘superintendent’]

[‘lenin’ ‘prague’ ‘trotsky’ ‘havana’ ‘seoul’ ‘korea’ ‘zinoviev’ ‘korean’
‘roh’ ‘ctk’ ‘czechoslovak’ ‘peninsula’ ‘czechoslovakia’ ‘stalin’
‘disperse’ ‘revolution’ ‘ukrainian’ ‘ershad’ ‘pakistan’ ‘sinhalese’]

[‘abortion’ ‘defeated’ ‘incest’ ‘reconciliation’ ‘botha’ ‘nicaraguan’
‘legalized’ ‘robb’ ‘apartheid’ ‘impoverished’ ‘blanchard’ ‘brooklyn’
‘kaunda’ ‘succeed’ ‘ineligible’ ‘colombia’ ‘tragic’ ‘freeze’ ‘favor’
‘unborn’]

[‘soybean’ ‘brazil’ ‘usda’ ‘argentina’ ‘ton’ ‘semiconductor’ ‘harvest’
‘output’ ‘metric’ ‘forecast’ ‘chinn’ ‘vegetable’ ‘beef’ ‘opec’ ‘colony’
‘soared’ ‘strongest’ ‘sahara’ ‘spanish’ ‘midway’]

[‘nyse’ ‘industrials’ ‘outnumbered’ ‘composite’ ‘quarterly’ ‘dow’ ‘listed’
‘volume’ ‘rsqb’ ‘sluggish’ ‘index’ ‘lsqb’ ‘unchanged’ ‘temperature’
‘nationwide’ ‘geological’ ‘telegraph’ ‘noontime’ ‘minus’ ‘transistor’]

[‘royal’ ‘injury’ ‘offended’ ‘injured’ ‘hospital’ ‘aboard’ ‘parked’
‘jackpot’ ‘ireland’ ‘injuring’ ‘flying’ ‘scene’ ‘evacuate’ ‘toll’
‘coffee’ ‘struck’ ‘driver’ ‘exploded’ ‘character’ ‘trailer’]

[‘nobel’ ‘birthday’ ‘celebration’ ‘hirohito’ ‘sakharov’ ‘prague’
‘treasure’ ‘zealand’ ‘asia’ ‘hitler’ ‘royal’ ‘mansfield’ ‘rabuka’ ‘prize’
‘hunger’ ‘fijian’ ‘indian’ ‘cheering’ ‘sister’ ‘china’]

[‘franc’ ‘zurich’ ‘midmorning’ ‘bid’ ‘ounce’ ‘fixed’ ‘hong’ ‘bullion’
‘troy’ ‘kong’ ‘london’ ‘yen’ ‘gold’ ‘silver’ ‘compared’ ‘cdy’ ‘late’
‘swiss’ ‘dutch’ ‘mark’]

[‘wild’ ‘flower’ ‘repair’ ‘campbell’ ‘endangered’ ‘chain’ ‘airplane’
‘charging’ ‘payless’ ‘airline’ ‘oman’ ‘philippine’ ‘route’ ‘video’
‘flight’ ‘aeronautics’ ‘elephant’ ‘conservation’ ‘owner’ ‘dealer’]

[‘elliot’ ‘busfield’ ‘elevator’ ‘cart’ ‘hose’ ‘love’ ‘plate’ ‘cream’
‘ryan’ ‘gown’ ‘hunter’ ‘midshipman’ ‘ballet’ ‘danced’ ‘shark’ ‘tired’
‘honored’ ‘academy’ ‘alice’ ‘broken’]

[‘texaco’ ‘nyse’ ‘nabisco’ ‘composite’ ‘polaroid’ ‘unchanged’ ‘shamrock’
‘icahn’ ‘lsqb’ ‘kravis’ ‘rjr’ ‘actively’ ‘wilshire’ ‘kohlberg’ ‘kraft’
‘takeover’ ‘counter’ ‘nasdaq’ ‘industrials’ ‘blue’]

[‘air’ ‘plane’ ‘gallon’ ‘boeing’ ‘cdy’ ‘pump’ ‘bound’ ‘unleaded’ ‘marine’
‘sea’ ‘command’ ‘collided’ ‘jet’ ‘lebanon’ ‘booster’ ‘carrying’
‘airplane’ ‘islamic’ ‘aeronautics’ ‘clr’]

[‘cemetery’ ‘trible’ ‘vatican’ ‘nobel’ ‘emigrate’ ‘coffin’ ‘haiti’
‘archdiocese’ ‘toussaint’ ‘saint’ ‘richards’ ‘chafee’ ‘cathedral’
‘haitian’ ‘roman’ ‘honoring’ ‘pope’ ‘weicker’ ‘monsignor’ ‘rape’]

[‘owned’ ‘gallery’ ‘semiconductor’ ‘sale’ ‘quarterly’ ‘memory’ ‘whirlpool’
‘usg’ ‘pain’ ‘auction’ ‘census’ ‘partner’ ‘earnings’ ‘smithkline’
‘acquisition’ ‘net’ ‘share’ ‘product’ ‘aspirin’ ‘based’]

[‘angola’ ‘libyan’ ‘cuban’ ‘vanuatu’ ‘guerrilla’ ‘raid’ ‘asylum’ ‘ugandan’
‘curfew’ ‘ershad’ ‘plot’ ‘bombing’ ‘australian’ ‘iranian’ ‘sidon’
‘buckey’ ‘monitored’ ‘hepatitis’ ‘angolan’ ‘nephew’]

[‘cftc’ ‘commodity’ ‘sec’ ‘merc’ ‘regulatory’ ‘brady’ ‘future’ ‘silverado’
‘rostenkowski’ ‘mercantile’ ‘firm’ ‘dominate’ ‘index’ ‘swing’ ‘colorado’
‘bond’ ‘junk’ ‘volatility’ ‘trading’ ‘polaroid’]

[‘aristide’ ‘heseltine’ ‘haiti’ ‘haitian’ ‘sajudis’ ‘totally’ ‘accuse’
‘parliamentary’ ‘hurd’ ‘lithuania’ ‘politburo’ ‘khrushchev’ ‘walesa’
‘mazowiecki’ ‘suitable’ ‘thatcher’ ‘luck’ ‘lithuanian’ ‘balloting’
‘secession’]

[‘cent’ ‘pork’ ‘soybean’ ‘bushel’ ‘cattle’ ‘lespinasse’ ‘mixed’ ‘delivery’
‘feeder’ ‘livestock’ ‘nikkei’ ‘lower’ ‘yen’ ‘corn’ ‘witter’ ‘mercantile’
‘grain’ ‘victor’ ‘unleaded’ ‘bullish’]

[‘manila’ ‘rescue’ ‘mine’ ‘population’ ‘boucher’ ‘aquino’ ‘anwar’
‘gadhafi’ ‘corazon’ ‘flown’ ‘blasted’ ‘freed’ ‘shaft’ ‘sassan’ ‘evacuate’
‘exile’ ‘mubarak’ ‘squad’ ‘hull’ ‘baghdad’]

[‘prague’ ‘czechoslovakia’ ‘hunthausen’ ‘clinton’ ‘carson’ ‘turkey’
‘rinfret’ ‘sadat’ ‘ruth’ ‘duarte’ ‘czechoslovak’ ‘portuguese’ ‘chile’
‘morocco’ ‘polisario’ ‘robb’ ‘legalizing’ ‘holy’ ‘invited’ ‘vatican’]

[‘momentum’ ‘cuomo’ ‘endorse’ ‘wisconsin’ ‘tennessee’ ‘deadlocked’
‘distant’ ‘gore’ ‘coupled’ ‘marino’ ‘regained’ ‘andreas’ ‘mario’
‘colorado’ ‘koch’ ‘applies’ ‘fueled’ ‘primary’ ‘convention’ ‘attacking’]

[‘index’ ‘industrials’ ‘billboard’ ‘chicken’ ‘succeeds’ ‘redman’ ‘average’
‘exchange’ ‘rca’ ‘nikkei’ ‘minimum’ ‘lowering’ ‘elliot’ ‘coniston’
‘stock’ ‘nyse’ ‘pm’ ‘netherlands’ ‘outnumbered’ ‘testament’]

[‘bail’ ‘kahane’ ‘nosair’ ‘magistrate’ ‘van’ ‘grammer’ ‘nuys’ ‘bombing’
‘rabbi’ ‘indictment’ ‘diamond’ ‘wounding’ ‘false’ ‘custom’ ‘arrested’
‘explosive’ ‘submitting’ ‘gaza’ ‘rehabilitation’ ‘juan’]

[‘vincennes’ ‘ice’ ‘altitude’ ‘fatigue’ ‘terminal’ ‘airbus’ ‘inlet’ ‘ship’
‘flown’ ‘pilot’ ‘boeing’ ‘ash’ ‘takeoff’ ‘fighter’ ‘bomber’ ‘patmos’
‘monastery’ ‘plane’ ‘col’ ‘aircraft’]

[‘koppers’ ‘lorimar’ ‘acquisition’ ‘kinnock’ ‘promoted’ ‘justice’ ‘warner’
‘deductible’ ‘appeal’ ‘stamp’ ‘wrongdoing’ ‘dynamic’ ‘violates’
‘repurchase’ ‘bordallo’ ‘craft’ ‘squad’ ‘acquire’ ‘mariel’ ‘processed’]

[‘billboard’ ‘elevator’ ‘hose’ ‘busfield’ ‘postal’ ‘accepted’ ‘cream’
‘hudson’ ‘elliot’ ‘nikkei’ ‘naples’ ‘trapped’ ‘guinness’ ‘gown’
‘catherine’ ‘cart’ ‘constable’ ‘lackluster’ ‘diet’ ‘honored’]

[‘cent’ ‘silver’ ‘heating’ ‘cattle’ ‘crude’ ‘unleaded’ ‘mercantile’
‘gallon’ ‘pork’ ‘juice’ ‘cocoa’ ‘sweet’ ‘pound’ ‘florio’ ‘eighty’
‘pressured’ ‘delivery’ ‘livestock’ ‘precious’ ‘bundesbank’]

[‘ireland’ ‘crashed’ ‘irish’ ‘ira’ ‘bombing’ ‘sudan’ ‘arrested’ ‘occurred’
‘constabulary’ ‘sowan’ ‘shooting’ ‘injury’ ‘ulster’ ‘richter’ ‘coat’
‘bathroom’ ‘constable’ ‘injuring’ ‘mohawk’ ‘roman’]

[‘incumbent’ ‘mondale’ ‘quayle’ ‘pageant’ ‘bentsen’ ‘education’ ‘graduate’
‘tyler’ ‘tech’ ‘roosevelt’ ‘bono’ ‘liberal’ ‘tory’ ‘essentially’ ‘pie’
‘rarely’ ‘lip’ ‘toussaint’ ‘agrees’ ‘electoral’]

[‘survey’ ‘plus’ ‘tax’ ‘expense’ ‘poll’ ‘tribune’ ‘surveyed’ ‘latest’
‘minus’ ‘auction’ ‘deduction’ ‘job’ ‘premium’ ‘percent’ ‘unchanged’
‘income’ ‘error’ ‘sale’ ‘buyer’ ‘jackpot’]

[‘ussr’ ‘seidon’ ‘duchess’ ‘supersonic’ ‘forman’ ‘zieman’ ‘lindsey’
‘duarte’ ‘hud’ ‘conceicao’ ‘teenager’ ‘p’ ‘rison’ ‘bujang’
‘kristallnacht’ ‘drake’ ‘cbn’ ‘kalugin’ ‘barboza’ ‘batalla’]

[‘statehood’ ‘souter’ ‘cambodia’ ‘rouge’ ‘khmer’ ‘boucher’ ‘montana’
‘lautenberg’ ‘coat’ ‘genuine’ ‘judiciary’ ‘cambodian’ ‘jersey’ ‘changing’
‘beijing’ ‘boharski’ ‘wyo’ ‘comprehensive’ ‘constitutionally’ ‘nunn’]

[‘symphony’ ‘orchestra’ ‘baltimore’ ‘musician’ ‘ticketron’ ‘telecharge’
‘berrigan’ ‘teletron’ ‘knocked’ ‘temperature’ ‘utility’ ‘humidity’
‘musical’ ‘pageant’ ‘mcdermott’ ‘tunnel’ ‘hottest’ ‘cool’ ‘broadway’
‘deborah’]

[‘route’ ‘engine’ ‘digital’ ‘diamond’ ‘byrne’ ‘altitude’ ‘duty’ ‘sgt’
‘wind’ ‘plane’ ‘leg’ ‘hittle’ ‘burton’ ‘bird’ ‘odds’ ‘karpov’ ‘box’
‘driven’ ‘pick’ ‘langley’]

[‘opera’ ‘jackpot’ ‘mouse’ ‘prize’ ‘poem’ ‘apple’ ‘temperature’ ‘theater’
‘bird’ ‘sang’ ‘drawing’ ‘cartoon’ ‘transistor’ ‘aspirin’ ‘bain’ ‘debut’
‘character’ ‘pet’ ‘computer’ ‘packaging’]

[‘revenue’ ‘earnings’ ‘owned’ ‘income’ ‘levin’ ‘exempt’ ‘quarter’ ‘gm’
‘coastamerica’ ‘fiscal’ ‘motor’ ‘medicare’ ‘whirlpool’ ‘repurchase’
‘deduction’ ‘hardware’ ‘acquisition’ ‘fec’ ‘budget’ ‘billion’]

[‘portuguese’ ‘turkish’ ‘robb’ ‘forum’ ‘commissioner’ ‘ratification’
‘nominate’ ‘luxembourg’ ‘irs’ ‘retiring’ ‘rude’ ‘promoted’ ‘invaded’ ‘ec’
‘onetime’ ‘portugal’ ‘bradley’ ‘brussels’ ‘baghdad’ ‘civic’]

[‘ual’ ‘mall’ ‘pedestrian’ ‘drexel’ ‘sand’ ‘spouse’ ‘houstoun’ ‘coniston’
‘wolf’ ‘probate’ ‘estate’ ‘storer’ ‘pension’ ‘charitable’ ‘oral’
‘implant’ ‘deduction’ ‘milken’ ‘taxable’ ‘buyout’]

[‘manila’ ‘pan’ ‘student’ ‘vessel’ ‘brando’ ‘attempted’ ‘brigade’ ‘rabuka’
‘sakharov’ ‘dickman’ ‘sister’ ‘deadly’ ‘buddhist’ ‘squad’ ‘richter’
‘refugee’ ‘protest’ ‘hazelwood’ ‘passport’ ‘aquino’]

[‘philippine’ ‘ortega’ ‘sandinistas’ ‘cardinal’ ‘pittston’ ‘takeshita’
‘obando’ ‘concession’ ‘marcos’ ‘countryside’ ‘subic’ ‘prensa’ ‘tactical’
‘filipino’ ‘formal’ ‘managua’ ‘truce’ ‘insisted’ ‘resistance’ ‘japanese’]

[‘charter’ ‘negotiated’ ‘castro’ ‘mozambique’ ‘dialogue’ ‘stormed’
‘sharon’ ‘bent’ ‘autonomy’ ‘uta’ ‘amid’ ‘nationally’ ‘fidel’ ‘demanding’
‘marxist’ ‘missed’ ‘syria’ ‘stalled’ ‘likud’ ‘shouting’]

[‘steiger’ ‘assignment’ ‘aspirin’ ‘garrity’ ‘ranch’ ‘school’ ‘faculty’
‘fbi’ ‘extortion’ ‘discrimination’ ‘agent’ ‘cano’ ‘hoyt’ ‘eligible’
‘chelsea’ ‘student’ ‘prom’ ‘class’ ‘hispanic’ ‘board’]

[‘nida’ ‘lauer’ ‘robbery’ ‘murder’ ‘grammer’ ‘yates’ ‘stabbed’ ‘raping’
‘girl’ ‘clem’ ‘wound’ ‘sentenced’ ‘rehabilitation’ ‘lobban’ ‘medellin’
‘radakovich’ ‘homicide’ ‘gunshot’ ‘theft’ ‘tosh’]

[‘rust’ ‘shuttle’ ‘launch’ ‘unmanned’ ‘manned’ ‘photo’ ‘payload’
‘vandenberg’ ‘pad’ ‘ballet’ ‘botswana’ ‘nasa’ ‘space’ ‘craft’ ‘pontiff’
‘discovery’ ‘booster’ ‘newscast’ ‘orbit’ ‘aeronautics’]

[‘disapproved’ ‘totaled’ ‘percent’ ‘earnings’ ‘quarter’ ‘net’ ‘survey’
‘poll’ ‘census’ ‘statistical’ ‘handling’ ‘adult’ ‘index’ ‘salary’
‘employed’ ‘bureau’ ‘koppers’ ‘manufacturing’ ‘repurchase’ ‘structural’]

[‘infected’ ‘cell’ ‘cdc’ ‘virus’ ‘lang’ ‘epstein’ ‘infection’ ‘monet’
‘weight’ ‘nikolais’ ‘receptor’ ‘ticketron’ ‘fleisher’ ‘researcher’
‘diagnosed’ ‘disorder’ ‘marrow’ ‘teletron’ ‘hiv’ ‘telecharge’]

[‘baugh’ ‘swaggart’ ‘nosair’ ‘brawley’ ‘abrams’ ‘cruzan’ ‘haldeman’
‘feazell’ ‘tube’ ‘menorah’ ‘gracyalny’ ‘kidnapping’ ‘manslaughter’
‘murder’ ‘bail’ ‘coma’ ‘teitelbaum’ ‘rape’ ‘guardian’ ‘mistrial’]

[‘rangoon’ ‘aoun’ ‘burma’ ‘militia’ ‘hezbollah’ ‘maung’ ‘geagea’ ‘amal’
‘sinhalese’ ‘sein’ ‘lebanese’ ‘beirut’ ‘fighter’ ‘shiite’ ‘kurdish’
‘syrian’ ‘ne’ ‘enclave’ ‘honduran’ ‘hrawi’]

[‘drunk’ ‘cosby’ ‘billboard’ ‘winery’ ‘saved’ ‘license’ ‘stealing’ ‘plate’
‘academy’ ‘dance’ ‘census’ ‘watson’ ‘putnam’ ‘aaron’ ‘ear’ ‘school’
‘compact’ ‘cassette’ ‘snake’ ‘graduation’]

[‘fda’ ‘classroom’ ‘dartmouth’ ‘hanford’ ‘bowen’ ‘monkey’ ‘effective’
‘tissue’ ‘microbe’ ‘faculty’ ‘hatch’ ‘biomedical’ ‘usda’ ‘procedure’
‘essay’ ‘nih’ ‘ag’ ‘substance’ ‘curran’ ‘biological’]

[‘gop’ ‘gov’ ‘sasser’ ‘robb’ ‘dawkins’ ‘thornburgh’ ‘mead’ ‘durenberger’
‘democrat’ ‘republican’ ‘popov’ ‘incumbent’ ‘race’ ‘briefly’ ‘survey’
‘mate’ ‘simpson’ ‘tennessee’ ‘lyndon’ ‘fec’]

[‘grande’ ‘solomon’ ‘yates’ ‘warmus’ ‘grammer’ ‘pacific’ ‘gunter’ ‘clem’
‘icc’ ‘rating’ ‘cool’ ‘rio’ ‘nbc’ ‘drink’ ‘cbs’ ‘bird’ ‘nuys’
‘williamson’ ‘detective’ ‘bloomberg’]

[‘poindexter’ ‘bloom’ ‘videotape’ ‘webb’ ‘immunized’ ‘boesky’ ‘gesell’
‘hakim’ ‘secord’ ‘deposition’ ‘dixon’ ‘walsh’ ‘milken’ ‘transcript’
‘immunity’ ‘illegally’ ‘ruby’ ‘independently’ ‘pleading’ ‘mundy’]

[‘yen’ ‘industrials’ ‘tally’ ‘volume’ ‘dealer’ ‘unchanged’ ‘volvo’ ‘nyse’
‘mood’ ‘quarterly’ ‘listed’ ‘outnumbered’ ‘index’ ‘dividend’ ‘nikkei’
‘bid’ ‘laurentiis’ ‘share’ ‘triple’ ‘compared’]

[‘nest’ ‘nbc’ ‘turtle’ ‘nebinger’ ‘nesting’ ‘hose’ ‘aircraft’ ‘student’
‘mcdonnell’ ‘component’ ‘skin’ ‘australian’ ‘pauley’ ‘valdez’ ‘turkey’
‘efficient’ ‘encouraging’ ‘auto’ ‘maturity’ ‘rajneesh’]

[‘china’ ‘jail’ ‘benson’ ‘asia’ ‘contract’ ‘charity’ ‘editor’ ‘funding’
‘guideline’ ‘strike’ ‘holderman’ ‘intelligence’ ‘chinese’ ‘beijing’
‘cambodia’ ‘protesting’ ‘permission’ ‘greyhound’ ‘archbishop’ ‘local’]

[‘olympic’ ‘seoul’ ‘game’ ‘torch’ ‘cardinal’ ‘stadium’ ‘bofill’ ‘tunnel’
‘procession’ ‘viett’ ‘mcdermott’ ‘soccer’ ‘korea’ ‘korean’ ‘boycott’
‘athletic’ ‘albrecht’ ‘harmon’ ‘showcase’ ‘medal’]

[‘unchanged’ ‘quarterly’ ‘adjusted’ ‘revised’ ‘seasonally’ ‘decline’
‘smithkline’ ‘industrials’ ‘pharmaceutical’ ‘reduced’ ‘mortgage’ ‘net’
‘earnings’ ‘month’ ‘previous’ ‘imf’ ‘plc’ ‘jumped’ ‘nyse’ ‘sale’]

[‘yen’ ‘franc’ ‘swiss’ ‘zurich’ ‘bullion’ ‘ounce’ ‘gold’ ‘troy’ ‘bid’
‘kong’ ‘hong’ ‘tokyo’ ‘italian’ ‘dealer’ ‘london’ ‘milan’ ‘late’
‘canadian’ ‘dutch’ ‘silver’]

[‘asylum’ ‘expelled’ ‘cuban’ ‘halfway’ ‘mariel’ ‘benedict’ ‘claude’
‘extradition’ ‘riot’ ‘ferdinand’ ‘detention’ ‘payoff’ ‘plotting’ ‘anc’
‘bordallo’ ‘refugee’ ‘bribery’ ‘corazon’ ‘frame’ ‘arrested’]

[‘storm’ ‘cuito’ ‘avalanche’ ‘india’ ‘timber’ ‘tamil’ ‘mph’ ‘hindu’
‘shell’ ‘wind’ ‘patrol’ ‘offensive’ ‘boat’ ‘blew’ ‘evacuated’ ‘angolan’
‘angola’ ‘barrage’ ‘peninsula’ ‘yard’]

[‘acquire’ ‘acquisition’ ‘koppers’ ‘ual’ ‘plane’ ‘crashed’ ‘ussr’
‘championship’ ‘coniston’ ‘pilot’ ‘subsidiary’ ‘ltd’ ‘chris’
‘shareholder’ ‘blaze’ ‘purchase’ ‘elevator’ ‘lorimar’ ‘boeing’ ‘ton’]

[‘crashed’ ‘publisher’ ‘plane’ ‘spraying’ ‘charity’ ‘mediterranean’
‘asylum’ ‘arrive’ ‘rca’ ‘czechoslovakia’ ‘lamp’ ‘pilot’ ‘write’ ‘refugee’
‘irishman’ ‘hospital’ ‘prize’ ‘missing’ ‘marry’ ‘fluid’]

[‘matching’ ‘irs’ ‘gibbs’ ‘pesticide’ ‘corporation’ ‘bender’ ‘mckay’
‘barnard’ ‘endangered’ ‘meese’ ‘silverado’ ‘postal’ ‘ames’ ‘habitat’
‘wildlife’ ‘dividend’ ‘verify’ ‘commissioner’ ‘miller’ ‘totaling’]

[‘index’ ‘stock’ ‘nyse’ ‘market’ ‘composite’ ‘trading’ ‘dow’ ‘listed’
‘industrials’ ‘rsqb’ ‘volume’ ‘exchange’ ‘nitrogen’ ‘broker’ ‘lsqb’
‘superconducting’ ‘outnumbered’ ‘zero’ ‘airline’ ‘video’]

[‘route’ ‘banker’ ‘airline’ ‘authorized’ ‘import’ ‘delta’ ‘borer’ ‘spill’
‘carrier’ ‘laurentiis’ ‘qatar’ ‘exporter’ ‘air’ ‘maker’ ‘uta’ ‘vaccine’
‘operate’ ‘backup’ ‘cftc’ ‘purchase’]

[‘robb’ ‘boschwitz’ ‘wellstone’ ‘jesse’ ‘sofaer’ ‘hart’ ‘ratification’
‘treaty’ ‘governor’ ‘clinton’ ‘tribune’ ‘dawkins’ ‘bentsen’ ‘abm’
‘lowell’ ‘fec’ ‘senator’ ‘carson’ ‘durenberger’ ‘supporter’]

[‘coup’ ‘avril’ ‘uprising’ ‘haitian’ ‘expelled’ ‘enrile’ ‘township’
‘arrested’ ‘haiti’ ‘arrest’ ‘namphy’ ‘kashmir’ ‘fighter’ ‘laurel’
‘moslem’ ‘inkatha’ ‘shining’ ‘col’ ‘ireland’ ‘pakistan’]

nice! no idea what’s going on but here are some things you could try.

  • use softplus instead of exp to enforce positivity
  • use ClippedAdam with smallish clip_norm (e.g. 1.0) to combat big gradients
  • use poutine.scale to scale the model/guide so that the scale of the ELBO is O(1) instead of some large number
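
A quick illustration of the first suggestion, in plain PyTorch with made-up inputs: both exp and softplus map an unconstrained value to a positive one, but exp overflows and has a huge gradient for large inputs, while softplus grows roughly linearly and its gradient stays at or below 1.

```python
import torch
import torch.nn.functional as F

raw = torch.tensor([-5.0, 0.0, 5.0, 100.0])  # hypothetical unconstrained network outputs

via_exp = raw.exp()            # overflows to inf for the largest float32 input
via_softplus = F.softplus(raw) # stays finite and strictly positive

# Compare the gradients at a moderately large input.
x = torch.tensor(30.0, requires_grad=True)
torch.exp(x).backward()
grad_exp = x.grad.clone()
x.grad.zero_()
F.softplus(x).backward()
grad_softplus = x.grad.clone()
```

Large gradients like grad_exp are one plausible source of occasional NaN losses, which is also what gradient clipping is meant to guard against.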

also do you know why the scale of your pyro/pytorch losses differs by a factor of more than 10 (?)

I didn’t forget you :slight_smile: … these past days have been pretty hectic at work, but tomorrow/during weekend I will have time to implement your suggestions, answer your question & finish the tutorial! Cheers!

@martinjankowiak, thanks again… answering your questions:

  1. Using softplus and even sigmoid activation worked to enforce positivity!
  2. ClippedAdam makes the topics collapse: it generates the same top words for all topics
  3. Don’t know how to use poutine.scale, and actually it wasn’t needed…

Now, to the most important part: I fixed the loss formula so both approaches would compute it the same way. However, the loss of the ProdLDA port to Pyro is still 10x higher than that of the pure PyTorch implementation. I tried lots of things, but could not get the Pyro port to improve.

Eyeballing the top words on each topic, although the topics generated from port to Pyro version make some sense, they are still worse than the pure PyTorch implementation. Don’t know why this is happening.

But most important: the topics generated by this implementation of 2017’s ProdLDA are not as good as the ones generated by the basic LDA with Mean Field approximation from Blei/Ng/Jordan’s seminal paper from 2003! So, following the thought discussed above with @eb8680_2 (i.e. Pyro must have a nice LDA tutorial to support the universal in “deep universal probabilistic programming”), I will change the implementation to the mean field variational inference (the seminal paper from 2003), and hopefully it will work. It will be a more introductory tutorial, but will fulfill its purpose…

Cheers

@carlossouza have you tried using pyro.infer.TraceMeanField_ELBO in your ProdLDA example rather than Trace_ELBO to compute analytic KL divergence terms? That’s the biggest difference I see between your Pyro and PyTorch implementations after a first glance.
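
To illustrate the difference, here is a small sketch in plain PyTorch: a Monte Carlo estimate of a KL term (the kind of estimate a sampling-based ELBO relies on) is noisy, while the analytic KL between the same two distributions is exact. The two Normal distributions are arbitrary stand-ins, not the ProdLDA prior and posterior.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
q = Normal(torch.tensor(1.0), torch.tensor(0.5))  # hypothetical variational posterior
p = Normal(torch.tensor(0.0), torch.tensor(1.0))  # hypothetical prior

# Closed-form KL(q || p), available for this pair of distributions.
analytic_kl = kl_divergence(q, p)

def mc_kl(num_samples):
    # Monte Carlo estimate: average of log q(z) - log p(z) over samples z ~ q.
    z = q.sample((num_samples,))
    return (q.log_prob(z) - p.log_prob(z)).mean()

# Single-sample estimates fluctuate from draw to draw; their spread is the extra noise.
single_draws = torch.stack([mc_kl(1) for _ in range(1000)])
large_sample_estimate = mc_kl(200_000)
```

With many samples the estimate approaches analytic_kl, but a single-sample estimate per training step adds variance to the gradient, which is the variance the analytic KL terms remove.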

I think this is the change that improves the results the most… I’m not sure because the loss numbers are all about the same… but the topics look more coherent in ProdLDA after this change!
Thanks

Great! I believe the losses are different because of the lack of normalizing constants in your PyTorch likelihood, as @fehiepsi mentioned on your other thread.
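
A small sketch of what a missing normalizing constant looks like, assuming the likelihood is a multinomial over word counts (the counts and probabilities below are made up). A handcrafted loss of the form sum(counts * log probs) drops the log multinomial coefficient, which the distribution's log_prob includes.

```python
import torch
from torch.distributions import Multinomial

counts = torch.tensor([3.0, 0.0, 2.0, 5.0])  # hypothetical bag-of-words counts
probs = torch.tensor([0.4, 0.1, 0.2, 0.3])   # hypothetical word probabilities
n = int(counts.sum())

# Full log-likelihood, including the normalizing constant.
full_log_lik = Multinomial(n, probs=probs).log_prob(counts)

# Handcrafted version that keeps only the parameter-dependent part.
unnormalized = (counts * probs.log()).sum()

# The gap is the log multinomial coefficient: log(n!) - sum(log(count_i!)).
log_coeff = torch.lgamma(torch.tensor(n + 1.0)) - torch.lgamma(counts + 1.0).sum()
```

The constant depends only on the data, not on the parameters, so it shifts the reported loss without changing the gradients or the learned topics.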