Thursday, August 24, 2017

ECLSA 12th Annual Conference: Benjamin Liebman--"Mass Digitization of Chinese Court Decisions: How to Use Text as Data in the Field of Chinese Law."

The 12th Annual Conference of the European China Law Studies Association is being held now, hosted with the support of the University of Leiden and its Faculty of Law and the Leiden Institute for Area Studies (more information here and here).

A highlight of the conference this year was the opening plenary address delivered by Benjamin Liebman. This post provides my quick notes of that opening plenary; it is meant to be read as an invitation to read the paper from which the remarks were drawn (Liebman, Benjamin L. and Roberts, Margaret and Stern, Rachel E. and Wang, Alice Z., Mass Digitization of Chinese Court Decisions: How to Use Text as Data in the Field of Chinese Law (June 13, 2017). 21st Century China Center Research Paper No. 2017-01; Columbia Public Law Research Paper No. 14-551).  

Paper Abstract

Over the past five years, Chinese courts have placed tens of millions of court judgments online. We analyze the promise and pitfalls of using this remarkable new data source through the construction and examination of a dataset of 1,058,990 documents from Henan province. Courts posted judgments in roughly half of all cases in 2014 and, although the percent of cases posted online has likely risen since then, the single greatest challenge facing researchers remains documenting gaps in the data. We find that missing data varies widely by court, and that intermediate courts disclose significantly more documents than basic level courts. But court level, GDP per capita, population, and mediation rates are insufficient fully to explain variation in disclosure rates. Further work is needed to better understand how resources and incentives might be skewing the data. Despite incomplete information, however, a topic model of 20,321 administrative court judgments demonstrates how mass digitization of court decisions opens a new window into the practice of everyday law in China. Unsupervised machine learning combined with close reading of selected cases reveals surprising trends in administrative disputes as well as important research questions. Taken together, our findings suggest a need for humility and methodological pluralism among scholars seeking to use large-scale data from Chinese courts. The vast amount of incomplete data now available may frustrate attempts to find quick answers to existing questions, but the data excel at opening new pathways for research and at adding nuance to existing assumptions about the role of courts in Chinese society.
Notes of opening plenary address:
The digitalization of Chinese court cases has become enormous over the last few years. More than 32 million cases are now online. The initial steps were taken in 2009, and publication has been mandatory since 2014. The rules on publication are a work in progress with constant changes. There are also other and older court data sets. Computational social science has turned its attention to the study of authoritarian regimes. It is still pretty rare to see the use of computational social science techniques to study legal texts. But this may change.

The initial study focused on the cases from Henan Province. Their data set included 1,058,990 documents. Three questions: (1) what is there; (2) how should we as scholars begin to use these databases; and (3) how is the availability of big data affecting the study of Chinese institutions and, relatedly, how they are functioning? Initial question--why would the Chinese embrace this sort of transparency? It is new, with rapid changes from five years ago. This is an unusual move even when measured against democratic and civil law states. There was limited availability in the 1990s through the 2000s. The roots were in Henan Province as a consequence of reform that was galvanized by high profile wrongful conviction scandals. To this was added the importance of emerging cultures of rapid marketization of the field (another source of research: the scope and nature of these markets for legal or case information) and some experimentation with artificial intelligence (AI). The goals: curb wrongdoing, increase standardization, and raise the status of courts. Mass digitization versus transparency is an important element here. They prefer the former to the latter term.

Data limitations: They provide only a partial window. Only final results are published. The online platforms are unstable (frequent changes, inconsistent formatting, and inconsistent results on searches). Some cases disappear. Commercial sites often have more cases. Frequently changing rules are sometimes inconsistently applied. Examples include first instance decisions (should they be included or not? The issue is whether these were final decisions; answers changed after 2014; after 2016 only final decisions are made public) and divorce. Lastly, there may be bias in the data.

The Henan data set--scraped straight from the Henan High Court website (on data scraping versus crawling HERE). Now decisions are tied to SPC websites and the Henan High Court no longer posts. A natural language processing script was used to sort and analyze key information in decisions. Why choose Henan? Henan was an early adopter and therefore an important baseline for later changes. It is representative of China (their "Middle America"). It was possible to combine analysis of data with qualitative work. And they were able to explore differences across 184 Henan courts. The numbers are available in the paper (link provided above).

What is missing? Existing scholarship largely overlooks what is missing. Courts at the time (2014) on average put 41% of their docket online. There is a wide variation in practice (high: 83%; low: 15%). That wide variation is important, certainly with respect to the bias that may be produced in insights drawn from the data. It was also surprising in light of the assumption that there would be uniform and substantial compliance with rules. That was not the case. This may be a function of variations in judicial resources. Basic level courts tend to find compliance harder than intermediate level courts, for example. There are indications that compliance rates are much higher after 2016. That needs to be tested.
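
As a rough illustration of the arithmetic behind per-court disclosure rates like those above, here is a minimal sketch in Python. The court names and counts are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch assumes a disclosure rate is simply documents posted divided by cases on the docket. The notes do not say whether the reported average is weighted by docket size, so both versions are shown.

```python
# Hypothetical figures for illustration only; not taken from the paper.
courts = {
    "Court A": {"posted": 830, "docket": 1000},
    "Court B": {"posted": 300, "docket": 2000},
    "Court C": {"posted": 200, "docket": 500},
}


def disclosure_rate(posted, docket):
    """Share of a court's docket that was posted online."""
    if docket <= 0:
        raise ValueError("docket must be positive")
    return posted / docket


# Per-court disclosure rates.
rates = {
    name: disclosure_rate(c["posted"], c["docket"])
    for name, c in courts.items()
}

# Unweighted average: every court counts equally, whatever its caseload.
average_rate = sum(rates.values()) / len(rates)

# Pooled rate: all posted documents over all cases, so large courts dominate.
pooled_rate = (
    sum(c["posted"] for c in courts.values())
    / sum(c["docket"] for c in courts.values())
)

highest = max(rates, key=rates.get)
lowest = min(rates, key=rates.get)

for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.0%}")
print(f"unweighted average: {average_rate:.0%}; pooled: {pooled_rate:.0%}")
```

The gap between the unweighted and pooled figures is one reason wide variation across courts matters: a single summary number can hide which courts, and how many cases, are actually missing.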


Notice what was also missing: mediated cases; roughly 25% of civil cases were mediated in 2014. There are a number of other caveats that were worth highlighting. These included: withdrawals and enforcement cases might not be online; resource bias (GDP per capita and population not statistically significant); incentive bias (the role of court leaders may be significant); over compliance? (marriage, multiple documents for one case, and appeals).

Additional research is needed. Some conclusions and guidance for the future: humility and caution in using the data. Finding and using good data is harder than it seems; it may be time to avoid the huge studies and instead look for pockets of good data. There may be more and better standardization over time.

Court decisions as data: they provide an example from administrative litigation. Looked at 31,710 administrative decisions (administrative lawsuits and non-litigation enforcement). In the future they will differentiate between private efforts to enforce and enforcement actions brought by officials. Administrative litigation as a key indicator of accountability. Official data are of limited utility; prior scholarship was limited to small sample sizes. What does a larger sample size tell us?

Methodology: Topic modelling. Used the Structural Topic Model package (Roberts et al. 2014; Roberts 2016) to estimate 50 topics. The algorithm uses unsupervised machine learning to identify patterns of text that are likely to appear together. 50 is totally arbitrary. All documents are mixtures of multiple topics. The model estimates topic proportions for each document. Reviewed highest frequency words for each topic and most representative documents, tagged manually. Then created a topic model of the findings. It is meant to show both frequency of words and their interrelations.
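
To make the review step a little more concrete, here is a minimal sketch in Python of how one might read the output of an already fitted topic model: list the highest-probability words for each topic and the documents in which that topic has the largest share. This is not the Structural Topic Model package (which is an R package) and it does no fitting. The two small matrices, the English placeholder words, and the document IDs are all hypothetical stand-ins for what a fitted model would return.

```python
# Hypothetical fitted-model output, for illustration only.
vocab = ["land", "compensation", "demolition", "fine", "penalty", "police"]

# topic_word[k][v]: probability of word v under topic k (each row sums to 1).
topic_word = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],  # reads like a land-related topic
    [0.03, 0.04, 0.03, 0.45, 0.35, 0.10],  # reads like a fines topic
]

# doc_topic[d][k]: share of document d attributed to topic k
# (each row sums to 1, since every document is a mixture of topics).
doc_topic = {
    "doc-001": [0.90, 0.10],
    "doc-002": [0.25, 0.75],
    "doc-003": [0.55, 0.45],
}


def top_words(k, n=3):
    """Return the n highest-probability words for topic k."""
    ranked = sorted(zip(vocab, topic_word[k]), key=lambda p: p[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def most_representative(k, n=2):
    """Return the n documents with the largest share of topic k."""
    ranked = sorted(doc_topic, key=lambda d: doc_topic[d][k], reverse=True)
    return ranked[:n]


for k in range(len(topic_word)):
    print(f"topic {k}: words={top_words(k)} docs={most_representative(k)}")
```

A human reader would then open the listed documents and attach a label to each topic by hand, which is the manual tagging step described in the notes.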


Interesting findings: administrative litigation and enforcement actions are common where agencies are going to court to enforce fines and courts are rejecting agency applications. There is a huge range of land related topics (23 of the topics and 35% of words), including administrative litigation and enforcement. What the data reveal is that sometimes cases that appear unrelated to land are land related; examples include OGI (open government information) and suits against the police. Official statistics then are likely to underestimate the volume of land related cases. Main takeaway--what outsiders might consider classic land cases do not show up as a large portion of the data set; this masks the way that other forms of action are really land related. Thus categorization sometimes inhibits clarity in analysis of underlying causes of action.

Other takeaways: There appears to be a practice of using administrative litigation in aid of related civil suits (especially medical malpractice and labor disputes). There are a lot of withdrawals but there are also a lot of routine things going on. Sometimes it is possible to over-read the significance of actions. There ought to be some more caution in conclusions from data. It is now also possible to see which agencies tend to be common litigants and which tend to avoid the courts. Weak agencies tend to find themselves in court more often--perhaps because they need the aid of the courts more often to augment their authority.

Topic modelling may be a better tool for generating questions than for producing answers. Why do litigants bring identical cases together (case strings)? What is missing: environmental cases and business cases. Businesses are not common litigants; where are they going to resolve disputes? One can unveil trends but perhaps not explain them yet.

Future work: additional machine learning; updating with SPC data; deepening understanding of what is missing; expanding nationally and over time; and marketplace for legal information. This is really time consuming work.  Long term the marketization of data produces a question about the role of academics within data analytics: perhaps design and interpretation, perhaps the production of algorithms. The SPC has launched companies and the market remains illegal for the moment but things change. 

Conclusion: cross-disciplinary value of this sort of research, but there is a need for multidisciplinary approaches. There is a lot of data that is missing. Lots of data doesn't mean complete data sets. Great insights into everyday law but only from one view. Caution and humility are necessary. Question: how does digitization change the practice of law? They suggest large and fundamental changes in the practice of law. From data to AI in the implementation of the legal system. Does this pose issues that are normative? A vast new area with tremendous challenges and opportunities.
